Mapping secondary data gaps for social simulation modelling: A case study of Syrian asylum migration to Europe

Simulation models of social processes may require data that are not readily available, have low accuracy, are incomplete or biased. The paper presents a formal process for collating, assessing, selecting, and using secondary data as part of creating, validating, and documenting an agent-based simulation model of a complex social process, in this case, asylum migration to Europe. The process starts by creating an inventory of data sources, and the associated metadata, followed by assessing different aspects of data quality according to pre-defined criteria. As a result, based on the typology of available data, we are able to produce a thematic map of the area under study, and assess the uncertainty of key data sources, at least qualitatively. We illustrate the process by looking at the data on Syrian migration to Europe in 2011–21. In parallel, successive stages of the development of a simulation model allow for identifying key types of information which are needed as input into empirically grounded modelling analysis. Juxtaposing the available evidence and model requirements allows for identifying knowledge gaps that need filling, preferably by collecting additional primary data, or, failing that, by carrying out a sensitivity analysis for the assumptions made. By doing so, we offer a way of formalising the data collection process in the context of model-building endeavours, while allowing the modelling to be predominantly question-driven rather than purely data-driven. The paper concludes with recommendations with respect to data and evidence, both for modellers, as well as model users in practice-oriented applications.


Introduction
The aim of this brief report is to propose and reflect on an approach for collating, assessing, and selecting appropriate secondary data for use in an agent-based simulation model of a complex social process. The discussion is illustrated by a case study related to creating, validating, and documenting a model of migration route formation. Our intended contribution is to propose a template for critical assessment of secondary social data and for identifying key knowledge gaps, while remaining honest about the overall uncertainty of social simulation models, and the particular role of data uncertainty in it.
The work is motivated by the need to provide more realistic and reliable input to meet the ever-increasing policy demand for a better understanding of the patterns and drivers of migration. Our particular focus is on the rapidly-evolving asylum flows, where the relevant policies can be aimed at improving the preparedness of the transit and destination countries. The discussion is illustrated with an example of modelling Syrian asylum migration to Europe in the 2010s, with the modelling process itself presented in more detail in Bijak et al. (2021). At the same time, in social simulation studies, data quality aspects rarely receive the attention they deserve.
This paper is structured as follows. First, in the Methods section, we discuss the interplay and tensions between the supply of available secondary data and the demand for such information to meet the specific needs and research objectives of social simulation models. We focus here particularly on the ways to identify the key information gaps. Subsequently, in the Results section, we present an example of data for a simulation model of migration route formation, based on recent Syrian asylum migration to Europe. The final section concludes with a discussion of key remaining challenges, possible solutions, and practical recommendations for model builders and users.

Methods: Assessing data supply and demand
In empirically-grounded social modelling, a systematic review of the knowledge base typically begins with an assessment of the available secondary information on the topic in question. The foundations for that can be laid by formally creating an inventory of data supply, as complete as possible, which aims to assemble key information and meta-information about the different types of data available, the key themes they refer to, and the sources they come from. Such an inventory should ideally include meta-information about the classification of data according to a range of characteristics, such as whether the data source is qualitative or quantitative, whether it is a survey, census, register, observation, or interview, whether the level of aggregation is micro (individuals) or macro (groups), and so on. This meta-information can be either taken from the sources of the data themselves or imputed by analysts.
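As an illustration only, one entry of such an inventory could be recorded as a small structured record. The sketch below is a minimal one under our own assumptions: the field names, the two example entries, and the helper count_by are all hypothetical and are not taken from the published inventory.

```python
from dataclasses import dataclass


@dataclass
class InventoryEntry:
    """One row of a data inventory; all field names are illustrative."""
    name: str
    provider: str
    theme: str             # e.g. "routes and journeys"
    data_type: str         # "quantitative" or "qualitative"
    collection_mode: str   # "survey", "census", "register", "observation", "interview"
    aggregation: str       # "micro" (individuals) or "macro" (groups)
    # "source" if the meta-information comes from the data documentation,
    # "imputed" if it was assigned by the analysts.
    metadata_origin: str = "source"
    notes: str = ""


# Two invented entries, used only to show the structure.
inventory = [
    InventoryEntry("Hypothetical arrivals register", "Hypothetical agency",
                   "routes and journeys", "quantitative", "register", "macro"),
    InventoryEntry("Hypothetical migrant interviews", "Hypothetical study",
                   "information", "qualitative", "interview", "micro",
                   metadata_origin="imputed"),
]


def count_by(entries, attribute):
    """Count inventory entries by the value of one classification attribute."""
    counts = {}
    for entry in entries:
        key = getattr(entry, attribute)
        counts[key] = counts.get(key, 0) + 1
    return counts


print(count_by(inventory, "aggregation"))
```

Keeping the origin of each piece of meta-information explicit (taken from the source or imputed) makes the classification decisions easier to document and revisit later.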
In the example concerning Syrian asylum migration to Europe in 2011–21, we considered all publicly available datasets that were freely accessible online and could potentially provide useful input into the simulation modelling of cross-Mediterranean migration routes. In particular, for data on migration flows and the context in which they occur, we included a range of databases and online registers provided by international and national organisations, such as UNHCR, Eurostat, Frontex, or national statistical offices, as well as those resulting from individual research studies, purposefully searched via Google Scholar. The meta-data collection took place in the summer of 2019, and was updated in the autumn of 2021, initially yielding 28 directly-relevant sources (after update increasing to 32), as well as 20 auxiliary ones, listed in Bijak et al. (2021).
The analysis of the limitations of the datasets, including of any biases, or other sources of uncertainty and errors, formed an inherent part of the next steps in evaluating the supply of available data. These steps consisted in establishing a set of quality assessment criteria suitable for the problem at hand. As data quality is a multidimensional construct, with different aspects having varying importance for different users and application areas, the assessment needs to be multidimensional, too. A simple and intuitive quality assessment scheme can use a variation of a traffic-lights approach (green for good data, amber for data with some problems, and red for poor data) to assess each quality dimension according to professional judgement (for related examples, see GAO, 2006, or Vogel and Kovacheva, 2008, and for a recent overview of issues with migration statistics generally, see Kraler and Reichel, 2022). Key aspects at initial stages of assembling a data inventory include a comprehensive documentation of all sources and classification decisions, as well as the rationale for data assessment.
In the example of data for modelling Syrian asylum migration, we used a five-point scale inspired by the traffic-lights approach, with green-amber and amber-red as interim points. We use one threshold criterion, to what extent the data may be suitable for the problem at hand, followed by six specific criteria. Three of them concern the data generation process as such (its timeliness, the level of data disaggregation, as well as population coverage and adherence to definitions), two criteria refer to the level of trust in data (trustworthiness of the source and transparency of documentation), and one either to completeness (for register-based sources) or sample design (for surveys and other sample-based studies). Finally, we also provide a summary score averaging across the available quality dimensions (Bijak et al., 2021).
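To make the idea of a summary score concrete, the following minimal sketch averages ratings across the dimensions that apply to a given source. The numeric coding of the five-point scale (1 for green to 5 for red), the dimension names, and the example ratings are our assumptions for illustration; the text above does not specify how the averaging was implemented.

```python
from statistics import mean

# Assumed numeric coding of the five-point scale (1 = best, 5 = worst).
SCALE = {"green": 1, "green-amber": 2, "amber": 3, "amber-red": 4, "red": 5}
LABELS = {value: label for label, value in SCALE.items()}


def summary_score(ratings):
    """Average over rated dimensions; None marks a dimension that does not apply."""
    values = [SCALE[r] for r in ratings.values() if r is not None]
    if not values:
        raise ValueError("no rated dimensions")
    return mean(values)


def nearest_label(score):
    """Map a numeric score back to the closest scale label.

    Exact ties (e.g. 2.5) resolve to the better of the two ratings,
    which is an arbitrary choice made for this sketch.
    """
    return LABELS[min(LABELS, key=lambda value: abs(value - score))]


# Invented ratings for a hypothetical survey-based source:
# completeness applies to registers only, so it is left unrated here.
example = {
    "timeliness": "amber-red",
    "disaggregation": "amber",
    "coverage_and_definitions": "green-amber",
    "trustworthiness": "green",
    "transparency": "green",
    "completeness": None,
    "sample_design": "amber",
}

score = summary_score(example)
print(round(score, 2), nearest_label(score))
```

Averaging only over the rated dimensions means that register-based and sample-based sources can be compared on the same scale even though one of their criteria differs.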
The analysis of metadata and data quality dimensions can already shed light on the real extent of data supply potentially usable for modelling purposes. A summary of themes covered by different types of available data, filtered by those with a positive assessment of the relevant quality aspects, determines the actual, rather than potential, data supply for the problem at hand. For example, limiting the analysis to the theme of migrant journeys and considering only the sources falling between the green and amber categories overall, with a green rating for transparency and trustworthiness, already considerably limits the number of sources that can be used in the subsequent analysis. This exercise also helps illuminate the data gaps that would ideally need to be covered to meet the needs of modelling, if it is to be empirically grounded.
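The filtering step described above can be sketched as a simple query over the inventory. The source records, their ratings, and the function names below are invented for illustration; only the filtering rule (a given theme, an overall rating between green and amber, and green ratings for transparency and trustworthiness) follows the example in the text.

```python
# Ordered from best to worst, matching the five-point scale.
ORDER = ["green", "green-amber", "amber", "amber-red", "red"]


def at_least(rating, threshold):
    """True if `rating` is as good as, or better than, `threshold`."""
    return ORDER.index(rating) <= ORDER.index(threshold)


# Hypothetical inventory records with invented ratings.
sources = [
    {"name": "Source A", "theme": "journeys", "overall": "green-amber",
     "transparency": "green", "trustworthiness": "green"},
    {"name": "Source B", "theme": "journeys", "overall": "amber",
     "transparency": "amber", "trustworthiness": "green"},
    {"name": "Source C", "theme": "populations", "overall": "green",
     "transparency": "green", "trustworthiness": "green"},
    {"name": "Source D", "theme": "journeys", "overall": "amber-red",
     "transparency": "green", "trustworthiness": "green"},
]


def usable_sources(records, theme, overall_threshold="amber",
                   strict_dimensions=("transparency", "trustworthiness")):
    """Keep sources on one theme that pass the overall and strict criteria."""
    return [
        r for r in records
        if r["theme"] == theme
        and at_least(r["overall"], overall_threshold)
        and all(r[d] == "green" for d in strict_dimensions)
    ]


actual_supply = usable_sources(sources, "journeys")
print([r["name"] for r in actual_supply])
```

In this toy inventory, three of the four sources cover journeys, but only one of them passes all of the quality filters, which mirrors how the actual supply can be much smaller than the potential one.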
The other side of the process is related to the assessment of the demand for secondary data to be used in a simulation model. There are two main aspects of this. First, an analysis of data needs, initially at the conceptual level, can provide a rough idea of what information would ideally be required for the model to have a full empirical basis. Clearly, it is not possible to expect data on all aspects of the model, but this exercise already gives a first approximation of the areas in which empirical grounding would ideally be needed. It also helps reflect on how the data can be operationalised (qualitatively or quantitatively, and through which variables) and how they can be used in the modelling process (whether to calibrate model parameters or other inputs, or to externally validate model outputs).
Once the initial assessment and modelling has taken place, the second step in assessing the demand for data can consist of an iterative process for identifying the data gaps that can be filled. From the point of view of the modelling process, the ultimate aim is either filling these gaps with secondary data, if available, or through a dedicated collection of primary data on a specific topic. The analytical mechanism which allows this involves sensitivity analysis, especially in the statistical sense, which identifies the model inputs that contribute the most to the model results (Oakley & O'Hagan, 2004). Such inputs are primary targets for additional data collection. This process can be aided by formal modelling of the provenance (origins) of different data sources used in a model, and the relationships between the data and different modules and aspects of the model, identifying the data sources, on which many other elements of the modelled social reality are dependent (Reinhardt et al., 2023).
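The logic of ranking model inputs by their contribution to output variability can be illustrated with a crude Monte Carlo estimate of first-order (main-effect) sensitivity indices, that is, the share of output variance explained by each input on its own. This is a simplified stand-in chosen for brevity, not the method of the cited work, and the toy model, its weights, and all names below are invented for illustration.

```python
import random
from statistics import mean, pvariance


def toy_model(information, risk, cost):
    """Hypothetical stand-in for a simulation output; weights are invented."""
    return 4.0 * information + 1.0 * risk + 0.1 * cost


def first_order_indices(model, names, n_samples=20000, n_bins=20, seed=1):
    """Estimate Var(E[Y | X_i]) / Var(Y) for each input by binning.

    Inputs are sampled independently from Uniform(0, 1). For each input,
    outputs are grouped into equal-width bins of that input, and the
    variance of the bin means is compared with the total output variance.
    """
    rng = random.Random(seed)
    xs = [[rng.random() for _ in names] for _ in range(n_samples)]
    ys = [model(*x) for x in xs]
    total_variance = pvariance(ys)
    overall_mean = mean(ys)
    indices = {}
    for j, name in enumerate(names):
        bins = [[] for _ in range(n_bins)]
        for x, y in zip(xs, ys):
            bins[min(int(x[j] * n_bins), n_bins - 1)].append(y)
        between = sum(len(b) * (mean(b) - overall_mean) ** 2
                      for b in bins if b) / n_samples
        indices[name] = between / total_variance
    return indices


indices = first_order_indices(toy_model, ["information", "risk", "cost"])
ranking = sorted(indices, key=indices.get, reverse=True)
print(ranking)
```

In this toy setting, the input with the largest index would be the first candidate for additional data collection; for an expensive simulation model, the same quantity would normally be estimated with far fewer model runs, for example via a statistical emulator.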
The identification of data gaps followed by collection of additional information can proceed in an iterative manner to reduce the gap between the demand for data and the available supply, while expanding the latter through addition of new, dedicated datasets. In the end, however, there are always some aspects of a model which cannot be calibrated to data, and some free parameters, for which empirical grounding is not possible. They are the key sources of residual uncertainty, which is an important indicator of model performance in its own right, and thus merits a separate analysis (Bijak et al., 2021).

Results: Data on asylum migration from Syria
In the presented example, we aim to study a migration process modelled on asylum flows from Syria to Europe during the 2010s. We use an agent-based approach, simulating individual migrants (agents), their journeys and decisions. The decisions are based on a range of factors known to the agents, but are made under uncertainty, resulting from incomplete information about the world, as well as knowledge and decisions of other agents (for details, see Hinsch & Bijak, 2023). The creation of the data inventory therefore had to encompass both individual-level data on Syrian asylum seekers and factors affecting their decisions, as well as macro-level information about the variables that could enter into the modelling process, either as input to decisions, or as a way of calibrating the model results. The inventory, covering 32 entries and including metadata for individual data sources, as well as their quality assessment, is available in Nurse and Bijak (2023) and in a searchable form from Nurse and Bijak (2021).
The data included in the inventory include both quantitative and qualitative sources that either pertain to the process of migrant journeys or the relevant context in which the journeys took place. The data could relate to individuals (micro-level), or be available at the aggregate, population levels (macro). The inventory is purpose-specific and covers only those sources that were publicly available at the time and that were deemed potentially relevant for modelling purposes. For this reason, the inventory may not cover those sources of data on Syrian migration that were either unavailable for research purposes or were found to be unrelated to the modelling objectives. The sources that have been included typically clustered around four main themes: data on routes and journeys as such, including on migrant decisions; data on the migrant populations, typically in destination countries; data on availability of information during journeys; and key data on relevant policies. The thematic map of the case study data is presented in Figure 1, displaying a word cloud created from the descriptions of all sources included in the inventory.
A more detailed summary of the inventory is offered in Figure 2, which shows the frequencies (counts) of data sources included in the inventory. It is clear that the dominant groups of data sources covered migrant routes and journeys, as well as migrant populations in the destination countries. The data would typically come from either registers or surveys, routine or bespoke, closely followed by interviews with individual migrants. Most data included in the inventory are quantitative, related to the migration process as such, and available at the aggregate level, although with a visible minority of qualitative, contextual and micro-level data as well.
The quality assessment for the sources included in the inventory is summarised in Figure 3. By definition of the inventory, all sources included had to be at least partially relevant for the modelling purposes. A clear majority of sources have also scored highly on the transparency and trustworthiness dimensions of the evaluation, as well as, although to a slightly smaller extent, on adherence to the established concepts and definitions. The quality assessment for source-specific criteria (sample design for surveys and completeness for registers) was mixed, with various sources located in different places along the five-point scale. The availability and quality of disaggregated data (for example, by age or geographic detail) was typically average. The only more problematic dimension was related to timeliness: many data sources were one-off, being results of specific projects rather than established collection efforts. The overall quality rating indicated that the available data supply is potentially useful for modelling, especially after some corrections and adjustments.
From the demand side, both the conceptual analysis of the model and the formal statistical sensitivity analysis gave unequivocal results, identifying information as one of the primary factors shaping migrant journeys (Bijak et al., 2021; Hinsch & Bijak, 2023). At the same time, information remains one of the key data gaps and sources of model uncertainty: empirical data about information use by migrants and exchange between migrants are scarce. Usable micro-level data on information are limited to a single source: a comprehensive, albeit just one-off, German survey Flucht 2.0 (Emmer et al., 2016). In addition, modelling made use of aggregate estimates of the numbers of arrivals, border apprehensions, missing migrants and fatalities during journeys, originating from international organisations, such as UNHCR, International Organization for Migration, or Frontex (see Bijak et al., 2021). To correct for identified data quality issues, especially the biases (underestimation), the aggregate numbers had to be transformed into relative measures and annual rates of change.
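A minimal sketch of such a transformation is given below. It rests on an assumption that we introduce purely for illustration, namely that the undercount is a roughly constant proportion of the true numbers over time, in which case it cancels out of ratios between years. All counts and function names are invented.

```python
def annual_rates_of_change(counts_by_year):
    """Relative change between consecutive years: count[t] / count[t-1] - 1."""
    years = sorted(counts_by_year)
    rates = {}
    for previous, current in zip(years, years[1:]):
        # Skip gaps in the series and years with a zero base count.
        if current - previous != 1 or counts_by_year[previous] == 0:
            continue
        rates[current] = counts_by_year[current] / counts_by_year[previous] - 1.0
    return rates


def relative_to_base(counts_by_year, base_year):
    """Express each year's count as a multiple of the count in a base year."""
    base = counts_by_year[base_year]
    return {year: count / base for year, count in counts_by_year.items()}


# Invented numbers: the "observed" series captures half of the "true" one.
true_counts = {2014: 100, 2015: 400, 2016: 200}
observed_counts = {2014: 50, 2015: 200, 2016: 100}

print(annual_rates_of_change(observed_counts))
print(relative_to_base(observed_counts, 2014))
```

In this example the observed and true series produce identical rates of change, even though the observed levels are only half of the true ones. If the degree of undercount varied over time, the transformation would reduce but not remove the bias.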
Including empirical data enabled some reduction in the model uncertainty (Bijak et al., 2021). Still, bridging the data gap further would require additional collection of dedicated data, which in the example of modelling Syrian asylum migration has been attempted in follow-up work, both qualitative and quantitative (Belabbas et al., 2022; Bijak et al., 2023), and verifying the remaining modelling assumptions through an extensive sensitivity analysis. In future work, the model construction can also be revisited, in order to provide a closer match to the available empirical basis, and the process can be iterated for as long as there are any information gains and reduction in the model uncertainty.

Discussion and conclusions
In this note, we have described a process we propose for collating, assessing, selecting, and using secondary data for social simulation modelling purposes. The process, summarised in Figure 4, aims to reconcile the demand for data from the modelling side, with the available supply of reasonable-quality data that can be used to inform the model inputs or calibrate the outputs. It involves creating a dedicated data inventory, carrying out a quality assessment of the data, and analysis of data gaps through the lens of the modelling results. The identified gaps allow focusing on specific areas of further primary data collection and sensitivity analysis.
There are several practical implications for social simulation modellers with respect to the gathering of data and evidence for modelling. First, as summarised in Figure 4, the process of modelling enquiry needs to be driven by a specific research objective and purpose, and clearly acknowledge the limitations and uncertainty of empirical data. Second, data quality and importance in a given model need to be formally analysed, both conceptually and through statistical means. Third, to facilitate that, data need to be thoroughly documented, for example by using provenance standards (Reinhardt et al., 2023). Fourth, there are important trade-offs between the degree of empirical grounding and the level of detail of mechanistic realism of social simulation models: the more complicated the models, the fewer model elements can have reliable empirical basis. At the same time, any users of social simulation models, especially in policy and practice-related applications, need to be made explicitly aware of data-related limitations. The need to take data uncertainty formally and openly into account when modelling is paramount both for transparency and with respect to the limits of scientific knowledge. This is especially important for such complex and politically charged topics as migration.

The article outlines a carefully thought out approach to making sense of data that are of variable quality but carry high levels of political and operational significance. Of particular interest is the transparency the authors provide by reflecting on the types of data under review; the creation of categories that fall between the typical green, amber and red; the decision to prioritise high levels of trustworthiness and transparency among the various criteria. The article is clear about the limits of modelling in this area, which is also well received. For these reasons and more this succinct article is a fine contribution and will hopefully lead to further well-documented, sober reflections on data quality (etc.) in other areas of social science research.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Areas of research: migration, migration data, migration policy, integration.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 16 February 2024
https://doi.org/10.21956/openreseurope.16843.r37139
© 2024 Jahani A. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Alireza Jahani
Brunel University London, London, England, UK
The paper provides a detailed and well-structured methodology for collating, assessing, and selecting secondary data for social simulation modelling, with a specific focus on Syrian asylum migration. The paper begins by outlining the need for more realistic and reliable data to meet the increasing demand for policy-oriented insights into migration patterns. It emphasizes the importance of assessing data quality and identifying knowledge gaps in the context of modelling endeavours. The structured approach involves creating a comprehensive inventory of data sources, assessing data quality using predefined criteria, and mapping the thematic areas to visualize data uncertainty.
The methodology presented in the "Methods" section is clear and well-defined. The authors rightly highlight the multidimensional nature of data quality, covering aspects such as timeliness, level of aggregation, trustworthiness, and transparency. The use of a traffic-light approach and a summary score provides a practical and intuitive way to evaluate data quality. The inclusion of a case study on Syrian asylum migration enhances the paper's applicability and demonstrates the methodology in action. As an expert in the field who has modelled forced migration from the Syrian conflict before, I find the approach to be comprehensive, and the paper addresses a critical aspect often overlooked in social simulation studies: the quality and availability of secondary data. I remember we wanted to couple our model with a telecommunication dataset as secondary data and faced a lot of challenges.
In the "Results" section, I expected to see more details of the case study, with a focus on the data inventory, thematic mapping, and quality assessment of available data. However, I believe that the identification of data gaps and the iterative process of data collection are particularly valuable insights for researchers and practitioners alike.
The paper concludes with a thoughtful discussion on the challenges, solutions, and recommendations for social simulation model builders and users. It emphasizes the importance of a question-driven rather than purely data-driven approach, acknowledging the inherent uncertainty in social simulation models. The iterative process outlined in the discussion offers a practical way to bridge data gaps and improve model reliability.
Overall, the paper is a valuable contribution to the field of social simulation modelling. It provides a systematic and transparent approach to handling secondary data in modelling endeavours, making it accessible to both experts and non-experts in the field. While the paper is comprehensive, providing more illustrative examples or specific insights from the case study could further enhance its impact and clarity for a broader audience.

Are sufficient details of methods and analysis provided to allow replication by others? Partly
Are all the source data and materials underlying the results available? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Forced Displacement Modelling and Simulation
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
approach is well presented and the structure and flow of the paper are good with a logical progression from presenting the framework and then unpacking it as a case study. The Syrian refugee example is relevant and serves as a practical example of how the theoretical model can be implemented and operationalized.
The framework described to assess the quality of the data is useful. We would welcome greater explanation of how the traffic light colours were assessed, i.e. are there a series of criteria in each quality dimension that lead to a general score? Or is it an expert assessment of each dataset? Additional detail on this would help readers to operationalize the approach in their own research.
There is also no mention of statistical quality assurance frameworks such as the work by UN stats division (https://unstats.un.org/unsd/unsystem/documents/UNSQAF-2018.pdf) and the European Statistical System (https://ec.europa.eu/eurostat/documents/64157/4392716/ESS-QAF-V1-2final.pdf/bbf5970c-1adf-46c8-afc3-58ce177a0646). These are well researched and describe a broader set of quality dimensions that are relevant to this research. The UNSD SQAF in particular also describes the inherent compromises that need to be made between quality dimensions (i.e. prioritising timeliness over accuracy etc.) which could usefully be mentioned in this paper. The broader set of quality dimensions could also help to provide a set of criteria that could be more broadly applicable to other research. For example, greater mention could be made about using data responsibly. It's implicit in this research example as only public data is used, but in other research some of the qualitative individual data may contain personal data, for which the researchers and ultimately the data owners must take responsibility in assessing whether it is appropriate to publish such data or even to use it in the research. This is well covered in this IASC paper on data responsibility (https://interagencystandingcommittee.org/operationalresponse/iasc-operational-guidance-data-responsibility-humanitarian-action).
A broader set of quality dimensions would also cover data availability / data gaps as well, which are mentioned in the results section. I.e. for the quality dimensions to be truly useful they would ideally help to assess both the available data as well as what is potentially missing.
In terms of terminology, we would suggest to avoid "asylum migration" as a term as it conflates two reasons for people moving. Either the Syrians are seeking asylum in order to become refugees or they are migrating for other reasons. We would suggest "Syrian refugees arriving in Europe" for the article.
Thanks again for the opportunity to review, all work like that expressed in this paper, to operationalise quality improvements to data and data processing are very welcome and we would be interested to collaborate in future activities in this domain.

Are the conclusions drawn adequately supported by the results? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Forced displacement population statistics
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

Alejandra Rodriguez-Sánchez University of Potsdam, Potsdam, Germany
This paper presents an approach for collating, assessing, and selecting appropriate secondary data to build agent-based simulation models of migration. It is applied to the specific case of the Syrian asylum migration to Europe during the 2010s.
Evaluating the information available for modelling analysis is a fundamental aspect of model development. My main take from the paper is on the importance of incorporating data critically in the building of an ABM for migration.
However, here are some recommendations and questions that might improve the manuscript and the exposition of the method:
○ The paper could make clearer how the discussion of the data quality led to decisions in the modeling part. This is left implicit, but readers could benefit from that.
○ Figure 1, a word cloud, is rather uninformative and includes terms that aren't relevant (e.g., rather, since, other, detail etc.). I am not sure if this was the starting point for doing something else on the basis of word frequency, like classifying the types of data sources.
○ The quality assessment of the data sources seems rather ad hoc. Here is perhaps where some notion of a standard could be applicable or at least determined. What is a high quality data source? What is a bad quality? Isn't a bad quality data source useful for modelling in the absence of any other information? On a similar note, depending on the type of data, an evaluation of its quality will have to follow different criteria. How were criteria differently reconciled here? And more specifically, how did you arrive at the score in the table? Would it be possible to describe, at least for some case, how did you come to evaluate a data source as green-amber v. amber? In general, the quality of a data source should be evaluated on the basis of a given model (how useful is it to model that specific element of the phenomenon to be simulated). Is it something like the more detailed the model, the less useful aggregate data might be?
○ If this piece is to serve as a template, as described by the authors, it should have such a structure. The diagram in figure 4 is, in this sense, quite illustrative, but I miss a reflection of each of the elements, perhaps in a separate paragraph, that walks the reader through an example of a data source and how it affected the modelling.
○ What factors would we have to consider if we were interested in modeling other forms of migration or migration in other parts of the world? How specific or transferable are the different steps in this template?
○ All in all, authors are attempting to provide guidance on how to incorporate data into the building of ABMs, which would be a great contribution.

Is the work clearly and accurately presented and does it engage with the current literature? Yes
Is the study design appropriate and is the work technically sound? Partly

Are sufficient details of methods and analysis provided to allow replication by others? Partly
Are all the source data and materials underlying the results available? Yes

Are the conclusions drawn adequately supported by the results? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Migration, family demography, causal inference
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Figure 1. Textual description of the Syrian migration data inventory in a word cloud. Source: Nurse and Bijak (2021), created with WordItOut.com.

Figure 2. Focus, data types, and collection modes in the Syrian migration data inventory (numbers of sources). Source: Nurse and Bijak (2021).

Figure 3. Quality dimensions of data sources included in the Syrian migration data inventory (numbers of sources). Source: Nurse and Bijak (2021).

Figure 4. Iterative process for assembling and evaluating secondary data for social simulation modelling. Source: Own elaboration.
Reviewer Report 24 January 2024
https://doi.org/10.21956/openreseurope.16843.r37138
© 2024 Rodriguez-Sánchez A. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.