A clustering approach to improve spatial representation in water-energy-food models

Currently available water-energy-food (WEF) modelling frameworks to analyse cross-sectoral interactions often share one or more of the following gaps: (a) lack of integration between sectors, (b) coarse spatial representation, and (c) lack of reproducible methods of nexus assessment. In this paper, we present a novel clustering tool as an expansion to the Climate-Land-Energy-Water-Systems modelling framework used to quantify inter-sectoral linkages between water, energy, and food systems. The clustering tool uses Agglomerative Hierarchical clustering to aggregate spatial data related to the land and water sectors. Using clusters of aggregated data reconciles the need for a spatially resolved representation of the land-use and water sectors with the computational and data requirements to efficiently solve such a model. The aggregated clusters, combined with energy system components, form an integrated resource planning structure. The modelling framework is underpinned by an open-source energy system modelling tool, OSeMOSYS, and uses publicly available data with global coverage. By doing so, the modelling framework allows for reproducible WEF nexus assessments. The approach is used to explore the inter-sectoral linkages between the energy, land-use, and water sectors of Viet Nam out to 2030. A validation of the clustering approach confirms that underlying trends in actual crop yield data are preserved in the resultant clusters. Finally, changes in cultivated area of selected crops are observed and differences in levels of crop migration are identified.


Introduction
The past two decades have seen a resurgence in the practice of national planning. A total of 134 countries, home to almost 80% of the world's population, had national development plans by 2018, up from 62 in 2006 (Chimhowu et al 2019). In many developing countries this renewed interest has come as a response to development challenges such as those embodied in the Millennium Development Goals and the Sustainable Development Goals (SDGs), because the comprehensive and complex nature of these goals requires a comprehensive planning and monitoring framework (Chimhowu et al 2019).
Incorporating scientific and analytic evidence into decision-making processes is crucial to the success of national planning and policy processes. A systemic approach that helps identify and manage trade-offs while maximising co-benefits is needed to achieve the SDGs (Independent Group of Scientists appointed by the Secretary-General 2019). More than half the countries with national plans have embraced this view and have put emphasis on evidence-based processes, including economic modelling and scenario analysis, to develop their plans (Chimhowu et al 2019).
There is growing recognition of the need to consider interdependencies between water, energy and food sectors for effective resource planning (Miralles-Wilhelm 2016, Smajgl et al 2016, Kurian 2017, Monier et al 2018). This is reflected in the growing number of water-energy-food (WEF) models and frameworks found in the literature. These include a mix of established modelling frameworks and more recent approaches that focus on the WEF resource nexus. Examples of the former include: IMAGE (van Vuuren et al 2015), WITCH (Bosetti et al 2015), REMIND (Luderer et al 2015), and GCAM (Calvin et al 2019); while examples of the latter include: NEST (Vinca et al 2020), WHAT-IF (Payet-Burin et al 2019), AWASH (Rising 2020), and K-WEFS (Purwanto et al 2021). Endo et al (2020) and Albrecht et al (2018) provide comprehensive reviews of available WEF models and frameworks. The former analyses 25 WEF reviews, while the latter systematically assesses the analytical approaches of 73 WEF studies and then focuses its analysis on 18 selected studies that represent the variety of methods used. However, these and other reviews of available WEF models highlight important gaps that exist in many currently available models and approaches (Kaddoura and El Khatib 2017, Albrecht et al 2018, Shannak et al 2018). The gaps identified include: (a) falling short of capturing interactions between water, energy, and food systems; (b) coarse spatial representation of resource systems; and (c) the lack of reproducible methods of nexus assessment.
Several nexus studies focus on dual-sector interactions, e.g. water-energy, water-food, or food-energy. In particular, past approaches were typically water-centric (Smajgl et al 2016). However, in order to improve coordination of cross-sector policies, an integrated analysis that considers water, energy, and food sectors simultaneously and equally is required (Chang et al 2016, Miralles-Wilhelm 2016, Albrecht et al 2018). Such analyses need to be underpinned by similarly integrated models or frameworks. In this respect, currently available WEF models can be broadly divided into soft- and hard-linked models (Brouwer et al 2018). The former is characterised by modelling frameworks that consist of sector-specific modules (CAPRI: Blanco et al 2017; AgMIP: Lampe et al 2014, Ruane et al 2017). Cross-sectoral interactions in these models are represented by sharing of input and output data between modules. An advantage of such an approach is that the dynamics of each sector can be captured in detail. However, this approach has some drawbacks: each of the sector-specific models will prioritise the optimal configuration of that sector over others, and there is scope for inconsistencies between input data and assumptions used in the different sector-specific models or modules. As mentioned above, water, energy, and food sectors need to be considered simultaneously and equally, through a fully integrated approach, to sufficiently capture their interdependencies. The approach presented here is a fully integrated (hard-linked) modelling framework.
The spatial and temporal resolution of different nexus models varies greatly in the literature. MuSIASEM (Giampietro et al 2009, Gerber and Scheidel 2018) and the WEF Nexus Tool (Water-Energy-Food Nexus Working Group 2014, Daher and Mohtar 2015) apply to a point in time, though MuSIASEM can be applied at different spatial resolutions. Water Evaluation And Planning System (WEAP) and associated models do not include detailed spatial resolution (Sieber and SEI-US Center 2020). Prior climate, land-use, energy, and water systems (CLEWs) studies have used administrative boundaries for spatial disaggregation even though many regions crossed agro-ecological zones (Rogner et al 2009, Fischer et al 2013, Welsch et al 2014). Al-Saidi and Hussein (2021) highlight the need for spatial aspects to be re-examined and better analysed in the WEF nexus. This includes issues such as cross-regional interdependences, trade, globalisation, relocations of production, and international collaboration. Zhang et al (2019a) find that the spatial scope and number of sectors included in WEF studies are increasing over time but that increased resolution is needed. Rasul (2014) highlights the importance of considering inter-regional connections in performing WEF analyses, while Zhang et al (2018) find that the integration of analyses across temporal and spatial boundaries and the evaluation of the combined nexus are critical. Although progress is being made, there is a need to expand the spatial resolution of existing models to address these shortcomings (Aryanpur et al 2021, Martínez-Gordón et al 2021).
Tools and methods must be accessible and reproducible, but many of the tools used for WEF analysis are closed source and not openly available. The WEF Nexus Tool provides no information on model details nor on how to apply the model to different jurisdictions (Water-Energy-Food Nexus Working Group 2014). WEAP requires a paid license (Sieber and SEI-US Center 2020). MuSIASEM is a framework for analysis and there is no 'tool' that supports the analyses (Giampietro et al 2009, Gerber and Scheidel 2018). A number of WEF related studies develop their own code for nexus analysis but do not publish this code open source (Zhang and Vesselinov 2017, Bieber et al 2018, Li et al 2019, Zhang et al 2019b, Sadeghi et al 2020, Yan et al 2020). NEST is open source but its econometric model details have not been made available and the tool requires a commercial solver (Parkinson et al 2018, Vinca et al 2020, Wada et al 2019). The CLEWs framework has a fully open toolset.
In this paper, we present a modelling framework that aims to address some important gaps identified in existing WEF models. We include results from an application of the framework to quantify the cross-sectoral interactions between water, energy, and food in Viet Nam. Specifically, we describe a novel modelling framework which incorporates the following features:
• Fully integrated approach, considering WEF sectors simultaneously and equally.
• Consideration of spatial aspects of WEF resources and trade-offs with computational complexities.
• Flexible, scalable, and accessible, allowing for reproducible nexus assessments across different scales, geographies, and political contexts using open code and data.
The rest of the paper is organised as follows. The components of the modelling framework are described in section 2. This includes details on how the water, land-use, and energy sectors are represented. Further, a clustering tool developed to address cross-scale issues between the land-use and water sectors is presented. Section 2 concludes with a description of the underlying modelling tool, OSeMOSYS (Howells et al 2011), and the datasets used. Section 3 first provides a description of a case study of Viet Nam, where the modelling framework is applied. An analysis of the main results from applying the framework to the selected case study is then presented. It includes a discussion of the main insights that can be derived from this application, specifically for Viet Nam and more broadly for other WEF nexus assessments. Section 4 summarises the main conclusions of the study, mentioning some limitations, areas of future work, and recommendations to advance WEF modelling frameworks.

Methods
The methodology developed and applied in this study, presented below, aims to incorporate three main features: a fully integrated representation of WEF systems; an improved consideration of spatial aspects of WEF systems; and the development and use of a flexible, open, and accessible modelling framework.

Clustering tool to aggregate spatial data
The consideration of spatial aspects in WEF models is especially important for the water and land-use sectors. Resource dynamics such as water availability, precipitation, crop suitability, harvested area, and land-use type often vary significantly by location (Shannak et al 2018, McGrane et al 2019, Liang et al 2020). In order to capture spatial differences, the water and land-use sectors are represented as sub-regional units or 'clusters'. These sub-regional units can be derived based on, e.g., a coarse resolution of administrative regions or detailed spatial data on agro-climatic features. While data and computational requirements in the former are relatively low, important spatial features such as soil characteristics and water availability will not be captured. In the latter case, resource dynamics can be captured at high spatial detail. However, incorporating high resolution spatial data in a long-term optimisation model will significantly increase model complexity, data requirements, and the computational burden. In order to reconcile these two considerations, a spatial data aggregation technique is proposed and applied in this study. The spatial aggregation is based on a clustering algorithm that combines 'cells' of high-resolution spatial data into 'clusters' based on their similarity across a set of features. The specific algorithm employed here is Agglomerative Hierarchical clustering using Ward's method (Lance and Williams 1967); in this technique, individual data points are successively combined into 'clusters' with the aim of minimising the variance within each cluster. The total variance, measured as the error sum of squares (ESS), is calculated from the squared differences between each data point in a cluster and the cluster average, for the one or more selected features on which the clustering is based. The number of clusters is user-defined and can range between 1 (maximum ESS) and the total number of data points being clustered (ESS = 0).
The improvement in accuracy of representing underlying data by increasing the number of clusters is quantified in figure 8 (appendix).
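The merging rule described above can be illustrated with a minimal, naive sketch in plain Python. This is not the clustering tool itself: the cell values are hypothetical, the function names are illustrative, and a production implementation would normally rely on an optimised library routine rather than this exhaustive pairwise search.

```python
from itertools import combinations


def centroid(points):
    """Feature-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def ess(points):
    """Error sum of squares: squared distance of each point to the centroid."""
    c = centroid(points)
    return sum((p[k] - c[k]) ** 2 for p in points for k in range(len(c)))


def ward_clusters(cells, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    Starts with one cluster per cell and repeatedly merges the pair of
    clusters whose merge adds the least to the total ESS.
    """
    clusters = [[cell] for cell in cells]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            increase = (ess(clusters[i] + clusters[j])
                        - ess(clusters[i]) - ess(clusters[j]))
            if best is None or increase < best[0]:
                best = (increase, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical attainable yields (tonnes/ha) for two crop combinations per cell.
cells = [[3.0, 9.5], [3.2, 10.0], [2.9, 9.8], [6.1, 4.0], [6.3, 4.2], [1.0, 1.1]]

for k in (1, 3, len(cells)):
    total = sum(ess(c) for c in ward_clusters(cells, k))
    print(k, round(total, 3))
```

Running the loop for increasing numbers of clusters shows the behaviour stated in the text: the total ESS is largest with a single cluster and falls to zero when every cell is its own cluster.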
The aggregation of spatial data using clustering is well-established and is employed in several geospatial analyses (Ip et  . Several other clustering approaches exist, such as k-means, mean-shift, and DBSCAN, each with its own advantages and disadvantages (Rodriguez et al 2019). The main reasons for selecting hierarchical clustering are that the clustering process remains the same for any number of clusters, that it offers relatively high accuracy for higher numbers of features, and that it avoids the need to select starting points.
The Agglomerative Hierarchical clustering algorithm applied in this study is based on crop suitability data across a range of crops (e.g. maize, rice paddy, sugarcane, cassava), under multiple water supply options (rain-fed, irrigated) and input/management levels (low, intermediate, high). The definition of each category is provided in the appendix. Each combination of crop, water supply, and management level is referred to as a 'crop combination' here. The clustering tool clusters the 'cells' of land in the country, region, or sub-region based on crop suitability, known as agro-climatically attainable yield, which is available at 5 arc-min resolution (approx. 10 km × 10 km) across a range of crop combinations. Spatial data on land cover, evapotranspiration, crop water deficit, and precipitation, also at 5 arc-min resolution, are then mapped to the clusters based on their coordinates. A summary of the proposed clustering approach is shown in figure 1. The study presented in this paper considers 15 clusters. The choice of the number of clusters is informed by a sensitivity analysis on the decrease in total 'error' (ESS) with each additional cluster. This is described in further detail in the appendix (under 'Analysis of clustering results').
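As an illustration of how a second spatial layer might be attached to the clusters by coordinates, the sketch below averages a gridded layer over the cells assigned to each cluster. The coordinates, values, and function name are hypothetical, and taking the mean is an assumption made for illustration; a layer such as land cover would more naturally be summed as an area.

```python
from collections import defaultdict

# Hypothetical cluster assignment per grid cell, keyed by (lat, lon) of the cell centre.
cell_cluster = {(21.04, 105.79): 0, (21.04, 105.87): 0, (10.79, 106.62): 1}

# Hypothetical annual precipitation (mm) on the same 5 arc-min grid.
precipitation = {
    (21.04, 105.79): 1700.0,
    (21.04, 105.87): 1800.0,
    (10.79, 106.62): 1950.0,
}


def aggregate_by_cluster(layer, assignment):
    """Average a gridded layer over the cells that belong to each cluster."""
    grouped = defaultdict(list)
    for coords, cluster_id in assignment.items():
        if coords in layer:  # skip cells with no data in this layer
            grouped[cluster_id].append(layer[coords])
    return {cid: sum(vals) / len(vals) for cid, vals in grouped.items()}


cluster_precip = aggregate_by_cluster(precipitation, cell_cluster)
print(cluster_precip)
```

Keying both layers on the same cell coordinates is what allows data that were not used in the clustering itself to inherit the cluster boundaries.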
Each resulting cluster represents cells of land with similar crop suitability, measured as 'agro-climatically attainable yield', across the geographic area of interest. The clustering process can be carried out over a range of different scales (global, regional, national, sub-national) and can be combined with other geospatial considerations such as administrative boundaries or other resource systems such as river basins and renewable energy potential. The clustering tool developed in this study is being maintained as an open-source project.
Figure 1. Summary of proposed Agglomerative Hierarchical clustering approach based on crop yields across a range of crops, water supply options, and management levels. In this example, high-resolution spatial data on land suitability (crop yield in tonnes ha⁻¹) for six crops are aggregated into 15 clusters. As part of the clustering process, land cover and precipitation data are also aggregated accordingly.

Model structure and representation of WEF sectors
The clustering process described above allows for high-resolution spatial data to be represented in a modelling framework; a linear optimisation model is used in this study. Each cluster is represented in the model as land that can be allocated to a range of uses, under a set of biophysical constraints. Each cluster has several inputs and outputs that represent the allocation of land and link it to other resources, namely water and energy. The model structure of the land and water sectors is shown in figures 2 and 3. Each potential use of a land cluster is associated with different costs, water requirements, energy requirements, water outputs, and crop production. The model's objective function aims to allocate the land to different uses 'optimally', i.e. at the lowest overall cost over the entire modelled period under the given constraints. The total system cost being minimised is broadly divided into three categories: fixed, variable, and capital. Costs in each sector are represented as one of these three categories. For instance, fuel costs and maintenance costs for power plants are represented as variable and fixed costs, respectively. The former depends on the usage of a given power plant while the latter is fixed for a given power plant capacity, independent of its level of usage. In the land sector, the cost of installing a new irrigation system, to convert rainfed agriculture to irrigated, is represented as a capital cost that is incurred in the year of construction; the system is then available for the operational lifetime of the asset. All cost components, representing different categories and sectors, are aggregated as the total system cost in the objective function of the model. This total system cost is minimised during the optimisation process to provide a least cost 'optimal' solution while meeting user-defined constraints and demands.
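The three cost categories can be written schematically as a single objective. The expression below is a simplified sketch rather than the full model formulation: the symbols are illustrative, and discounting and salvage values, which a complete formulation would typically include, are omitted. Here $t$ indexes technologies (including land uses and irrigation systems), $y$ indexes model years, $N_{t,y}$ is newly installed capacity, $K_{t,y}$ is total installed capacity, and $A_{t,y}$ is the level of activity.

```latex
\min \; C^{\text{total}} = \sum_{y \in Y} \sum_{t \in T}
  \left( c^{\text{cap}}_{t,y}\, N_{t,y}
       + c^{\text{fix}}_{t,y}\, K_{t,y}
       + c^{\text{var}}_{t,y}\, A_{t,y} \right)
```

In this reading, a new irrigation system contributes through the capital term in its construction year, plant maintenance through the fixed term for every year the capacity exists, and fuel use through the variable term in proportion to activity.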
The irrigation requirements are derived from clustered spatial data for 'crop water deficit' across all crop combinations. Similarly, evapotranspiration and precipitation values from the clustering are also used, the former calculated for all crop combinations. Values for run-off and groundwater recharge are estimated based on a fixed ratio between the two after considering evapotranspiration. The water inputs (precipitation and irrigation) and outputs (evapotranspiration, run-off, and groundwater recharge) are balanced over all clusters. The crop outputs from each cluster are summed up to meet an exogenous total demand.
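The water balance described above can be sketched for a single land use in a single cluster as follows. The numerical values and the `runoff_share` parameter are hypothetical; the text states only that a fixed ratio between run-off and groundwater recharge is applied after evapotranspiration has been accounted for.

```python
def water_balance(precipitation, irrigation, evapotranspiration, runoff_share):
    """Split the water left after evapotranspiration into run-off and recharge.

    runoff_share is an assumed fixed fraction of the remainder that becomes
    surface run-off; the rest is treated as groundwater recharge.
    """
    remainder = precipitation + irrigation - evapotranspiration
    if remainder < 0:
        raise ValueError("evapotranspiration exceeds water inputs")
    runoff = runoff_share * remainder
    recharge = remainder - runoff
    return runoff, recharge


# Hypothetical values in mm per year for one land use in one cluster.
runoff, recharge = water_balance(1800.0, 200.0, 1200.0, runoff_share=0.75)
print(runoff, recharge)
```

By construction, inputs (precipitation and irrigation) equal outputs (evapotranspiration, run-off, and recharge), which is the balance the model enforces over all clusters.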
The model presented here considers all three WEF sectors equally and simultaneously. The representation of the energy sector follows that of traditional energy system models such as TIMES-MARKAL (Loulou and Labriet 2007) and MESSAGE (MESSAGE-IIASA 2020). The energy sector representation comprises a set of power plants, electricity transmission and distribution, fuel extraction, imports and exports. The model structure of the energy sector is shown in figure 4. The figure shown focuses on the electricity sector. However, the model includes other energy vectors in each demand sector. These include: biomass in the commercial, residential, and agricultural sectors; diesel, petrol, and natural gas in the transport sector; coal, oil, and natural gas in the industry sector. Interlinkages between the water, energy, and land use systems are also represented in the model, as shown in figures 3 and 4. Groundwater recharge and surface water resulting from different land uses are used as inputs in the energy and water sectors, for cooling powerplants and public water supply, respectively. Similarly, electricity from the energy sector is used as an input for irrigation in the agriculture sector.
Figure 4. [...] disaggregated in the model by region, technology-type (e.g. steam turbine, combined cycle gas turbine), and age (existing or planned). These sub-categories are combined here for the purpose of illustration. Linkages between water, land-use, and the energy sector are considered through (1) electricity for irrigation pumping, (2) land requirements for powerplants, (3) cooling water for power plants, (4) water and land-use for biomass production.
The model is calibrated for the period 2016-2018 to match historical data. This includes installed powerplant capacities, energy demand by sector, land cover by land cover type, and crop area by water supply type (irrigated or rainfed).

Model code and data
The framework presented in this study is built using OSeMOSYS (Howells et al 2011), a widely used open source modelling tool. OSeMOSYS is a dynamic, bottom-up, multi-year system modelling tool that solves a linear program to determine the least cost investment strategy required to satisfy exogenously defined resource demands. Technical, economic, and environmental implications associated with the identified least-cost systems can be easily extracted from the model results. Like other optimisation models, OSeMOSYS assumes a perfect market with perfect competition and foresight. OSeMOSYS has been applied across a range of studies at varying geographic scales to address diverse research questions (Welsch et
The proposed modelling framework is developed around open, publicly available datasets with global coverage wherever possible. Taken together with the underlying open source model, this allows the framework to be applied in a wide range of geographical contexts, makes data and assumptions accessible, and supports the reproducibility of nexus assessments. The clustering methodology was applied using high-resolution spatial data from the Global Agro-Ecological Zones (GAEZ) data portal (FAO and IIASA 2021) for land cover, crop suitability across a range of water supply options and management levels, annual rainfall, evapotranspiration, and crop water deficit. However, data from GAEZ is used as an example; the clustering tool can readily use alternate sources of spatial data on these factors wherever available and relevant. This categorisation by water supply option and management level is significant within the proposed modelling framework, which aims to optimise long-term investment decisions in a WEF system, such as the cost of shifting from one management practice to another to obtain higher crop yields.

Results and discussion
In this section we first describe a real-world case study, a WEF model of Viet Nam, where the proposed methodology is applied. The main findings from this case study are then discussed.
Viet Nam's rapid economic growth over the past two decades has been accompanied by soaring demands for energy, water, and land resources (Asian Development Bank 2015). In the WEF model for Viet Nam presented here, the energy sector components were first developed in OSeMOSYS with an outlook to the year 2030. The energy sector representation includes all powerplants in Viet Nam (Vītiņa et al 2017), aggregated by technology categories (e.g. coal, hydropower, natural gas, solar PV, wind), region (North, Central, South), and stage of completion (existing or planned). In total, the model comprises 512 'technologies' with 440 'commodities' connecting them. Each year of the model is divided into six representative 'timeslices' based on two seasons and three day-parts (base, intermediate, and peak) in each season.
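The six timeslices follow directly from combining the two seasons with the three day-parts. The sketch below enumerates them and assigns each a fraction of the year; the season labels and all fractions are hypothetical placeholders, since the actual season definitions and durations are not given here.

```python
from itertools import product

seasons = ["season_1", "season_2"]  # hypothetical labels
day_parts = ["base", "intermediate", "peak"]

timeslices = [f"{s}_{d}" for s, d in product(seasons, day_parts)]

# Hypothetical split of the year; the fractions assigned to the
# timeslices must sum to one so that the whole year is covered.
season_share = {"season_1": 0.5, "season_2": 0.5}
day_part_share = {"base": 0.5, "intermediate": 0.25, "peak": 0.25}
year_split = {
    f"{s}_{d}": season_share[s] * day_part_share[d]
    for s, d in product(seasons, day_parts)
}

print(timeslices)
print(sum(year_split.values()))
```

Representing the year with a handful of slices keeps the linear program small while still distinguishing peak from base conditions in each season.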
The land-use and water systems are then added to the model. The first step of this process is to apply the clustering tool to aggregate high-resolution spatial data on agro-climatic features into clusters to be represented in the model. The crops selected to be represented individually in this study are cassava, coffee, maize, rice (paddy), and sugarcane. The remaining food crops are combined as vegetables and fruits. The land suitability for each crop is considered under two water supply options (irrigated and rain-fed) and three management levels (low, intermediate, high).
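To make the size of the feature set concrete, the crop combinations implied by this setup can be enumerated as below. The sketch assumes that the combined 'vegetables and fruits' group is treated like the individually represented crops, giving six crop categories in total.

```python
from itertools import product

crops = ["cassava", "coffee", "maize", "rice", "sugarcane", "vegetables and fruits"]
water_supply = ["rain-fed", "irrigated"]
management = ["low", "intermediate", "high"]

# One attainable-yield feature per (crop, water supply, management level).
crop_combinations = list(product(crops, water_supply, management))
print(len(crop_combinations))
```

Under this assumption each cell of land is described by one attainable yield per crop combination, and the clustering groups cells that are similar across all of them at once.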

Comparison of crop yield values
The primary aim of the proposed clustering tool is to reconcile the need for a spatially resolved WEF model with the computational complexity of optimising such a model. The clusters resulting from the tool must therefore accurately represent trends of the actual values in the high-resolution spatial data. The WEF model utilises these clustered crop yield values to represent competition for land between different crops. Crop yields can typically be improved by investing in irrigation systems, improved cropland management, or both. The extent to which crop yields can be improved through these options varies by both crop and geographic region. Crop yield values for different crop combinations, i.e. crops under different water supply options and management levels, are shown in figure 5. These values are compared to those resulting from the proposed clustering tool.
The figure shows that the resultant clusters capture both the dominant and extreme values within the actual spatial data. Moreover, these trends are captured across all crop combinations. This can be seen by comparing the actual and clustered values in the figure on three measures: (a) range and distribution of values; (b) median values; and (c) preservation of trends across crop combinations. The third point is especially important to capture since investment decisions in the WEF model are based on the potential improvement in yields across the crop combinations.
The yield of cassava, for instance, can be improved from around 3 tonnes ha⁻¹ under rainfed conditions and low management level to around 10 tonnes ha⁻¹ under irrigated conditions and high management level (both median values). A similar tripling of attainable crop yield is observed across the remaining crops. This trend in the actual crop yield data is preserved in the clustered crop yield values as well.
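One simple way to make such a comparison for a single crop combination is sketched below: each cell value is replaced by the mean of its cluster, and the minimum, median, and maximum of the two series are set side by side. The yields and cluster labels are hypothetical, and this is an illustration of the kind of check described rather than the validation procedure used in the study.

```python
from statistics import median

# Hypothetical cell-level yields (tonnes/ha) and the cluster each cell belongs to.
cell_yield = [2.6, 3.0, 3.1, 3.4, 9.0, 10.0, 10.4]
cell_cluster = [0, 0, 0, 0, 1, 1, 1]


def clustered_values(values, labels):
    """Replace each cell value with the mean of its cluster."""
    sums, counts = {}, {}
    for v, c in zip(values, labels):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in labels]


def summary(values):
    """Minimum, median, and maximum of a series of yields."""
    return min(values), median(values), max(values)


clustered = clustered_values(cell_yield, cell_cluster)
print("actual   ", summary(cell_yield))
print("clustered", summary(clustered))
```

Because cluster means can never lie outside the range of the cells they summarise, the clustered range is always at least as narrow as the actual one; the question a validation must answer is how much of the spread and ordering survives.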

Land-use change
The use of high-resolution spatial data allows for other insights in addition to the national level results discussed above. For instance, it allows for an assessment of potential crop migration to minimise the total system cost over the modelled time horizon. Model results for land-use change by land-use type are shown in figure 6. It can be observed that areas under cultivation for a crop show one of the following trends: a significant increase in all main areas of existing cultivation (low crop migration); a significant increase in some main areas of existing cultivation and a decrease in others (moderate crop migration); or a shift in the main areas of existing cultivation from one to another (high crop migration).
Sugarcane, and vegetables and fruits, fall into the category of low crop migration; the main areas of existing cultivation for each crop (North for sugarcane and South-Central for vegetables and fruits) are further consolidated in the future. The former crop goes from occupying land shares of around 10%-15% in 2020 to around 20%-35% by 2030. In the latter case, the land share in occupied clusters increases from around 15% in 2020 to almost 30% by 2030. Meanwhile, cassava, maize, and rice follow a moderate crop migration trend; the land share allocated to cassava increases in the North-Central region but decreases in the North and South during the same time period. Similarly, maize cultivation decreases in the South-Central region but more than doubles in parts of the North, from around 20% in 2020 to over 40% by 2030. Rice is a widely cultivated crop in Viet Nam but particularly so in the Red River Delta in the North-East and the Mekong River Delta at the Southern end of the country. The land share allocated to rice cultivation is further increased in these fertile delta regions. At the same time, some decrease in land share is observed in some parts of the North. This is likely due to competition with other crops such as sugarcane as well as other land-uses such as forests and grasslands in these regions. 'Competition' in this case is represented by assigning a monetary value to forests and grasslands that is assumed to represent the value of ecosystem services they provide. Finally, coffee shows a trend of high crop migration; the main areas of cultivation shift from the North-East to more suitable areas in the North-Central regions.
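The three migration categories can be expressed as a simple rule on the land share of a crop per cluster in two years. The sketch below is illustrative only: the function, the 10% cut-off for what counts as a 'main area', and the example shares are assumptions and are not taken from the model.

```python
def classify_migration(share_start, share_end, main_threshold=0.10):
    """Classify crop migration from land shares per cluster in two years.

    main_threshold is an assumed cut-off for calling a cluster a
    'main area' of cultivation; the text does not give one.
    """
    main_start = {c for c, s in share_start.items() if s >= main_threshold}
    main_end = {c for c, s in share_end.items() if s >= main_threshold}
    if not main_start & main_end:
        return "high"  # main areas shift from one set of clusters to another
    changes = [share_end.get(c, 0.0) - share_start[c] for c in main_start]
    if all(change > 0 for change in changes):
        return "low"  # every existing main area is consolidated
    return "moderate"  # some main areas grow while others shrink


# Hypothetical land shares loosely patterned on the trends described above.
consolidating = classify_migration(
    {"north_a": 0.10, "north_b": 0.15}, {"north_a": 0.20, "north_b": 0.35}
)
mixed = classify_migration(
    {"south_central": 0.20, "north": 0.20}, {"south_central": 0.10, "north": 0.40}
)
shifting = classify_migration(
    {"north_east": 0.20, "north_central": 0.02},
    {"north_east": 0.03, "north_central": 0.25},
)
print(consolidating, mixed, shifting)
```

The outcome is sensitive to the chosen threshold, so a rule of this kind is best read as a way of making the verbal categories explicit rather than as a robust classifier.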

Sector interactions
The WEF model provides insights on the evolution of energy, water, and land use systems over the modelled time horizon. Figure 7 below shows the main results for each sector as well as the interlinkages between sectors. The electricity generation mix is seen to move from a system that is evenly split between natural gas, hydro and coal to one that is natural gas dominated. This is a result of the increasing use of domestic gas in Viet Nam. An increasingly fossil-fuel dominated system, however, results in a significant increase in water demand for cooling powerplants, from 8 billion cubic meters (BCM) in 2016 to around 21 BCM in 2030, an increase of 160%. Further, it results in a dramatic increase in CO2 emissions; the power sector goes from emitting 70 million tonnes of CO2 in 2016 to 187 million tonnes of CO2 by 2030.
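The relative changes behind these power sector figures can be reproduced with a short calculation. The inputs are the rounded values quoted above, so the results are approximate; the percentage for CO2 emissions is not stated in the text and is derived here purely as arithmetic.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


cooling_water = percent_change(8, 21)    # BCM, 2016 to 2030
co2_emissions = percent_change(70, 187)  # million tonnes CO2, 2016 to 2030

print(f"cooling water: {cooling_water:.1f}%")
print(f"CO2 emissions: {co2_emissions:.1f}%")
```

With the rounded inputs the cooling water increase comes out slightly above the reported 160%, which is consistent with the 2030 value being quoted as 'around 21 BCM'.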
The land use sector, in the meantime, shows a significant expansion of agricultural land to meet the needs of a rapidly growing population as well as Viet Nam's ambitious crop export targets. Total agricultural land increases from 142 000 sq. km in 2016 to 214 000 sq. km in 2030, an increase of 51%. This expansion in agricultural land comes at the cost of decreasing forests and grasslands, which decline by 50% and 44% between 2016 and 2030, respectively. Further, the increase in agricultural land results in growing water demands for irrigation, which increase from 18 BCM in 2016 to 35 BCM in 2030, an increase of 89%.
These results highlight important sectoral and intersectoral dynamics that can be analysed through a WEF model. The scenario studied was a baseline or 'business-as-usual' scenario where the main policy is a relatively unambitious renewable power development master plan (PDP7). The renewable energy (RE) target in this plan is easily reached with hydropower alone, requiring little additional investment in other renewables such as wind and solar power (each representing around 3%-4% of generation capacity by 2030). A 'renewables-led pathway' would lead to around 1.1 gigatons fewer greenhouse gas emissions by 2030. In addition, it would also significantly lower sulphur dioxide and nitrogen oxide emissions. Under the current power development master plan (PDP7), the significant increase in emissions would have a serious impact on Viet Nam's air quality, which is already among the worst in Southeast Asia.
Crop demands are fixed in the modelling study presented. This is indeed a simplification. However, the model itself is able to represent dynamic crop demands that respond to, among other factors, export prices, changing yields, and water availability. There is no technical limitation preventing these from being included in a future iteration of the model. The focus in this model version is on improving the spatial aspects of the proposed approach for modelling a WEF nexus. Further, the results from the study do point to a tendency of crops to respond to changing yields across Viet Nam. Studying the impact of other factors on crop selection and the level of agricultural modernisation would therefore be a worthwhile extension.
The total projected water demands in the model (56 BCM) can be met with the total renewable water resources (RWRs) in Viet Nam (884 BCM). However, the spatial distribution of demands and resources needs to be considered in further detail to identify potential sources of unsustainable water withdrawals. In addition, 59% of Viet Nam's RWR is from external sources. This leaves it vulnerable to changes, both natural and man-made, in upstream water flow. This is already being seen in the increasing salination of freshwater resources downstream in places such as the Mekong River Delta. However, the representation of water dynamics such as salination is not yet included in the presented model.
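A back-of-the-envelope check of these national totals is sketched below. It assumes that the 56 BCM figure corresponds to the sum of the 2030 cooling water and irrigation demands reported in this section (21 and 35 BCM); the internally generated resource is derived from the stated 59% external share and is not a figure given in the text.

```python
cooling_demand_bcm = 21     # power plant cooling, 2030
irrigation_demand_bcm = 35  # irrigation, 2030
total_rwr_bcm = 884         # total renewable water resources
external_share = 0.59       # share of RWR originating outside Viet Nam

total_demand_bcm = cooling_demand_bcm + irrigation_demand_bcm
demand_share_of_rwr = total_demand_bcm / total_rwr_bcm

# Derived quantity: the part of the resource generated inside the country.
internal_rwr_bcm = total_rwr_bcm * (1 - external_share)
demand_share_of_internal = total_demand_bcm / internal_rwr_bcm

print(total_demand_bcm)
print(f"{demand_share_of_rwr:.1%} of total RWR")
print(f"{demand_share_of_internal:.1%} of internally generated RWR")
```

Even measured against the internally generated resource alone, demand is a modest fraction at the national level, which is why the spatial distribution of demands and resources, rather than the national total, is the relevant question.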
The aim of such a model is to allow for more extensive scenario analyses that help identify policies and actions that can positively impact multiple sectors simultaneously. No such specific policies were included in the presented study. However, the large-scale deployment of wind power has the potential to do so. It would lead to a shift away from high emission power generation capacity. This would bring the additional benefit of reducing cooling water requirements for fossil fuel power plants. On the land use side, the parts of Viet Nam with the highest wind power potential are the highlands, where competition for land with rice cultivation, the major crop of the country, is less likely.

Conclusions
This paper presented an open-source modelling framework to conduct fully integrated analyses of water, energy, and food systems. A land use clustering tool was developed to support the integrated modelling framework. The modelling framework and clustering tool were used to carry out a long-term analysis of Viet Nam's WEF nexus.
The clustering tool addresses a gap in traditional system optimisation models by improving the representation of spatial data. The clusters, which are derived based on similarities in crop yields across a range of crops, are found to represent the range of actual underlying data well. This allows a balance between model complexity and accuracy in representing high resolution spatial data. The results related to crop migration shown here align with findings from other studies that assess crop migration in the context of climate change (Sloat et al 2020). They suggest that crop migration in Viet Nam allows for an increase in productivity and potentially avoids the most damaging impacts of land use change.
The clustering tool, however, has some important limitations. First, it does not consider seasonal differences in crop yields and multi-cropping methods. It uses the annual average agro-climatically attainable yield for each crop at a certain location under different conditions. Second, the environmental impacts of agricultural inputs such as fertilisers are not included. The modelling framework allows for the inclusion of these aspects, which is a planned next step of the model development. Finally, soil carbon dynamics associated with land use change are not assessed. While certainly important, these dynamics are considered extremely difficult to assess, given the asymmetry between the loss of soil carbon and the accumulation of carbon under different management options (Paustian et al 2019).

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: 10.5281/zenodo.5348228.

some mechanisation, is medium labour intensive, uses some fertiliser application and chemical pest, disease and weed control, adequate fallows and some conservation measures.

High level inputs
Under a high level of input (advanced management assumption), the farming system is mainly market oriented. Commercial production is a management objective. Production is based on improved or high yielding varieties, is fully mechanised with low labour intensity and uses optimum applications of nutrients and chemical pest, disease and weed control.

Analysis of clustering results
The aim of any clustering process is to combine similar data points while not losing important differences within the underlying data. As mentioned in section 2, the difference between the average value of a cluster and its underlying data is measured as the error sum of squares (ESS). Aggregating all data points in a single cluster results in the maximum ESS. Conversely, ESS = 0 when the number of clusters is equal to the number of underlying data points. Each data set follows a different trend of decreasing ESS when increasing the number of clusters. This trend can be captured using an 'Elbow graph', as shown in figure 8 below. This 'Elbow graph' is for the input data set for the Viet Nam (VNM) WEF model presented in this paper.
With each additional cluster, the total ESS (measured as a % share of maximum error) decreases. The marginal gain in accuracy of representation is highest when the number of clusters is low and decreases thereafter. There is a noticeable change in the marginal gain from cluster 5 onwards as compared to clusters 1-5. This results in an 'elbow', which gives this type of figure its name.
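The relationship between the number of clusters and ESS can be illustrated with a minimal sketch. It assumes a naive agglomerative procedure that merges clusters using Ward's minimum-variance criterion. The linkage actually used by the clustering tool is not stated in this section, so this choice, the synthetic data, and the function names `ess` and `agglomerate` are all hypothetical.

```python
import numpy as np

def ess(data, labels):
    """Error sum of squares: squared distance of each point to its cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        members = data[labels == k]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total

def agglomerate(data, n_clusters):
    """Naive agglomerative clustering (assumed Ward criterion): repeatedly
    merge the pair of clusters whose merge increases total ESS the least."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pa, pb = data[clusters[a]], data[clusters[b]]
                # Increase in ESS caused by merging clusters a and b.
                gap = pa.mean(axis=0) - pb.mean(axis=0)
                cost = len(pa) * len(pb) / (len(pa) + len(pb)) * float((gap ** 2).sum())
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(data), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Synthetic stand-in for normalised crop yields: 24 land cells x 3 crops.
rng = np.random.default_rng(0)
centres = np.array([[0.2, 0.8, 0.5], [0.9, 0.1, 0.4], [0.5, 0.5, 0.9]])
data = np.vstack([c + 0.02 * rng.standard_normal((8, 3)) for c in centres])

# ESS as a % share of the maximum error (all cells in a single cluster).
max_ess = ess(data, np.zeros(len(data), dtype=int))
ks = [1, 2, 3, 4, 5, len(data)]
ess_share = {k: 100.0 * ess(data, agglomerate(data, k)) / max_ess for k in ks}

for k in ks:
    print(f"{k:>2} clusters: ESS = {ess_share[k]:.2f}% of maximum")
```

Plotting `ess_share` against the number of clusters gives an elbow graph of the kind described above. In this synthetic case the bend would be expected near three clusters, because the data were generated around three profiles.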

Computational tests
Below is a comparison of the computational effort for three different test models, each with a different number of clusters. Table 1 also includes details on the total matrix size which indicates the size of the 'problem' to be solved.

Comparison of clustering with counterfactual (no clustering)
In this section, the results of a counterfactual (no clustering) are compared with those of 15 clusters. The series of figures below show crop yield (normalised) for a range of crop combinations. Each line in a figure represents crop suitability for a cell of land included in that cluster. For the case of no clustering, the lines represent all cells of land in the country.

No clustering
In the counterfactual case (with no clustering), all data is aggregated at the national level. It is clear from the figure below that crop suitability varies significantly between cells of land for different crops. These important differences would therefore not be captured in a model that uses un-clustered data as an input.

15 clusters (as used in the model presented in this paper)
Below is a series of figures (figures 10-24) showing the results of the clustering process. It is clear from the figures, when compared to the counterfactual case, that the cells of land with similar crop suitability across different crop combinations are clustered together. This is highlighted by the fact that the lines that were scattered widely in the counterfactual case (figure 9) are bunched more closely together when clustered (into 15 clusters in this case).
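The visual comparison described above can also be expressed as a simple number. This is a minimal sketch on synthetic data: the yields, the three-cluster assignment in `labels`, and the `mean_spread` measure are hypothetical and are not taken from the model inputs or outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical normalised yields: 30 land cells x 4 crops, around 3 profiles.
profiles = np.array([[0.9, 0.2, 0.6, 0.1],
                     [0.3, 0.8, 0.2, 0.7],
                     [0.5, 0.5, 0.9, 0.4]])
labels = np.repeat(np.arange(3), 10)  # assumed cluster assignment per cell
noise = 0.03 * rng.standard_normal((30, 4))
yields = np.clip(profiles[labels] + noise, 0.0, 1.0)

def mean_spread(rows):
    """Average, over crops, of the max-min range of yields across cells."""
    return float((rows.max(axis=0) - rows.min(axis=0)).mean())

# Counterfactual: every cell in the country is treated as one group.
national_spread = mean_spread(yields)
# Clustered: spread measured separately within each cluster.
cluster_spreads = [mean_spread(yields[labels == k]) for k in range(3)]

print(f"No clustering: {national_spread:.3f}")
for k, spread in enumerate(cluster_spreads):
    print(f"Cluster {k}:     {spread:.3f}")
```

A smaller spread within each cluster than across the whole country corresponds to the lines in the figures being bunched more closely together after clustering.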