Supporting energy system modelling in developing countries: Techno-economic energy dataset for open modelling of decarbonization pathways in Colombia

Decarbonization pathways have emerged as a pivotal element of global climate change mitigation strategies. Energy system modelling is widely recognized as a tool to support the design of informed energy decarbonization policies. However, the development of energy models relies heavily on high-quality input data, which may pose significant challenges in developing countries where data are difficult to access, incomplete, outdated, or inadequate. Moreover, while models may exist in these countries, they are not publicly available; therefore, their details cannot be retrieved, repeated, reconstructed, made interoperable, or audited (U4RIA*). This paper presents an open techno-economic energy dataset for Colombia that is U4RIA-compliant, as it can be used transparently to model decarbonization pathways and support energy planning in the country. Despite being country-specific, most of the data are technology-based and thus applicable to other countries. Diverse sources, assumptions, and modelling guidelines are described to facilitate the creation of new datasets. The dataset enhances the availability of energy data for policymakers, stakeholders, and researchers, not only in Colombia but also in other developing countries.



Value of the Data
• This dataset can be used to develop energy system models and assess decarbonization pathways for Colombia. Depending on the design of the modelling process, other policy insights can also be obtained.
• The dataset covers the entire chain of the energy system in Colombia, a scope that is not available in the current literature.
• Analysts, policymakers, and the scientific community can employ the dataset and the methods described for conducting energy studies not only in Colombia but also in countries with similar characteristics.
• The design of this dataset can serve as a benchmark for similar studies in energy system modelling, promoting the adoption of open data and transparency.

Objective
The provision of this open dataset is expected to promote greater transparency, collaboration, and knowledge sharing among the research and modelling communities, thereby advancing the state-of-the-art in energy modelling and contributing to more effective policymaking. This effort is in line with the U4RIA goals [1] , which encompass Ubuntu, Retrievability, Reusability, Repeatability, Reconstructability, Interoperability, and Auditability. Furthermore, this dataset can serve as an archetype for future energy system modelling assessments in developing countries.

Data Description
The data provided in this paper were gathered for the assessment of decarbonization pathways in Colombia using the OSeMOSYS framework [2] . However, the data available through this document are independent of the tool. The dataset presented was collected from websites, reports, and databases of international organizations and national entities, as well as from academic articles. It includes historical and/or projected data of end-use demands, capital and operating costs, efficiencies, operational lifetimes, capacity factors, residual capacities, emission factors, and energy availabilities. The dataset has been made openly accessible in the Mendeley Data Repository and can be accessed via the following link: https://data.mendeley.com/datasets/wmh4kz59wz/1 . For better understanding, technologies have been divided into 10 categories, covering the entire chain of the energy system from primary energy supply to end-use demands (see Fig. 1 ). The complete list of technologies is available in the repository in the Excel file SETS, under the sheet TECHNOLOGY.

Demands
The modelling included 37 end-use demands in different sectors. For instance, we represented energy demand for cooking services in the residential sector and energy demand for public passenger transport by taxi. End-use demands for 2021 were obtained from the national useful energy balance [3] . The projected demand data were calculated based on the expected growth of the gross domestic product (GDP). Table 1 shows an excerpt of the end-use demand   data for key years. The complete end-use demand data are available in the repository in the Excel file MODEL DATA, under the sheet DEMAND.

Capital Costs
The capital cost data represent overnight costs from 2021 to 2050 for different technologies. Projected data of capital costs were considered when available, otherwise constant values were assumed. Table 2 shows an excerpt of the capital cost data for selected technologies and key years. The complete capital cost data are available in the repository in the Excel file MODEL DATA, under the sheet CAPITAL COST.

Fixed Costs
Fixed costs represent operational and maintenance costs that are independent of the activity of technologies. Projected data of fixed costs were considered when available, otherwise constant values were assumed. Table 3 presents an excerpt of the fixed cost data for selected technologies and key years. The complete fixed cost data are available in the repository in the Excel file MODEL DATA, under the sheet FIXED COST.

Variable Costs
Variable costs represent the fuel costs in the case of primary energy supply and import technologies, and the variable non-fuel costs for the rest of the technology categories. The crude oil extraction cost included the cost of transport to the refinery [4] . The costs of imports were gathered from estimations performed by the Unit of Mining and Energy Planning (UPME [5] ). Table 4 presents an excerpt of the variable cost data for selected technologies and key years. The complete variable cost data are available in the repository in the Excel file MODEL DATA, under the sheet VARIABLE COST.

Emissions Factors
We considered equivalent emission factors that include carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O). The emission factor data were obtained from [6], which summarizes emission data from the Intergovernmental Panel on Climate Change (IPCC) and domestic data from national studies. Table 5 presents the consolidated data used for emission factor calculations as described in Section 3.5. For CCS technologies, an efficiency of 90% in capturing CO2 emissions is assumed [7], and emission factors by technology are recalculated accordingly. The limit of geological storage for CO2 is estimated at 360 MtCO2 considering the potential of CO2 injection for enhanced recovery in oil and gas reservoirs [8]. The complete emission factor data are available in the repository in the Excel file MODEL DATA, under the sheet EMISSION FACTOR.

Operational Lifetimes
The operational lifetime represents the standard value of a technology's lifespan in number of years. Table 6 presents an excerpt of the operational lifetime data for selected technologies. For technologies with no capital or fixed costs, a default lifetime of 100 years is assigned. The complete operational lifetime data are available in the repository in the Excel file MODEL DATA, under the sheet LIFETIME.

Efficiencies
Efficiency in the modelling process represents the ratio between output energy and input energy. Efficiency data is collected from multiple sources, as described in Section 3.7 . Due to uncertainty, efficiencies were assumed to be constant. When data was unavailable, efficiency was assumed to be equal to 1. Table 7 shows an excerpt of the efficiency data for some technologies. All technology efficiencies are expressed as a percentage, except for transport technologies, whose efficiency is expressed in Gpkm/PJ for passenger transport and Gtkm/PJ for cargo transport. In the case of crude oil refining, the output ratios for the different petroleum derivatives are estimated from [9] . Table 8 presents the output ratios for the refinery technology. The complete efficiency data is available in the repository in the Excel file MODEL DATA, under the sheet EFFICIENCY.

Capacity Factors
Capacity factor is defined as the ratio of energy produced by a generating unit for the period considered to the energy that could have been produced at continuous full operation during the  same period. For power generation and other conversion technologies, the capacity factors were reported using different sources and considerations as depicted in Section 3.8 . Table 9 presents an excerpt of the capacity factor data for power generation technologies. For demand technologies, we assumed that installed capacity is fully available, and capacity factor is equal to 1. The complete capacity factor data is available in the repository in the Excel file MODEL DATA, under the sheet CAPACITY FACTOR.

Residual Capacities
Residual capacity represents the installed capacity of a technology each year. Projected data of power generation included the planned power plants from 2022 to 2026, as detailed in the repository's Excel file SUPPLEMENTARY DATA, under the sheet ADDITIONAL PP. Other installed capacities for conversion technologies were assumed to remain constant until 2050. On the demand side, installed capacities were decreased linearly based on the technology's operational life. Table 10 presents an excerpt of the residual capacity data for selected technologies. The complete residual capacity data is available in the repository's Excel file MODEL DATA, under the sheet RESIDUAL CAPACITY.

Annual Potentials and Reserves
We have considered fossil fuel, nuclear, biomass, and renewable energy as categories of primary energy supply. Fossil fuel reserves were reported in volume and mass quantities according to official government data. Potential reserves of uranium for nuclear energy production were also included. The estimated potential of renewable energy technologies was derived from technical studies. Annual biomass production was estimated based on assumptions described in Section 3.10 . Table 11 provides a description of the availability of primary energy resources by category. The data is also available in the repository's Excel file MODEL DATA, under the sheet RESERVES-POTENTIALS.

Experimental Design, Materials, and Methods
The dataset was compiled through a comprehensive literature review. Data was gathered from websites, reports, and other databases of international organizations and national entities, as well as from academic articles. The raw data was organized, analyzed, processed, and standardized according to the requirements of the modelling. We provide detailed information on the data sources, assumptions, and processing methods implemented in the construction of the dataset in the following sections.

Demands
We obtained the end-use demands for 2021 from the national useful energy balance [3]. In the case of the transport sector, the end-use demand was estimated using data on consumed energy per vehicle type from [3] and vehicle efficiency from [18, 19], as shown in Eq. (1). The projected demand data were calculated by multiplying the baseline 2021 demands by the expected percentage increase in GDP. For 2022, the percentage increase was set at 8% due to the rebound from the COVID-19 pandemic [20]. For the period 2023-2050, the rate was estimated at 3.2% per year based on the average GDP growth observed during the period 2012-2019 [21].
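The projection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the growth rates compound year on year, and the function name `project_demand` and the 100 PJ baseline are hypothetical.

```python
def project_demand(base_2021, rate_2022=0.08, rate_later=0.032, end_year=2050):
    """Return {year: demand}, applying GDP-based growth to the 2021 baseline.

    Assumption: growth compounds year on year (8% in 2022, 3.2% per year
    from 2023 onwards). The source text does not state this explicitly.
    """
    demand = {2021: base_2021}
    for year in range(2022, end_year + 1):
        rate = rate_2022 if year == 2022 else rate_later
        demand[year] = demand[year - 1] * (1.0 + rate)
    return demand


# Hypothetical baseline of 100 PJ, used only to illustrate the growth path.
projection = project_demand(100.0)
for year in (2021, 2022, 2023, 2050):
    print(year, round(projection[year], 2))
```

If the dataset instead applies the growth rates to the 2021 baseline without compounding, only the loop body would need to change.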

Capital Costs
Capital cost data were collected from various sources, as summarized in Table 12. The projected costs were taken directly from the literature sources and were not calculated. All costs were converted to 2021 USD using the average euro-dollar exchange rate from [22] and the inflation rate based on [23].

Table 12 (partial)
…: [27, 30] and estimations described below
Industry demand: [27, 30, 31]
Residential demand: Estimations based on commercial prices in the Colombian market
Transport demand: [29, 32, 33]
Commercial and public demand: [30] and estimations based on commercial prices in the Colombian market

Table 13 (partial)
…: 14.91 [39], 10014000, 5043.08 [37], 29.61
Natural gas: 23.7 [40], 55316000, 45208.33 [11], 29

Table 14 (partial)
…: [28][29][30]
Transport and distribution: [30, 41] and estimations described below
Industry demand: [27, 30, 31]
Transport demand: Estimations based on [14]
Commercial and public demand: [30]

We addressed the lack of data by making the following estimations:

A. Transportation and distribution of fossil fuels were represented by single technologies to capture the capacity and cost of expansion. To account for the lack of information and to avoid the requirements of locations and distances, a method to quantify the cost per unit of capacity was implemented based on [34]. We multiplied the variable cost of transport and distribution by the total annual demand of the energy carrier, and then divided it by the total installed capacity for the reference year 2021. Eq. (2) shows the calculation for this process.

\[ \text{Capital cost} = \frac{\text{Variable cost} \times \text{Total annual demand}}{\text{Total installed capacity}_{2021}} \tag{2} \]

B. Capital costs of transport technologies were converted from USD per vehicle to USD per passenger-kilometre or USD per tonne-kilometre. We divided the unit cost of each vehicle type [32] by the product of the activity factor and the occupancy factor [18]. Eq. (3) shows the structure of this calculation.

\[ \text{Capital cost} = \frac{\text{Unit cost of vehicle}}{\text{Activity factor} \times \text{Occupancy factor}} \tag{3} \]

For battery electric vehicles (BEV), plug-in hybrid electric vehicles (PHEV), and fuel cell electric vehicles (FCEV), a conservative approach was assumed, considering cost parity with internal combustion engine (ICE) technologies by 2050 based on [32]. Complete data on transport costs are available at the repository in the Excel file SUPPLEMENTARY DATA in the sheet TRANSPORT COST.

C. The capital costs of residential technologies were estimated using the commercial prices on the webpages of the major retail companies in Colombia. Low efficiency appliances were categorized as C, D, E, F, and G according to the Technical Labelling Regulation (RETIQ, Spanish abbreviation), which is the national regulation that certifies the efficiency level of equipment in Colombia. High efficiency appliances belong to categories A and B according to the RETIQ. Eq. (4) presents the calculation to estimate the cost per unit of capacity. Complete data on residential costs are available at the repository in the Excel file SUPPLEMENTARY DATA in the sheet RESIDENTIAL COST.

\[ \text{Residential technology cost} = \frac{\text{Commercial price}}{\text{Appliance capacity}} \tag{4} \]
D. Due to the lack of available data for furnaces and boilers coupled with CCS technologies in the industry sector, capital costs were estimated using the differential costs in power plants. For instance, if the capital cost of a coal power plant is 2400 USD/kW, and the capital cost of a coal power plant with CCS is 4600 USD/kW, then the differential cost is 2200 USD/kW. This differential cost is added to the capital cost of a coal furnace and a coal boiler to represent the coal technologies with CCS in the industry sector.
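Estimations A, B, and D above are simple ratios and differences, so they can be sketched as small Python functions. The function names and the numeric inputs for the transport examples are hypothetical; only the 2400 and 4600 USD/kW figures come from the text. Units are assumed to be consistent within each call.

```python
def td_capital_cost(variable_cost, annual_demand, installed_capacity):
    """Estimation A: cost per unit of capacity for transport and distribution.

    variable cost x total annual demand / installed capacity in 2021.
    """
    return variable_cost * annual_demand / installed_capacity


def vehicle_capital_cost(unit_cost, activity_factor, occupancy_factor):
    """Estimation B: convert USD per vehicle to USD per pkm (or tkm).

    Assumes activity_factor is km per vehicle per year and occupancy_factor
    is passengers (or tonnes) per vehicle.
    """
    return unit_cost / (activity_factor * occupancy_factor)


def ccs_industry_capital_cost(base_cost, plant_cost, plant_cost_ccs):
    """Estimation D: add the power-plant CCS differential to an industry technology."""
    return base_cost + (plant_cost_ccs - plant_cost)


# Differential from the text: 4600 - 2400 = 2200 USD/kW.
differential = ccs_industry_capital_cost(0.0, 2400.0, 4600.0)
print(differential)

# Hypothetical furnace cost of 500 USD/kW, for illustration only.
print(ccs_industry_capital_cost(500.0, 2400.0, 4600.0))

# Hypothetical car: 20000 USD, 15000 km/year, 1.5 passengers.
print(round(vehicle_capital_cost(20000.0, 15000.0, 1.5), 4))
```

Keeping each estimation as a separate function makes it easy to swap in the values from the TRANSPORT COST sheet without touching the arithmetic.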

Fixed Costs
Fixed cost data were gathered from several sources, as summarized in Table 14 . The projection of costs was taken from the same literature sources and was not calculated. All costs were converted to 2021 USD using the average euro-dollar exchange rate from [22] and the inflation rate based on [23] .
In transport and distribution technologies, we assumed a fixed cost of 3% of the capital cost annually. For ICE and FCEV passenger vehicles, a fixed cost of 3% of the capital cost was considered, while for BEV and PHEV, it was assumed to be 1%, except for buses and microbuses. For buses, microbuses, and freight transport, a fixed cost of 2% of the capital cost was used for all technologies. These assumptions were based on [14] . For other transport modes, we assumed a fixed cost of 3% of the capital cost. For furnaces and boilers coupled with CCS technologies in the industry sector, we applied the same consideration described previously, and fixed costs were estimated using the differential costs in analogous power plants.
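The fixed cost assumptions above amount to a small lookup rule. The sketch below encodes them in Python; the category and vehicle labels (`"road_transport"`, `"bus"`, and so on) are hypothetical names chosen for this example and are not the technology codes used in the dataset.

```python
BUS_OR_FREIGHT = {"bus", "microbus", "freight"}


def fixed_cost_share(category, vehicle_type=None, powertrain=None):
    """Return the annual fixed cost as a fraction of the capital cost."""
    if category == "transport_distribution":
        return 0.03
    if category == "road_transport":
        # Buses, microbuses, and freight use 2% for every powertrain.
        if vehicle_type in BUS_OR_FREIGHT:
            return 0.02
        if powertrain in {"BEV", "PHEV"}:
            return 0.01
        if powertrain in {"ICE", "FCEV"}:
            return 0.03
    if category == "other_transport":
        return 0.03
    raise ValueError(f"No fixed cost rule for {category}, {vehicle_type}, {powertrain}")


def fixed_cost(capital_cost, category, vehicle_type=None, powertrain=None):
    """Annual fixed cost in the same currency unit as capital_cost."""
    return capital_cost * fixed_cost_share(category, vehicle_type, powertrain)


print(fixed_cost_share("road_transport", "car", "BEV"))
print(fixed_cost_share("road_transport", "bus", "BEV"))
```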

Variable Costs
Variable cost data were gathered from several sources, as summarized in Table 15. The projected data of fossil fuel imports were obtained from an assessment by UPME for the period between 2021 and 2037, under the reference scenario [5]. For the period 2038-2050, the data were extrapolated using a linear trend. Domestic production costs of primary energy resources were assumed to be constant. We also included a variable cost of transport and storage of CO2 equal to 36.1 US$/t [7], added to CCS technologies to represent the financial cost of CO2 infrastructure. For furnaces and boilers coupled with CCS technologies in the industry sector, we applied the same consideration described previously, and variable costs were estimated using the differential costs in analogous power plants. All costs were converted to 2021 USD using the average euro-dollar exchange rate from [22] and the inflation rate based on [23].

Table 16. List of sources for operational lifetime data.
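The linear extrapolation of import costs for 2038-2050 mentioned in this section can be sketched as follows. This is an illustration under stated assumptions: the trend is fitted by ordinary least squares over the whole 2021-2037 series (the text does not say which years were used for the fit), the function names are hypothetical, and the cost series is synthetic rather than UPME data.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope and intercept for values against years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(years, values, target_years):
    """Extend a historical series to target_years along its linear trend."""
    slope, intercept = linear_trend(years, values)
    return {year: intercept + slope * year for year in target_years}


# Synthetic series (not UPME data): 5.0 in 2021, rising by 0.1 per year.
history_years = list(range(2021, 2038))
history_costs = [5.0 + 0.1 * (year - 2021) for year in history_years]

projected = extrapolate(history_years, history_costs, range(2038, 2051))
print(round(projected[2038], 3), round(projected[2050], 3))
```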

Emissions Factors
The emission factor data were gathered from [6], and the calculations for finding the equivalent emission factors in terms of CO2e were conducted using the global warming potentials (GWP) described by [43]. Eq. (5) shows the structure of the calculation mentioned. For the modelling approach, emissions were calculated as a product of the technology activity level and the emission factor, thus the emission factor by technology will depend on the efficiency of the technology. Eq. (6) presents the basic calculation to find the emission factor by technology. Biomass resources were considered carbon neutral [44], and bioenergy technologies coupled with CCS (BECCS) were allocated the respective negative emissions.
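The two calculations described in this paragraph can be sketched in Python. Several points are assumptions made for illustration: the GWP values are the 100-year IPCC AR5 figures and may differ from those in [43]; the 90% capture rate is applied to the CO2 component only; and the fuel emission factors and the efficiency of 0.5 are illustrative numbers, not values from the dataset.

```python
# Assumed 100-year GWP values (IPCC AR5); the dataset uses the GWPs from [43].
GWP_100 = {"CO2": 1.0, "CH4": 28.0, "N2O": 265.0}


def equivalent_emission_factor(factors, gwp=GWP_100, co2_capture=0.0):
    """Weight each gas by its GWP and sum to a CO2e factor.

    co2_capture is the share of CO2 removed by CCS (0.9 in the text);
    it is applied to the CO2 component only, which is an assumption.
    """
    adjusted = dict(factors)
    adjusted["CO2"] = factors.get("CO2", 0.0) * (1.0 - co2_capture)
    return sum(adjusted[gas] * gwp[gas] for gas in adjusted)


def technology_emission_factor(equivalent_factor, efficiency):
    """Emission factor per unit of technology output: fuel factor / efficiency."""
    return equivalent_factor / efficiency


# Illustrative fuel factors in kg per GJ of fuel input.
fuel = {"CO2": 56.1, "CH4": 0.001, "N2O": 0.0001}

eq_factor = equivalent_emission_factor(fuel)
eq_factor_ccs = equivalent_emission_factor(fuel, co2_capture=0.9)
print(round(eq_factor, 4), round(eq_factor_ccs, 4))
print(round(technology_emission_factor(eq_factor, 0.5), 3))
```

Dividing by the efficiency expresses the factor per unit of output, so a less efficient technology carries a higher emission factor for the same fuel.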
\[ \text{Emission factor by technology} = \frac{\text{Equivalent emission factor}}{\text{Technology efficiency}} \tag{6} \]

Operational Lifetimes

Table 16 summarizes the data sources used for gathering operational lifetime values. When data were unavailable, reasonable values were assumed based on similar technologies. The lifetimes for road transport technologies were adjusted considering the lack of regulation for maximum age and the average ages of vehicles in Colombia [45].

Table 17
List of sources for efficiency data.

Efficiencies
Efficiency data were gathered from different sources, as summarized in Table 17. For non-hydrogen industry technologies and cooking technologies, the efficiencies were estimated from the national useful energy balance [3]. For furnaces and boilers coupled with CCS technologies in the industry sector, efficiencies were estimated using the differential efficiency in analogous power plants. In transport and distribution technologies of hydrogen and fossil fuels, we assumed an efficiency of 1 owing to a lack of data. For blending technologies, the mix percentage was set at 10% for both bioethanol and biodiesel [48].

Capacity Factors
The average annual capacity factors of solar PV and onshore wind technologies were estimated using the generation and capacity information in the period 2015-2021 [50] . The annual power generation reported is divided by the theoretical power generation assuming that the installed capacity works 100% of the time, as described by Eq. (7 ). The capacity factors for other power and conversion technologies were obtained from reports and literature assessments as summarized in Table 18 . We assumed that end-use technologies are fully available to supply the demand and thus capacity factors are equal to 1.
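The capacity factor calculation described above can be sketched as follows. The function names and the sample records are hypothetical, and taking the simple mean of the annual values over the period is an assumption, since the text does not specify how the 2015-2021 average was formed.

```python
HOURS_PER_YEAR = 8760


def capacity_factor(annual_generation_mwh, installed_capacity_mw, hours=HOURS_PER_YEAR):
    """Reported generation divided by generation at continuous full output."""
    return annual_generation_mwh / (installed_capacity_mw * hours)


def average_capacity_factor(records):
    """Simple mean of annual capacity factors.

    records is a list of (annual_generation_mwh, installed_capacity_mw) pairs.
    """
    factors = [capacity_factor(gen, cap) for gen, cap in records]
    return sum(factors) / len(factors)


# Hypothetical plant data, for illustration only.
print(capacity_factor(219000.0, 100.0))
print(average_capacity_factor([(219000.0, 100.0), (306600.0, 100.0)]))
```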

Residual Capacities
Installed capacity of power plants was collected from the market operator XM for centralized generation [51] and from the Promotion and Planning Institute for Energy Solutions (IPSE) for decentralized energy [52]. Fossil fuel processing and refining capacities were gathered from different sources, as shown in Table 19. Transport and distribution installed capacities are summarized in Table 13. Installed capacity of power transmission and distribution was obtained from [14], and installed capacity of recharging stations was estimated by considering 491 chargers with 50 kW each [53]. For end-use technologies, we estimated the residual capacities in 2021 by assuming full use of installed capacity to supply demand. Eqs. (8) and (9) present the calculation of the technology residual capacity from the energy consumed by the technology and the … Eqs. (9) and (10) are the same for cargo transport technologies, using units of Gtkm/year.

Regarding the projected residual capacities, we included the planned power plants from the renewable energy auctions in 2019 and 2021 [56, 57], the Hidroituango project, and other committed projects [58]. Power plant phase-outs were not considered because the information was unavailable, and other residual capacities for conversion and transport-distribution technologies were assumed constant. For end-use technologies, we used simplified mortality lines, where residual capacity decreases linearly according to the operational lifetime until reaching zero, as shown by Eq. (11). In the industry sector, we considered constant residual capacity until 2025 and then the mortality line was applied. For other demand sectors, the reduction in installed capacity started from 2022. The calculations depend on the assumption of capacity factors equal to 1 in the end-use technologies. Improved estimations of installed capacities and projections are possible if technology inventory data are available.
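The simplified mortality line for end-use technologies can be sketched as below. The exact form of Eq. (11) is not reproduced in the text, so the linear rule used here (a loss of 1/lifetime of the 2021 capacity per year after the last year at full capacity) is an assumption, as are the function name and the sample values.

```python
def residual_capacity(base_capacity, lifetime_years, year, last_full_year=2021):
    """Linear mortality line for the residual capacity of a technology.

    Capacity stays at base_capacity up to last_full_year, then falls by
    base_capacity / lifetime_years each year until it reaches zero.
    Use last_full_year=2025 for industry and 2021 for other demand sectors.
    """
    if year <= last_full_year:
        return base_capacity
    remaining_share = 1.0 - (year - last_full_year) / lifetime_years
    return base_capacity * max(0.0, remaining_share)


# Hypothetical technology: 100 units installed in 2021, 10-year lifetime.
for year in (2021, 2022, 2026, 2031, 2040):
    print(year, round(residual_capacity(100.0, 10, year), 2))

# Same technology in the industry sector: constant until 2025.
print(round(residual_capacity(100.0, 10, 2026, last_full_year=2025), 2))
```

Clamping the remaining share at zero keeps the series non-negative once the operational lifetime has elapsed.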

Annual Potentials and Reserves
Crude oil and natural gas reserves in 2021 were 2039 Mbl and 3.2 Tcf, respectively [59]. We considered projected reserves for the period 2021-2050 equal to 5704 Mbl and 10.9 Tcf based on the intermediate scenarios of future availability of fossil fuels assessed by UPME [10, 11]. These values are highly uncertain but are conservative when considering the historical incorporation of new fossil fuel reserves in the past 14 years [59]. Solar PV and onshore wind estimations considered regional energy potential and availability of land for power plant deployment [15]. Hydro potential was obtained from a national estimation [14]. Geothermal potential was estimated based on hot springs [17] and did not consider reconversion of oil wells to geothermal wells. The national roadmap of offshore wind energy was used to obtain data on potential installed capacity of the technology [16]. Biomass primary supply was estimated using data from [14], assuming a potential land area of 1000 kha and the present energy crop yields. Fuelwood was limited using the data available for a crop of Eucalyptus. Bagasse potential was estimated using a residue-to-product ratio of 0.31 with respect to sugarcane potential. Table 20 summarizes the data used in the biomass estimations.

Ethics Statement
Not applicable.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Techno-Economic Energy Dataset for Open Modelling of Decarbonization Pathways in Colombia (Reference data) (Mendeley Data).