The importance of open data and software: Is energy research lagging behind?

Energy policy often builds on insights gained from quantitative energy models and their underlying data. As climate change mitigation and economic concerns drive a sustained transformation of the energy sector, transparent and well-founded analyses are more important than ever. We assert that models and their associated data must be openly available to facilitate higher quality science, greater productivity through less duplicated effort, and a more effective science-policy boundary. There are also valid reasons why data and code are not open: ethical and security concerns, unwanted exposure, additional workload, and institutional or personal inertia. Overall, energy policy research ostensibly lags behind other fields in promoting more open and reproducible science. We take stock of the status quo and propose actionable steps forward for the energy research community to ensure that it can better engage with decision-makers and continues to deliver robust policy advice in a transparent and reproducible way.


Introduction
For nearly a century, the global energy system has remained remarkably stable, powered largely by fossil fuel combustion. However, successfully addressing anthropogenic climate change with low-carbon technologies requires that we fundamentally alter energy supply and demand in the 21st century, yet the pathway and outcomes of this transformation are highly uncertain. For example, rapid improvements in solar photovoltaics and batteries coupled with information technology may point towards a more distributed energy system with its design actively shaped by consumers. Alternatively, large-scale technologies like nuclear, biomass, carbon capture and storage or wind may extend the dominance of a centralised power system.
Given the uncertainty and complexity of the energy system, quantitative models are one of the few available tools that allow analysts to explore alternative scenarios and help guide public policy. Quantitative analysis from energy models underpins much of academic research and energy policy-making (Strachan et al., 2009). Yet most models and data relied upon by utilities, consultancies and public research institutes remain inscrutable "black boxes", whether econometric models with a small number of parameters, or large linear optimisation models with hundreds of thousands of input variables. In contrast to closed models, "open" models imply that anyone can freely access, use, modify, and share both model code and data for any purpose (Open Knowledge Foundation, 2015). Here, we (1) argue why energy data and models urgently need to become open; (2) discuss the key reasons why many are currently not; (3) examine whether energy research is lagging behind other fields in becoming more open; and finally (4) outline specific issues for individuals to consider and propose next steps for the energy research community.

Why models and data should be open
Given the critical guidance that energy models and data provide to decision makers, they should be made open and freely available to researchers as well as the general public. There are four specific reasons for this:
1. Improved quality of science. Fundamental scientific principles such as transparency, peer review, reproducibility and traceability are almost impossible to implement without access to models and data (DeCarolis et al., 2012; Nature, 2014). Better adherence to these principles leads to higher quality science. Researchers are fallible human beings and errors are inevitable under pressure to deliver. Such mistakes can have profound implications. For example, the Reinhart-Rogoff spreadsheet error arguably skewed the international debate on austerity (Herndon et al., 2014). Such incidents serve as warnings against poor programming practices, such as a lack of auditing, as well as against closed models and data: it was only through sharing the spreadsheet that the errors were discovered.
2. More effective collaboration across the science-policy boundary. Better and more transparent science ought to enable better policy outcomes, but the issue is more complex than that. Academic peer review routinely does not (and cannot) check model arithmetic and data validity, just that the analytical approach is appropriate. A separate process of quality assurance (QA) is required to verify and validate model mechanics and output. While mostly absent from academic practice, this is often implemented as a formal procedure in government (DECC, 2015). The reason for this is that unlike academics, governments, private companies and NGOs often model for numbers rather than insight. The specific numbers can be of great societal importance, such as the level at which to set subsidies or the cost of specific policies. Thus, in many cases, the most important aspect is the quality or transparency of input data, rather than the novelty of the modelling methodology. In large datasets used in government decision-making, traceability and referencing can become major problems, as civil servants developing models and data are often not trained scientists. Openly available, collaboratively developed datasets and reference models would allow the burden of this work to be shared more widely, and across both academia and government. There is a growing sense that the link between energy modelling and policy needs fundamental rethinking (Strachan et al., 2016), and opening up models and data will play a crucial role in enabling the transparency and better quality assurance necessary for this to happen.
3. Increased productivity through collaborative burden sharing. Collecting data, formulating models and writing code are resource-intensive. Research funding is limited and researcher time is a scarce resource. Society as a whole saves time and money if researchers avoid unnecessary duplication and learn from one another. Individual researchers gain more time to spend on pressing research questions rather than redundant work on model or dataset development. Furthermore, research only matters if it is seen and used, and open-access publishing has been shown to increase readership and citations (McCabe and Snyder, 2014). Since openly shared code or data is more likely to be known to others, it is more likely to be used and further improved. Not only does this benefit the original researcher through peer recognition and academic credit, but it also moves the research community as a whole forward.
4. Profound relevance to societal debates. Reengineering the energy landscape will affect everyone, producing winners and losers. A balanced societal and political debate requires transparent arguments based on scientific justifications, but escalating concern about reproducibility in some fields is shaking public confidence in scientific research (Goodman et al., 2016).
Finally, besides the practical considerations outlined above, there remains the ethical argument that research funded by public money should be available to the public in its entirety.

Why models and data are mostly not open
Despite these arguments, we see four main reasons why closed models and data may remain attractive and rational in some cases:
1. There is a range of valid ethical and security concerns, particularly in the case of data. Researchers may have access to sensitive commercial data or to data containing personal information (particularly relevant when moving towards more decentralised smart grids with their focus on individual households). The aspiration to open up as much data as possible may give way to a more regulated approach to open data if individual researchers increasingly cross ethical boundaries, as in the recent release of personal data about users of a major online dating website (Resnick, 2016). Setbacks in the wider open data movement could also have repercussions on the use of information perceived as sensitive in the energy modelling context, e.g. data on energy consumer behaviour or on grid infrastructure.
2. Openly sharing details of models, analysis and data can create unwanted exposure. Flawed code or data can discredit research results and cause embarrassment to their authors, but only if they are visible. Indeed, a reluctance to share data was shown to be associated with weaker evidence (Wicherts et al., 2011). Furthermore, there may be a fear that inexperienced researchers use an open model or open data to produce flawed analysis that reflects poorly on its original authors. There is also a policy dimension: government departments may choose to keep information closed precisely because of the potentially serious impact it may have on a country's economy and society, rather than opening the models and data to enable a more transparent political and societal discussion. For example, while the UK Department of Energy and Climate Change (DECC) are working with University College London to develop an open source UK TIMES energy system optimisation model (UCL, 2014), political sensitivities mean that code and data will not be released until its use in a major policy analysis (the UK's 5th carbon budget) is complete (Sargent, 2016).
3. It is time-consuming to write legible and reusable code, track data provenance and processing steps, document models and data and respond to feature requests or bug reports. Because model and dataset development are large investments, it is often rational for researchers and institutions to maintain "trade secrets" to compete in consulting work and third-party research funding. On the one hand, this can be seen as a classical collective action problem where individual actors are trapped in a suboptimal non-cooperative equilibrium. But, as discussed further below, the incentive structure that gives rise to this bargaining problem is also linked to institutional issues within academia, particularly the unrelenting pressure to publish ever-greater quantities of high-quality publications which underlies most academic career incentives and impact metrics (Sarewitz, 2016). A significant share of energy modelling is done in the private sector, in utilities, consulting firms, and financial institutions, where the need to protect the intellectual property within models and data is certainly more pressing than in academia. Nevertheless, where private sector modelling is used to inform public policy and/or where it is funded by public money, we believe the long-term goal should be for models and data to be open, even if this would challenge consultancies' established business practices. While examples of successful open-source businesses exist (e.g. RedHat or Canonical in the Linux world), it is clear that working business models can be difficult to find, especially in the energy field with the added difficulty of balancing commercial and academic principles. The "share alike" clause in licenses like the GPL (see below) may offer opportunities for companies here. Furthermore, private companies and consulting firms are also selling their expertise: energy models must be adapted for specific analyses, and the real value arguably comes from the application of judgement and expertise to adapt and apply the models in a way that produces useful insight.
4. Finally, there is simple institutional and personal inertia, often alongside complex and uncoordinated institutional setups. For example, energy models and datasets are developed and applied by different agencies within the US government with no consistent and coordinated policy towards data and model availability. On the one hand, the federal government has placed emphasis on making data publicly available through a web portal. With regard to energy data, this has included the openEI portal and a listing of all open datasets within the Department of Energy. On the other hand, individual agencies have long-running practices that do not necessarily align with these high-level intentions. The National Energy Modeling System (NEMS) is publicly available, but its owner, the Energy Information Administration, ostensibly discourages its use by asserting that "Most people who have requested NEMS in the past have found out that it was too difficult or rigid to use" (EIA, 2016). The US Environmental Protection Agency employs closed-source computable general equilibrium models for use in economic analysis of climate policy. These are developed not by academic institutions but by private-sector consultants, which in turn must protect their own business interests (US EPA, 2016).
All of these factors are understandable from the perspective of individual actors, but collectively they engender a sense of mistrust in complex, impenetrable models and enigmatic datasets. For example, the European Commission faced criticism for using the proprietary PRIMES model to deliver key results for its Energy Roadmap 2050 (Helm et al., 2011). More significantly, the UK's decarbonisation was arguably delayed for years by models that underestimated the scale of the challenge due to opaque and heroically optimistic cost assumptions for onshore wind (House of Lords, 2005).

Is openness in energy research lagging behind other fields?
In the face of such criticism, efforts are underway to make data and models more transparent. In the policy and government decision-making sphere, the UK Department of Energy and Climate Change (DECC) pioneered a simple open modelling approach through their 2050 Carbon Calculators (DECC, 2013), which is now being replicated elsewhere. Ongoing but incomplete plans to open up UK TIMES were discussed above, as well as open data efforts in the United States.
Legislation in Europe is evolving; for example, the EU Regulation on Wholesale Energy Market Integrity and Transparency (REMIT) requires participants to publish electricity market data to thwart insider trading and market manipulation. Also in Europe, reference models and scenarios are facing increased scrutiny and criticism, but it is as yet unclear whether this will lead to fundamental changes in how policy analysis is performed. The recent controversy over the European Commission ignoring its own quantitative analysis when selecting emissions reduction targets has pushed the Commission to publish a communication justifying the selection of discount rates in its impact assessments (European Commission, 2014). The standard grant agreement of the EU-funded H2020 projects requires open access to all peer-reviewed scientific publications relating to the results of the project. A similar requirement for research data is optional, however (European Commission, 2016a). Some software produced by Commission services is now open sourced under the European Union Public License (EUPL), but there are mixed signals. The closed-source PRIMES model is being replaced by POTEnCIA, which was intended to be more transparent (European Commission, 2016b). However, while the database providing some of its input data will be made publicly available (Joint Research Centre, 2016), the new model's code and actual model database will still remain closed.
The German government, on the other hand, has taken a leading role. It sees open energy modelling as an important ingredient for high-quality scientific energy policy advice and has publicly committed itself to open software and data. Two flagship projects are funded by the federal government: SciGRID is developing an open-source transmission grid model based on OpenStreetMap data, while the Open Power System Data project is building an online data platform for free and open data for power system modelling, including data on power plant capacities and locations and renewable production time series from transmission system operators. Amongst academic energy system models, BALMOREL (Ravn et al., 2001) is an early example with limited geographical scope, while more recent open models like OSeMOSYS (Howells et al., 2011) and Temoa (Hunter et al., 2013) were followed by a flourishing of activity. In 2014, the Open Energy Modelling Initiative was established, which now lists more than twenty open models of power networks, electricity markets and energy systems. However, the vast majority of energy research published in the peer-reviewed literature still makes neither code nor data openly available. The breadth of model development efforts underway underscores the fact that because of newly emerging modelling challenges, there is a window of opportunity for new, collaborative and open modelling and data gathering efforts to carve out new niches. For example, the accurate characterisation of renewable resources requires new kinds of datasets with high spatial and temporal resolution. Behavioural responses by consumers to prices and incentives will likely play an increasingly important role during the transition to a low-carbon energy system.
Different sectors, such as electricity, heat and transportation, may become more tightly connected in the future with the deployment of heat pumps and electric vehicles, and therefore require new and more complex types of coupled models.
In contrast to monodisciplinary research (e.g., particle physics simulation) where the models and data are more standardised, energy research has the added challenge that model formulation and data require significant researcher judgment. Unlike the laws of physics, there are no universal governing principles guiding real-world energy systems development. Furthermore, the data used to parameterise energy models are heterogeneous, widely varying in terms of quality, and in some cases lacking altogether. Other applied fields share this issue. It is therefore not surprising that monodisciplinary fields are further ahead with collaboratively developed software packages and standard community databases. For example, in genetics, there are open software papers going back more than a decade, some garnering tens of thousands of citations, underlining their key importance for the research community (e.g., Altschul et al., 1990; Hunter et al., 2013). As open databases become more widely used, for example the GenBank sequence database of all publicly available nucleotide sequences and their protein translations (Benson et al., 2013), depositing results in such databases can be made a requirement by journals. While top journals in many monodisciplinary fields now have policies requiring the release of data, software, and other information required to replicate published results, Energy Economics is the only major energy journal to have put such policies in place.
Despite the efforts in energy research and policy outlined above, our field appears to be lagging behind even other applied fields. For example, in climate science there are coordinated efforts like the Community Earth System Model (CESM, see Hurrell et al., 2013), which has a large body of peer-reviewed work extending, validating, and using it. The public health and biomedical research communities have over the last two decades progressed both on software (McDonald et al., 2003; Schindelin et al., 2012) and on data (Boulton et al., 2011), including the challenge of anonymising personally identifiable information (Lawlor and Stone, 2001). Funding bodies are an important enabling factor here: the Wellcome Trust is at the forefront of requiring and supporting open research (Kiley, 2016). Such efforts are important prerequisites to thinking about how open code and data work within the practical constraints of a specific field. For example, while making code and data available openly is an important first step towards higher quality, more transparent, and reproducible research, it is followed by difficult questions about how to truly achieve these goals, in particular for research results relying on very large datasets or computationally intensive analyses that require days or months to complete.
There is no reason to believe that energy research cannot rise to the challenge. For example, the U.S. Pacific Northwest National Laboratory recently released the Global Change Assessment Model (GCAM) to the public after 20 years of development (Joint Global Change Research Institute, 2015), demonstrating that even software coming out of large and complex research projects with a long legacy can be made openly available if there is the will to do so.

Policy recommendations
First, individual researchers and research groups need to understand the practicalities of open code and data, in particular:
1. Consider the intended target audience. For example, if the audience is other modellers, releasing code with a clean and documented interface might be the main priority. If it is policymakers, attention to clearly documenting input data and assumptions and allowing reproducibility may well take precedence over code itself.
2. Decide what and how much to publish openly. Researchers should not feel forced to take a binary decision of open versus not open. Publishing parts of a codebase or a dataset is better than publishing nothing at all, and can be seen as part of an exploratory process in deciding what and how much to publish openly depending on the goals and contexts of specific analyses.
3. Decide on a license and a distribution channel. The first step is often to check who owns intellectual property rights, which is likely the researcher's employer. For licensing, the main decision is between a permissive license (such as the MIT license for code or CC-BY license for data) or a copyleft or "share-alike" license (such as GNU GPL or CC-BY-SA). This can be a difficult decision to make but much guidance exists (e.g. Morin et al., 2012). In addition, energy research often relies partly on commercial tools such as GAMS or Vensim. While this limits the choice of license (e.g. the GPL license is technically incompatible with a model relying on a closed-source environment to run), models built on such tools can still be openly licensed and made available. This is already common practice in the science and engineering fields, with numerous open projects written for commercial software such as Matlab. Finally, the choice of distribution (e.g. via an institutional website or an established platform like GitHub) also influences how a project is perceived and what kinds of community interactions will likely result.
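As a purely illustrative sketch of how the licensing and provenance points above might look in practice, a released data file can be accompanied by a small machine-readable record stating its license, its source, and a checksum of the published version. The function name `describe_release`, the field names, and the example data are all hypothetical and do not follow any particular community standard; the license string is an SPDX identifier.

```python
import hashlib
import json


def describe_release(name, payload, license_id, source):
    """Build a small metadata record for one released data file.

    `payload` is the file content as bytes; the checksum lets others
    confirm they are working with exactly the published version.
    All field names here are illustrative, not a formal standard.
    """
    return {
        "name": name,
        "license": license_id,  # SPDX identifier, e.g. "CC-BY-4.0" or "MIT"
        "source": source,       # where the raw data came from
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
    }


# Hypothetical example data: one row of an invented capacity table.
csv_bytes = b"plant,capacity_mw\nexample_wind_farm,120\n"
record = describe_release(
    name="example_capacities.csv",
    payload=csv_bytes,
    license_id="CC-BY-4.0",
    source="hypothetical survey, for illustration only",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Publishing such a record next to the data is one low-effort way to make the chosen license explicit and to let third parties check that they are replicating results with the same inputs; it is not a substitute for fuller documentation.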
It is important to realise that all of these choices come with both potential benefits and risks. In addition, existing energy modelling may rely on (commercial) software without even any code to share, and data in proprietary commercial data formats. With the high importance of policy decisions taken based on such work and the increasing availability of open-source alternatives, we believe it is necessary for the field to increasingly shift to those options. The Open Energy Modelling Initiative is developing an evolving working paper to help individual researchers tackle these issues (Pfenninger et al., 2016).
Equally important, the energy research community as a whole needs to think about the hard problems around transparent research, open code and data. In particular, we believe the community needs to consider three issues:
1. Reduce parallel efforts and duplication of work. Despite an influx of new open data and software projects, the energy modelling landscape is still fragmented. There is a risk that different modelling and data collection efforts remain insular, and thus that the true benefits of openness are not realised. For example, while open code and data enable third party verification, they do not advance the state of modelling if no one tries to replicate results or directly build upon released code.
2. Progressively consolidate from the top down and the bottom up. Joint top-down coordination efforts could include the development of common datasets, community standards to ensure interoperability, and coordinated efforts to enable third-party verification of model-based results. There are examples of community-building science projects spanning several fields, such as rOpenSci (Boettiger et al., 2015), that show such efforts can succeed. Yet getting from a blank slate to working community efforts may face too much initial inertia to get started. There are many one-off analyses created for specific papers, or code that is written with the understanding that it will never be made public or see widespread use, and is thus poorly documented and structured. To bridge the gap between such barely re-usable code and well-maintained and documented community codebases, Varoquaux (2015) argues for progressive consolidation of released code as a collective undertaking within a given research community, thus letting community projects grow organically from the bottom up.
3. Work towards changing incentives. In parallel, the energy research community must engage with other stakeholders to ensure institutional and academic recognition for open energy models, and to start tackling the harder problems of transparent, reproducible analyses. There are currently various efforts across different scientific fields to give credit for data and software, but early and mid-career researchers in particular are still subject to the current system of academic credit. Open and transparent research is not currently incentivised: in fact, the opposite is often perceived as advantageous for scientific career advancement. Changing these incentives will require efforts not only from researchers themselves but also from their employers, from grant agencies, and other stakeholders like publishers (Nosek et al., 2015), not all of which may have the same interests. Because of the blurred boundary between academic research and consulting work in the energy field, this coordination effort is likely to be more challenging than in other fields. This may require a cultural shift away from coveting the basic tool (be it an energy systems model or a scanning electron microscope) in favour of the expertise and insight required to use the tool effectively.
Science is an inherently collaborative endeavour. If every particle physics research group had to build their own accelerator, the scientific breakthroughs enabled by projects like CERN would not be possible. Open code and data put deeper and more meaningful collaboration within reach. Energy modelling is not basic science: it is uncertain and the decision stakes are high, putting it squarely within the realm of post-normal science (Funtowicz and Ravetz, 1993). For example, energy system models, alongside wider integrated assessment models, form a significant part of the analysis informing multinational agreements like the UNFCCC and binding national climate policies. They are thus a driving force behind policy targets that affect the future livelihoods of billions worldwide, produce economic winners and losers, and generate heated public discourse. Given the importance of rapid global coordinated action on climate mitigation and the clear benefits of shared research efforts and transparently reproducible policy analysis, openness in energy research should not be pursued for the sake of having some code or data available on a website, but as an initial step towards fundamentally better ways to both conduct our research and engage decision-makers with models and the assumptions embedded within them.