A conceptual framework for evaluating cooking systems


 Provision of household energy is a major global challenge, and realizing the health, environmental, convenience and economic benefits that can come from improved energy services requires altering a complex situation. We describe a Conceptual Framework to guide stakeholders and consumers of information in evaluating performance of interventions. Programs act on cooking or kitchen systems (systems of action) in order to alter the consequences for larger systems in which they are embedded (systems of desired impact). These larger systems also influence the behavior of the systems that they contain. The relationship among systems is formalized by identifying required elements of a theory of change, including a performance metric that represents the system of action and connections to the system of desired impact. A series of 12 questions guides stakeholders in (i) quantifying what needs to be known, and how well; (ii) identifying evaluation approaches suitable for a given situation; and (iii) interpreting evaluation data and making claims about outcomes. A supplemental example illustrates thought processes and tradeoffs while navigating the 12 questions.


Introduction
Providing household energy of adequate quality and quantity is a major global challenge, as reflected in discussions on energy access, gender impacts, and health (Bouzarovski and Petrova 2015). The United Nations Sustainable Development Goal 7 expresses a need for 'affordable, reliable, sustainable and modern energy for all' as well as improved infrastructure and technology for delivery (United Nations 2021). Cooking is a basic human need, yet around 2.6 billion people worldwide still rely on solid fuels, such as wood, dung, agricultural residues, coal, and charcoal (IEA, IRENA, UNSD, World Bank 2021, Stoner et al 2021). These fuels and devices may degrade natural resources, expose users to high levels of emissions that affect health and climate (Naeher et al 2007, Lacey et al 2017, Kodros et al 2018), and consume more money, time and effort than more effective alternatives (Sagar 2005, Jeuland and Pattanayak 2012).
It is tempting to believe that clean and efficient cooking can be easily achieved by changing stoves, fuels and kitchens. One challenge is confirming that interventions meet targets in terms of service level, cleanliness, fuel consumption, user preferences, and gender impacts (Barnes et al 1994, Ezzati and Kammen 2002, Brooks et al 2016). These confirmations come about through performance testing that gives information about fuels or stoves, or through project evaluation, which assesses the outcome of interventions. However, cooking systems function in complex situations. The task of cooking combines one or more cooking devices, their associated fuels, foods prepared within specific vessels, and operational sequences. These elements are immersed in and affected by the surrounding kitchens or cooking places, types of housing, household dynamics, community structures, and cultures and cooking practices. Although the cooking system has complex relationships with its surroundings, evaluation often focuses on discrete systems. Procedures for evaluating household energy interventions are widely used (Visser 2005, Makonese et al 2012, Arora et al 2014, ISO TC 285 2018). There has also been work to illuminate how technical performance or programmatic outcomes relate to social or environmental goals (Glasgow et al 1999, Quinn et al 2018, World Health Organization 2018, 2019). However, household energy systems are complex (Malhotra et al 2004, Guruswamy 2015, Jagadish and Dwivedi 2018), and a structured approach has not yet accounted for this complexity. Hence, there is little guidance for identifying and choosing evaluation procedures that fit specific contexts and purposes, and determining when claims of performance outcomes can be made with confidence. Thus, even widely accepted testing or evaluation procedures may be selected without full attention to their contexts or limitations (Abdelnour et al 2020).
Conceptual frameworks have appeared in several fields (Gartner 1985, Hobbs and Norton 1996, Seuring and Müller 2008) to describe systems that interact with and influence each other. This document presents such a conceptual framework for evaluating cooking systems. Its intention is not to answer every stakeholder's question, nor to comment on any particular evaluation protocol, but rather to present a systematic framework that identifies key concepts for assessing performance and drawing implications. This presentation begins by delineating nested systems that affect and are affected by cooking systems. We use this context to describe the relationship between desired impact and programmatic action. A series of 12 questions is provided to guide stakeholders through the tasks of identifying appropriate performance metrics and targets, determining when a particular testing approach is suitable for a given situation, and interpreting measured performance data. Finally, we reflect on the implications for change and its evaluation in complex systems. For readers who prefer practical guidance rather than conceptual discussion, we provide specific descriptions of the nested systems (appendix A available online at stacks.iop.org/ERL/17/031002/mmedia), a flow chart to outline supportable conclusions (appendix B), a case study that answers the 12 questions in a common situation (appendix C), and a more complex example that builds on the case study situation (appendix D).

Evaluation in nested systems
Figure 1 shows how systems directly related to cooking are nested within larger systems. Such communicating, nested systems are sometimes called 'holons' (Koestler 1970, Edwards 2005, Bland and Bell 2007).
In the framework presented here, each system is delineated by a perceived sphere of influence. The cooking system (system 1) includes a cooking device, operating controls, a transformation mechanism for a fuel or energy source, and a place for a cooking vessel. This device is created by a manufacturer, who attempts to deliver design-related performance. The cooking system operates within a kitchen environment (system 2), where a user operates the controls, adds fuel, and employs a cooking vessel. Performance in the kitchen is within the sphere of the stove user. The kitchen environment must serve the needs and work within the constraints of the household (system 3); the household is part of a community with culture, local markets, and political leadership (system 4). Local or national authorities address environmental challenges and health within a region (system 5) and international or multinational organizations attempt to address concerns that affect the entire globe (system 6). Although this work focuses on evaluation, these definitions would also aid in defining the distribution of costs and benefits.
Figure 1. Nested systems (holons) related to cooking and its impacts, with numbers from low to high representing smaller to larger systems. Larger systems govern inputs to smaller systems ('external influences'); smaller systems affect the systems in which they are embedded ('exported consequences').

External influences
Even when a party is responsible for a system, that party does not have full control over its performance. Larger systems influence smaller systems by constraining factors that lie outside, but affect the performance of the smaller embedded system. For example, a manufacturer cannot control the quality of fuel used in a stove or the user's operational choices, and occupants cannot prevent the infiltration of polluted air into the household.

See table 1 for examples of systems and metrics.
Correlative relationship. The relationship between a proxy metric and the desired outcome.
Exported consequence. An effect that a smaller system has on a larger system in which it is embedded.
Desired outcome. The aspirational outcome that stakeholders wish to achieve in the system of desired impact, which may be very difficult to measure.
External influence. An effect that a larger system has on a smaller, embedded system.
Mechanistic relationship. The predictable connection between the performance metric and the proxy metric.
Neighbouring contributions. A change in the system of desired impact, including in the proxy metric, caused by systems at the same level as the system of action.
Performance metric. A measurable quantity characterizing the behavior of the system of action.
Proxy metric. A measurable quantity in the system of desired impact, which is related to the desired outcome.
System of action. The system that the stakeholder can change.
System of desired impact. The system that the stakeholder wishes to change.

Exported consequences
Each large system contains and is affected by many smaller systems, just as a body is affected by its internal organs. The party responsible for the major system often has little control over the sub-systems. A cook chooses how to operate a stove, but usually cannot alter the stove's basic design or its performance, although it affects her cooking. A region's environmental ministry may not have authority to regulate domestic cooking, but is still responsible for the airshed containing the pollutants.

Terminology
Motivations for changing small systems are often the consequences they export to larger systems. To describe the interplay among systems, we introduce new terms that are summarized in box 1 and illustrated in figure 2. Aspirations often relate to a larger system ('system of desired impact').
Whether the stakeholder is the stove user who seeks to reduce illness or household expenses, a public-health official hoping to improve living conditions in a community, a donor program intending to stimulate markets for better products, or a vendor promoting a cooking fuel, each stakeholder hopes that changes will lead to a desired outcome. However, many of these desired outcomes are difficult or impossible to measure directly. Changing the larger, more complex system is usually impractical or impossible, so actors often focus on changing a small system ('system of action') whose behavior can be characterized with one or more performance metrics.
Two connections are needed to ensure that the change in the system of action could improve the system of desired impact. First, if the desired outcome is not measurable, the system of desired impact must contain a quantity that is measurable and that correlates with the desired outcome. We refer to this measurable quantity as the proxy metric, and its relationship with the desired outcome as the correlative relationship. There can be no confidence that change has actually occurred without an observable indicator of change in the system of desired impact, even if the proxy metric is not measured in every project. Second, the performance metric must have a mechanistic relationship with the proxy metric. That is, the connection must be objectively observable and quantifiable; a hypothetical relationship is insufficient. This connection links the smaller system with the larger one. This assumption, that changes in the system of action lead to changes in the system of desired impact, is a fundamental theory of change that often goes unstated. Further, when these connections are obscured and their uncertainties are not acknowledged, expectations about changes in the system of desired impact may be unrealistic. The system of desired impact also contains other, smaller systems at the same level as the system of action that may also affect the proxy metric. We term the changes caused by these systems neighbouring contributions; they may or may not be affected by changes imposed on the system of action.

Guidance for evaluation
Evaluation procedures can quantify baseline situations, assess whether the desired outcome is likely to be achieved before the activities are carried out, or determine what was achieved after the interventions. Evaluation can be confounded by external influences and by uncertainty in the evaluation process. The 12 questions listed in box 2 are provided to aid in identifying the evaluator's goals, determining how to assess the appropriateness of evaluation mechanisms, and interpreting the results of evaluation procedures.

Identifying systems (Q1-Q2)
The system of desired impact and the system of action depend on the stakeholder seeking the information and the desired outcome of the planned program.

Box 2. Questions to clarify the purpose and interpretation of evaluation.
Q1. What is the system of desired impact?
Q1a. Which stakeholders have an interest in altering that system?
Q2. What is the system of action?
Q2a. Which stakeholders alter the system of action and conduct the evaluation?
Q3. What is the desired outcome?
Q3a. Which stakeholders have defined the desired outcome?
Q4. What is the proxy metric?
Q4a. How is the proxy metric related to the desired outcome (correlative relationship)?
Q4b. What uncertainty is associated with that relationship?
Q5. What is the performance metric?
Q5a. How is the performance metric related to the proxy metric (mechanistic relationship)?
Q5b. What uncertainty is associated with that relationship?
Q6. What are the critical inputs?
Q6a. How do the critical inputs vary within the system of desired impact?
Q7. What is the target value of the performance metric?
Q7a. How has the target value been chosen?
Q8. What level of confidence in conclusions is desired?
Q9. What is the measured value of the performance metric and its uncertainty?
Q9a. Which evaluation protocol is fit for the purpose of assessing the value of the performance metric?
Q9b. What is the value of the performance metric, measured with the chosen protocol?
Q9c. What is the uncertainty in the performance metric due to random errors, systematic errors, inherent variability, and critical inputs?
Q10. Was the performance metric target value achieved within the system of action with the required level of confidence?
Q11. What are the predicted changes in the proxy metric and the desired outcome?
Q11a. What factors outside the system of action affect the proxy metric and the desired outcome (neighbouring contributions, critical inputs)?
Q11b. What total uncertainty is associated with the proxy metric?
Q11c. What total uncertainty is associated with the desired outcome?
Q12. Is an observable change expected in the system of desired impact?
The system of action is the largest system directly altered by the program. The system of desired impact is typically much larger and more complex than the system of action. Table 1 lists some possible systems of desired impact and of action relevant to different stakeholders.
Table 1 notes. CI (critical inputs): cooking and non-cooking tasks. NC (neighbouring contributions): changes in other sectors such as vehicles and industrial activity. (a) A cooking task is a function that the intervention is intended to replace, and that affects the desired outcome. (b) A non-cooking task is a function that the intervention is not intended to replace, but that also affects the desired outcome.
Table 1 illustrates a desired outcome, proxy metric, and performance metric for some possible situations. It is important to differentiate between the three outcomes: the desired outcome (Q3) which occurs in the system of desired impact and may not be measurable; the performance metric (Q5) that characterizes the system of action; and the proxy metric (Q4) that is an intermediate step between the performance metric and the desired outcome.

Identifying critical inputs and their influence (Q6)
Because the system of action operates inside the system of desired impact, it must be evaluated under conditions similar to those occurring throughout the system of desired impact. Specifically, inputs that affect the performance metric ('critical inputs') must be comparable between the test used for evaluation and those encountered in the system of desired impact. Otherwise, the performance metric fails to represent the performance of the system of action as it is commonly used.
The influence of critical inputs can be determined through sensitivity studies or by measuring the performance metric in naturally-occurring conditions throughout a larger system. This task is best suited to developers of a measurement protocol in consultation with implementers and other stakeholders, and is not well suited to individual programs or evaluators.
The answers to question 6 capture what has been called a 'controlled versus uncontrolled' dichotomy. A designer might test a device to ensure its performance under well-controlled conditions, ignoring external influences. For public health evaluations, however, the entire community may be the system of desired impact, adding factors that influence stove performance. This difference has also been called 'laboratory versus field testing' (Medina et al 2017), although it is caused by variation in critical inputs and imposed controls rather than the physical location of testing.

Identifying performance targets and confidence (Q7-Q8)
Once the performance metric has been identified, the evaluator selects a target value for each metric (Q7) that indicates the success of the project. The statement that the system has achieved the desired level of change must be given with an expressed level of confidence (Q8). There are at least three ways to choose target values, which are illustrated in figure 3 and discussed below. Appendix D contains an example of each method for the case study situation in appendix C.
(i) Select a statutory or broadly-accepted value such as a performance standard or an environmental standard. These standards might be externally imposed by a government, industry or funding agency.
(ii) Compare final performance with a previous result or baseline, for example 'consumes 25% less fuel than the traditional stove' or '50% less time spent preparing fuel.' This option requires capturing metric values before and after the intervention. Baselines can be determined through measurement or estimation, but all entail some uncertainty.
(iii) Relate the performance metric to a desired value of the proxy metric. A target proxy metric value is assigned and the equivalent performance metric value is determined by working backward through the mechanistic relationship using equations, statistical relationships or computer simulations. A target desired outcome may also be chosen directly, which would require working backward through the correlative relationship between the desired outcome and the proxy metric. These additional steps introduce greater uncertainty.
Figure 3 illustrates comparisons of the performance metric with central values (squares) and the reported confidence interval (bars) for three hypothetical cooking systems. System A meets the performance metric target value with high confidence, as all likely values meet the target. System B has questionable performance; even though the central value meets the target, many of the likely values do not. The performance of system C is better, yet some of the likely values do not meet the target. However, if the accuracy of the measurement method could be improved (black lines in (i) or (ii)), then there could be greater confidence that system C met the performance target. If the uncertainty in the measurement procedure is large enough, it may not be possible to make a claim of success with the appropriate confidence. In that case, a new procedure should be sought, the apparatus improved, or the target revised.
When the target is a proxy metric or desired outcome rather than a performance metric (figure 3(iii)), additional uncertainties can be large. Consequently, there is a lower likelihood of detecting a small change with high confidence.
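The comparison of a measured value and its uncertainty against a target can be sketched numerically. The following is a minimal illustration, not a procedure from this framework: it assumes that uncertainty in the performance metric is normally distributed, that lower values are better, and that a 90% confidence level is required. The function names and the values for systems A, B and C are hypothetical and are not read from figure 3.

```python
from statistics import NormalDist


def probability_target_met(central, std_uncertainty, target):
    """Probability that the true metric value is at or below the target.

    Assumes normally distributed uncertainty and a 'lower is better'
    metric; both are assumptions of this sketch.
    """
    return NormalDist(mu=central, sigma=std_uncertainty).cdf(target)


def claim_supported(central, std_uncertainty, target, required_confidence=0.90):
    """True if the target is met with at least the required confidence."""
    return probability_target_met(central, std_uncertainty, target) >= required_confidence


# Hypothetical (central value, standard uncertainty) pairs in arbitrary units.
target = 10.0
systems = {
    "A": (6.0, 1.0),           # well below the target
    "B": (9.5, 2.0),           # central value barely meets the target
    "C": (8.0, 2.0),           # better, but uncertainty is still large
    "C_improved": (8.0, 1.0),  # same system, more precise measurement
}

for name, (central, sigma) in systems.items():
    p = probability_target_met(central, sigma, target)
    ok = claim_supported(central, sigma, target)
    print(f"{name}: P(meets target) = {p:.2f}, claim supported = {ok}")
```

Under these assumed numbers, only system A and the more precisely measured version of system C support a claim of success at the chosen confidence level, which mirrors the qualitative argument made for figure 3.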

Measuring the performance metric (Q9)
Once the performance metric, target, and required confidence are determined, a measurement protocol can be sought. Several requirements make a protocol Fit for Purpose (De Bievre 2010).

Validity and relevance
The protocol should be valid from a theoretical standpoint, meaning that it must be free of misconceptions in both the measurements and their interpretation. The protocol should measure a performance metric that is suitably connected to the proxy metric. Relevance indicators may be used to assess whether the protocol is conceptually (internally) consistent, as well as consistent from the perspective of common sense (externally) (Gaskell and Bauer 2000). The validity and relevance of a protocol and a metric usually cannot be assessed by a protocol user; users should look to previous review processes.

Situational appropriateness or contextuality
The protocol should be suited to the skill of the operator and it should be possible to conduct the measurement in the required environment. The instructions given should be appropriate and unambiguous, so that all operators interpret them the same way. The time required to complete the protocol should be justified by the amount of information gained.

Transparency and traceability
The individual executing the protocol must document all choices made so that tests can be replicated. Documentation should include operation of the measurement, including instrumentation specifications. The documentation should also record choices of critical inputs used during the measurement (Q6), so that a reviewer can assess relevance to the system of desired impact. Confidence indicators may be used to demonstrate that the protocol, its associated measurements, and interpretation of results represent reality (Gaskell and Bauer 2000).

Reporting experimental error
The protocol must report variability inherent in measurements, which is caused by instrumental uncertainty and differences in inputs and operator practice. These errors can be quantified and reduced by examining repeatability and reproducibility. A repeatable measurement produces similar results under the same conditions; a reproducible measurement produces similar results when conducted in different locations or at different times.
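One common way to separate repeatability from reproducibility is a one-way analysis of replicate tests grouped by condition (for example, by location or by day). The sketch below is an illustration under stated assumptions, not a protocol from this framework: it assumes a balanced design with the same number of replicates in every group, and the function name and fuel-use values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def repeatability_reproducibility(groups):
    """Return (s_r, s_R) for a balanced set of replicate measurements.

    groups: list of lists; each inner list holds replicates taken under
    one set of conditions (e.g. one location or one day).
    s_r: repeatability standard deviation (pooled within-group spread).
    s_R: reproducibility standard deviation (within- plus between-group).
    """
    n = len(groups[0])
    if len(groups) < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 groups with equal replicate counts >= 2")
    within_var = mean(variance(g) for g in groups)
    var_of_means = variance([mean(g) for g in groups])
    # Part of the spread in group means is explained by within-group noise.
    between_var = max(0.0, var_of_means - within_var / n)
    return sqrt(within_var), sqrt(within_var + between_var)


# Hypothetical fuel use per test (kg) at three locations, three replicates each.
tests = [
    [1.0, 1.2, 1.1],
    [1.4, 1.5, 1.3],
    [0.9, 1.0, 1.1],
]
s_r, s_R = repeatability_reproducibility(tests)
print(f"repeatability SD = {s_r:.3f} kg, reproducibility SD = {s_R:.3f} kg")
```

A reproducibility value much larger than the repeatability value suggests that differences between locations, operators or times contribute more to the spread than does the measurement itself.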

Quantifying variability due to critical inputs
The protocol should identify the range of critical inputs expected throughout the system of desired impact, and the reported performance metric should quantify variability caused by that range of critical inputs. For example, if a political district is the system of desired impact, and critical inputs are fuel type and size, the protocol should report variability that would occur with all fuel types and sizes used within the district. This requirement is similar to the idea of external validity (Steckler and McLeroy 2008), but not as stringent; the performance metric must represent the system of action only where it is to be deployed.

Quantifying natural variability in the tested system
The system of action (for example, the cooking system or kitchen environment) often has inherent variability in performance, which can be eliminated only through artificial, strict control. An example is operator practice in feeding solid fuel. Whether variation is attributed to inherent variability or to critical inputs is arguable, but the most important goal is counting all variability under one of the groups.
Each of the criteria listed above is applied separately. A protocol may be transparent, but not valid to measure the desired performance metric. A protocol may be valid and transparent, but too complex to conduct in a field setting, and thus situationally inappropriate. The protocol may be valid, transparent, and appropriate for the situation, but may not quantify variability sufficiently. Further, including too many diagnostics in a single protocol could make it expensive and cumbersome, limiting its situational appropriateness.
There are formal methods of combining variability or uncertainty, whether introduced by experimental error, critical inputs, or natural variability (Morgan et al 1990, Cullen and Frey 1999). The combined uncertainty is to be used in answering the questions that follow.
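As one simple instance of such a method, independent sources of uncertainty are often combined as the root sum of squares of their standard uncertainties. The sketch below assumes that the three sources are independent and expressed in the same units; the function name and the numerical values are hypothetical. Correlated sources would call for covariance terms or a simulation-based approach instead.

```python
from math import sqrt


def combined_standard_uncertainty(components):
    """Root-sum-of-squares of independent standard uncertainties.

    components: mapping of source name -> standard uncertainty, all in
    the units of the performance metric. Independence is assumed.
    """
    return sqrt(sum(u ** 2 for u in components.values()))


# Hypothetical standard uncertainties, in the units of the performance metric.
components = {
    "experimental_error": 0.3,
    "critical_inputs": 0.4,
    "natural_variability": 1.2,
}

total = combined_standard_uncertainty(components)
# Share of the combined variance attributable to each source.
shares = {name: u ** 2 / total ** 2 for name, u in components.items()}

print(f"combined standard uncertainty = {total:.2f}")
for name, share in shares.items():
    print(f"  {name}: {share:.0%} of variance")
```

Reporting the share of variance from each source shows where effort to reduce uncertainty would matter most; in this assumed case, improving the instrument would change little because natural variability dominates.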

Interpreting evaluation results for the system of action (Q10)
If the previous choices are clearly answered and well conducted, the user is in a good position to answer Question 10: Has the system of action met the target performance metric with the required degree of confidence?
An affirmative answer to this question means that the system has achieved the target performance metric value. Uncertainties in the measured values, and large variability caused by critical inputs, can prevent the evaluator from being able to answer in the affirmative. Additional uncertainty is incurred when the target is set through a connection to the proxy metric. Methods to reduce uncertainty include (a) increasing robust performance of the system for the full range of critical inputs; (b) decreasing uncertainty inherent in the measurement protocol, but not by reducing the range of critical inputs; and (c) setting a target in the system of action rather than a larger system.

Interpreting evaluation results applicable to the system of desired impact (Q11, Q12)
The performance metric (in the system of action) must have a mechanistic relationship with the proxy metric (in the system of desired impact), and the proxy metric and the desired outcome must have a correlative relationship. However, the proxy metric might also respond to other systems nested in the system of desired impact (neighbouring contributions), and it might have external influences from larger systems. This possibility is even more likely for the desired outcome, as the correlative relationship to the proxy metric introduces yet more variability. Justifiable statements associated with a sequence of findings are summarized here and shown in a flow chart in appendix B.
• If the performance metric has not been tested with the full range of critical inputs, then no conclusions can be made about changes in the system of desired impact. • If the performance metric represents the full range of critical inputs, and meets the target metric with confidence, then the system of action has achieved the stated performance goal, but an observable change in the system of desired impact is not guaranteed. • Relative variability in the proxy metric is greater than that in the performance metric (figure 3(iii), Q11b) due to the mechanistic relationship and other influences. Projected uncertainty is needed to determine whether the change is detectable in the system of desired impact.
• Relative variability in the desired outcome is greater than that of the proxy metric (Q11c) due to the correlative relationship and other influences. Assessment of change in the desired outcome is advisable only when uncertainties have been accounted for, and a detectable change is still expected.
It may not be possible to measure a change in the system of desired impact, even if the system of action meets its performance target. This is a common conundrum. The larger system of desired impact might have provided the motivation for change, but realistic programs produce changes that may be observable only at the level of the smaller system affecting it.
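The growth of relative variability between the performance metric and the proxy metric can be illustrated with a small Monte Carlo simulation. Everything in this sketch is an assumption made for illustration: the mechanistic relationship is taken to be linear with an uncertain slope, neighbouring contributions are modelled as an additive term centred on zero, all quantities are normally distributed, and the numerical values and function names are hypothetical.

```python
import random
from statistics import mean, stdev


def simulate_changes(n_draws=20000, seed=0):
    """Draw paired changes in the performance metric and the proxy metric."""
    rng = random.Random(seed)
    perf_changes, proxy_changes = [], []
    for _ in range(n_draws):
        perf = rng.gauss(50.0, 10.0)       # change in performance metric
        slope = rng.gauss(0.5, 0.15)       # uncertain mechanistic relationship
        neighbours = rng.gauss(0.0, 10.0)  # neighbouring contributions
        perf_changes.append(perf)
        proxy_changes.append(slope * perf + neighbours)
    return perf_changes, proxy_changes


def relative_variability(values):
    """Standard deviation divided by the magnitude of the mean."""
    return stdev(values) / abs(mean(values))


perf, proxy = simulate_changes()
rel_perf = relative_variability(perf)
rel_proxy = relative_variability(proxy)
# Fraction of draws in which the proxy metric changes in the intended direction.
fraction_positive = sum(1 for x in proxy if x > 0) / len(proxy)

print(f"relative variability, performance metric: {rel_perf:.2f}")
print(f"relative variability, proxy metric:       {rel_proxy:.2f}")
print(f"fraction of draws with a positive proxy change: {fraction_positive:.2f}")
```

A further correlative step from the proxy metric to the desired outcome could be added in the same way and, under similar assumptions, would widen the spread again.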

Summary and outlook
This document proposes a framework for appropriate evaluation of cooking systems. We guide evaluators to develop quantitative theories of change that connect the performance of a small system of action to proxies and desired outcomes in a larger system, outlining features of evaluation protocols that render them fit for the purpose of gauging performance. Assessing uncertainty within the system of action, and of influences external to that system, is key to choosing appropriate protocols and to drawing inferences about performance and impacts. When the questions outlined are answered carefully, evaluators, policymakers and decisionmakers can gain confidence in the reported results and their interpretation.
Uncertainties can confound the detection of change, especially in the system of desired impact. This confounding of outcomes limits the ability to conduct well-accepted types of assessment, including randomized controlled trials. It has long been acknowledged that flexibility in design is needed for interventions in complex systems (Shiell et al 2008). The framework presented here demonstrates that assessing outcomes in complex, nested systems also necessitates adaptability. Positive change in complex systems with diverse influences requires systematic methods of recognizing incremental but meaningful contributions, as well as acknowledgement and acceptance of situations when detection of significant change is not feasible.

Data availability statement
No new data were created or analysed in this study.