Methods matter: Improved practices for environmental evaluation of dietary patterns
Global Environmental Change

Making food systems more sustainable is one of humanity's largest challenges. Over two decades of life cycle assessment research on the environmental performance of food systems has helped to inform efforts to address this challenge. In recent years, there has been much interest in aggregating the results of these studies at scales of national production, dietary patterns, and future food scenarios. The process of comparing impacts of diverse products based on extant literature presents numerous challenges which have been inadequately addressed. Drawing upon examples of greenhouse gas emissions and seafood systems, we suggest best practices to support more complete, consistent, and comparable aggregation practices. Ultimately this would lead to more robust industry and consumer decisions and public policy. We suggest that aggregation studies: 1) define product groups reflecting impact drivers and in accordance with study goals, 2) transparently select studies whose methods are consistent, and 3) assess results in the context of actual production or consumption patterns. Applying these practices would strengthen food life cycle assessment aggregation studies as a tool guiding towards sustainable food systems.


Introduction
Supplying humanity with food is tightly connected to virtually all regional- to global-scale resource depletion and environmental degradation challenges we confront (Campbell et al., 2017; Foley et al., 2011; Poore and Nemecek, 2018), including the global climate crisis (Campbell et al., 2017). Global food systems currently contribute a quarter of all anthropogenic greenhouse gas (GHG) emissions (Poore and Nemecek, 2018; Vermeulen et al., 2012), a proportion expected to increase as other sectors, e.g. the energy and transport sectors, more rapidly decarbonize (Willett et al., 2019).
Understanding impacts of food provisioning activities is not new (Andersson et al., 1994; Carlsson-Kanyama, 1999; Tilman, 1999), but focus has traditionally centered on understanding individual food systems and their challenges. In this context, life cycle assessment (LCA) has become an important technique to understand the environmental performance of food production and onward supply chains (Fig. 1).
Recently, attention has shifted to scaling up insights gained from individual food LCA studies to assess contributions to global-scale challenges, such as land or water use and GHG emissions, that arise from specific diets or aggregate national- to global-scale patterns of production or consumption (Tilman and Clark, 2014; Clark and Tilman, 2017; Clune et al., 2017; Hilborn et al., 2018; Poore and Nemecek, 2018; Springmann et al., 2018; Willett et al., 2019; Clark et al., 2019). These aggregation efforts use results from individual food LCAs to identify broader impact patterns and reduction opportunities and are necessarily limited by the scope, representativeness, and methods used in the underlying studies. Importantly, how results of individual food system studies are combined to yield diet- or population-scale impact insights also varies widely, but has received little attention to date. Given the urgent need to reduce GHG emissions from the global food system and the need to provide guidance to policymakers, consumers, and industry based on LCA outcomes, it is important to scrutinize and improve the methods used in aggregation studies.
Here we suggest three 'best practices' that address widely occurring, but easily avoided, pitfalls when aggregating results of individual food LCAs to the scale of diets or population-level production and consumption patterns. For each suggested best practice, we describe recent problematic applications in well-cited aggregation studies and illustrate with examples the potential effect of poor and better practice, and how applying these 'best practices' could improve the quality of future studies. The identification of these guidelines was based on experience producing, reviewing, and applying extant food LCA research underpinning aggregation efforts combined with knowledge about emission drivers in food systems, rather than on a formal systematic review. We often draw upon seafood examples because of the varied nature of seafood systems and because inclusion of seafood in aggregation work is a common source of confusion or mistreatment (Farmery et al., 2017; Bogard et al., 2019; Tlusty et al., 2019). We also often use GHG emissions in our examples, but the problems and solutions we describe are applicable to all food sectors and to the aggregation of results for any impact of concern.

Group products to reflect impact drivers and study goals
Given the diversity in food products and production systems available, it is necessary to group products when modeling and communicating their relative impacts. Our first suggested best practice is to make these groupings relevant to the goal of the study. We argue that these groupings should be selected with regard to underlying drivers of variability within each group and to the questions the analysis aims to answer. In the individual studies underpinning aggregation efforts, specific method choices are aligned with specific study goals, as mandated by LCA guidance documents (e.g. ISO, 2006a, b), and so should the aggregation efforts using them.
Many current food LCA aggregation studies group by taxonomy, i.e. species or groups of evolutionarily related species (Tilman and Clark, 2014; Clark and Tilman, 2017; Clune et al., 2017; Hilborn et al., 2018; Poore and Nemecek, 2018; Springmann et al., 2018; MacLeod et al., 2020). However, if the study aims to assess GHG emissions of diverse products, the taxonomic placement of a species is not a particularly relevant attribute, nor is it particularly relevant to decision-makers in a position to effect GHG emission reductions. For many food types, differences in production method (e.g. fishing gear, field or greenhouse grown crops, fed or unfed aquaculture) are a more important determinant of environmental impacts than species (Parker, 2012; Vázquez-Rowe et al., 2012; Ziegler et al., 2016). Field-grown tomatoes and cucumbers, for example, are more similar in terms of magnitude and drivers of GHG emissions than field-grown and greenhouse-grown cucumbers. In other cases, biological characteristics of an organism, like enteric fermentation in ruminant mammals, are important drivers of emissions and relevant to the grouping of products, not because of the taxonomic placement of those animals but because the biological difference (e.g. ruminant methanogenesis) is directly relevant to the goal of the study (quantifying GHG emissions of a diet).
Greenhouse gas emission aggregation groupings have also been undertaken reflecting ways in which production statistics are frequently reported, e.g. per country or small-scale/large-scale (e.g. Greer et al., 2019). However, while perhaps relevant from a national accounting or socioeconomic perspective, 'country', 'scale of production' or 'intensity' do not correlate with important sources of GHG emissions. While convenient, the resulting groupings fail to communicate which types of systems or products have higher or lower emissions because they do not capture the most important differences (Philis et al., 2019; Bohnes et al., 2019).
Defining group membership based on consistent and shared drivers of impact not only aligns the aggregation method with the assessment objective, but also results in lower within-group variability. It will also more faithfully represent actual differences in emission intensities between groups. Importantly, defining groups based on impact drivers means that group definition and membership will vary with the study objective(s), resulting in different groupings being suitable in different cases. Groupings that make perfect sense given one objective (e.g. quantifying GHG emissions of a diet) don't necessarily translate to meaningful distinctions when quantifying other environmental metrics (e.g. eutrophying emissions of a diet). In addition, as important new sources of impacts are discovered (e.g. GHG emissions from using land and seafloor areas for food production), the number of groups defined by drivers of impacts may change.
From a GHG emission perspective, non-fed aquaculture systems are more similar to each other than to fed ones, irrespective of species. In fact, non-fed aquaculture is more similar to capture fisheries in terms of emission drivers, since both are often dominated by fuel use (Gephart et al., 2021; Aitken et al., 2014), except in conditions when shell formation makes important contributions (Ray et al., 2018). In contrast, feed inputs are the dominant driver of emissions in many aquaculture systems, while other systems may produce taxonomically similar species but be associated with a very different set of impact drivers and reduction opportunities.
Two studies have attempted to take seafood production method into account when aggregating pre-existing study results into groupings that the authors assumed aligned with GHG emissions: Tilman and Clark (2014) and Clark and Tilman (2017) separated fisheries data into trawling and non-trawling fisheries and aquaculture data into recirculating and non-recirculating aquaculture. These groups, however, place net-pen salmon aquaculture in the same group as unfed mussel farming and shrimp culture in ponds converted from mangrove forest, since all are non-recirculating aquaculture systems, even though together they span the entire range of GHG emissions of seafood systems. It is well established that GHG emissions of contemporary salmonid aquaculture at farm-gate are overwhelmingly driven by upstream feed production (Pelletier et al., 2009; Parker, 2012; Ziegler et al., 2021). Unfed farmed mussels have a much lower GHG intensity, with main drivers including energy and material inputs (Ziegler et al., 2013; Runesson, 2020), and extensive shrimp ponds can be associated with dramatic rates of GHG emissions arising from land transformation (Järviö et al., 2018). Similarly, dividing fisheries into trawling and non-trawling groups (Tilman and Clark, 2014) may be relevant to some fisheries management decisions but is illogical when one is interested in characterizing GHG emissions: for example, the emission intensities of longline fisheries for albacore tuna, trap fisheries for lobster, and purse seine fisheries for anchovies (all grouped as non-trawl fisheries) vary by orders of magnitude and are found at different ends of the range in fuel efficiency of fisheries worldwide.
Fig. 1 (caption fragment). Published studies are still being entered into this database under construction; the number of studies in total and per category is therefore non-exhaustive, but the graph clearly illustrates the recent rapid increase in food LCA case studies.
Similarly, grouping midwater trawls with bottom trawls combines some of the most emissions-intensive fisheries for shrimps with some of the world's most efficient fisheries for sardines and herrings, an important detail which was corrected in the more recent work by Clark and Tilman (2017), where midwater trawling was removed from the trawl fisheries group and instead placed in the non-trawl group. Grouping like this, without taking impact drivers into account, leads to high within-group variability since items that are highly different in their performance are grouped together (Fig. 2). In addition, this way of grouping may reduce potential between-group differences. Fig. 3 illustrates how food production systems can instead be logically grouped using known major sources of GHG emissions with examples drawn from seafood systems. Basing aggregation analyses on major drivers of variability in impacts of interest has implications not only on the way products are grouped, but also on the identification and prioritization of data gaps. Several authors (e.g. Halpern et al., 2019; Bohnes et al., 2019; Cucurachi et al., 2019) suggest a taxonomically- and geographically-based strategy to identify data gaps for future assessment of food products, based on patchy coverage of food LCA studies in terms of species, impact categories and geographic origin. Identifying gaps based on key impact drivers would lead to potentially very different priorities for gap-filling.
Fig. 2 (caption fragment). … Clark and Tilman (2017) categorized for a) aquaculture and b) fisheries in the seafood groups defined by authors to the left (recirculating/non-recirculating aquaculture; trawling/non-trawling fisheries) and based on emission drivers (fed/non-fed; high-/low-fuel) to the right in each panel. Error bars show standard deviation. Note that driver-based grouping is the only one of the best practices suggested in this paper that was applied to these data.
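The effect of the grouping choice on within-group variability can be illustrated with a minimal sketch. The emission intensities below are hypothetical values chosen only for illustration (they are not taken from any cited study), and the function name `within_group_cv` is an assumption of this example. The sketch compares the coefficient of variation within groups when the same systems are grouped by a system label versus by an assumed emission driver (fed vs. non-fed).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical, illustrative emission intensities (kg CO2e per kg product).
# These are NOT values from any cited study.
systems = [
    # (name, label-based group, driver-based group, intensity)
    ("net-pen salmon", "non-recirculating", "fed", 5.0),
    ("pond shrimp", "non-recirculating", "fed", 7.0),
    ("rope mussels", "non-recirculating", "non-fed", 0.6),
    ("bottom-culture oysters", "non-recirculating", "non-fed", 1.0),
    ("recirculating trout", "recirculating", "fed", 6.0),
]


def within_group_cv(records, key_index):
    """Return the coefficient of variation (population SD / mean) per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[3])
    return {group: pstdev(values) / mean(values) for group, values in groups.items()}


cv_by_label = within_group_cv(systems, 1)   # recirculating / non-recirculating
cv_by_driver = within_group_cv(systems, 2)  # fed / non-fed

for grouping, result in (("label", cv_by_label), ("driver", cv_by_driver)):
    for group, cv in sorted(result.items()):
        print(f"{grouping:>6} | {group:<18} | CV = {cv:.2f}")
```

With these illustrative numbers, both driver-based groups are internally more homogeneous than the label-based "non-recirculating" group, which mixes fed and non-fed systems.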

Select studies that are relevant and whose underlying methods are consistent
Besides grouping, how case study results are selected for inclusion in an aggregation effort has a large influence on the outcome. Our second best practice is therefore to use only studies that are relevant and whose methods are comparable.

Apply clear inclusion criteria
When undertaking any data aggregation effort, it is desirable to use as much high-quality data as possible. Using a subset of available data can lead to a different outcome and could lend itself to a seeming cherry-picking exercise. While data search methods are often provided (e.g. Tilman and Clark, 2014; Clark and Tilman, 2017), it is unclear why many available published studies are missed, while grey sources are included, based on unclear criteria. Sometimes (Tilman and Clark, 2014), exclusion of studies is simply declared without clear explanation to support the decision (e.g. "uncommon" without defining "uncommon"), while studies representing experimental or niche production systems (e.g. Ayer and Tyedmers, 2009) are included. In the above noted example of recirculating aquaculture systems, at least nine studies were published in international journals up to 2015, while Clark and Tilman (2017, submitted in Dec 2016) base their aggregation efforts for recirculating aquaculture on only four studies whose GHG emission intensities varied by over an order of magnitude. Moreover, at least one system whose results were included was hypothetical in nature, another was a small niche production system which no longer operates, and a fifth study referenced by the authors as derived from "Pelletier 2010", on trout, does not exist. If results of all nine recirculating aquaculture case studies available in 2015 had been used, the resulting arithmetic mean emission intensity would have been ~30% lower than presented by Clark and Tilman (2017) and the resulting difference between the four seafood groups used to aggregate across systems would have been smaller and possibly not statistically significant. Therefore, making efforts to find available literature and defining and applying clear inclusion criteria is critical. Further, if insufficient studies are available to support a representative estimate for a defined group, then that should be recognized, and groups redefined rather than presenting inaccurate representations of production. In contrast, Poore and Nemecek (2018) present clear and relevant inclusion criteria, extend their data search to include grey literature, and list studies that were excluded along with the specific unmet inclusion criteria.
Fig. 3. All seafood production systems can be sorted into up to seven categories based on three potentially substantial sources of greenhouse gas emissions based on current insight from available seafood LCA studies. Note that the areas occupied by the seven categories in the figure do not correspond to either the potential number of systems or volume of production represented by each. While absolute emissions within each group vary widely depending on rate of inputs and outputs, groups share common drivers, improvement opportunities, and policy recommendations.
Table 1. LCA case studies of different seafood systems with country- and species-specific production volumes for 2019 (FAO, 2021) and estimated GHG intensities from published literature. For each seafood system, aggregated estimates of emissions are provided following two approaches: a simple average of the two cases, and a weighted average that takes into consideration each region's production volume.
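Transparent screening of this kind can be sketched in a few lines. The example below is a minimal illustration, not the procedure of any cited study: the `Study` fields, the criteria labels, and the study names are all hypothetical. The point it shows is that every exclusion is recorded together with the specific criteria that were not met, so the selection can be audited.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # All fields are hypothetical screening attributes chosen for illustration.
    name: str
    methods_documented: bool
    commercial_scale: bool
    inventory_reported: bool


# Hypothetical inclusion criteria: (label, predicate) pairs.
CRITERIA = [
    ("methods documented", lambda s: s.methods_documented),
    ("commercial-scale system", lambda s: s.commercial_scale),
    ("inventory data reported", lambda s: s.inventory_reported),
]


def screen(studies):
    """Split studies into included names and excluded names with unmet criteria."""
    included, excluded = [], {}
    for study in studies:
        unmet = [label for label, passes in CRITERIA if not passes(study)]
        if unmet:
            excluded[study.name] = unmet
        else:
            included.append(study.name)
    return included, excluded


candidates = [
    Study("Study A", True, True, True),
    Study("Study B", True, False, True),    # e.g. an experimental pilot system
    Study("Study C", False, True, False),
]

included, excluded = screen(candidates)
print("Included:", included)
for name, reasons in excluded.items():
    print(f"Excluded: {name} (unmet: {', '.join(reasons)})")
```

Reporting the `excluded` mapping alongside the results is what makes the selection reproducible by a reader.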

Align methods
Individual LCA studies use methods and report results in ways that, while appropriate to their specific study purposes, may be highly idiosyncratic and incomparable. Goals of individual LCA studies rarely include being fully comparable to all prior studies in a food category, but more often attempt to quantify impacts and improvement potentials of the specific food system under study or compare alternative production practices within the same system, e.g. conventional vs. organic feed use. Guidance documents (e.g. ISO, 2006a, b) mandate alignment of method choices with study goals; compliance with standards therefore does not guarantee comparability, since study goals differ. Central method choices that often differ include: the definition of the product to be compared (the "functional unit"); how far along a supply chain the analysis is undertaken, and which underlying processes are included (the "system boundaries"); whether analyses model existing systems or hypothetical futures (attributional vs. consequential LCA); how burdens are distributed between co-products (the "allocation method"); and if, and how, direct and indirect land use change (LUC) are accounted for; amongst others. Each of these can have a major influence on study results, and while it is well accepted in aggregation work that the functional unit needs to be harmonized, awareness is lower about the need to align other central method choices.
Averaging results of studies employing different methods without regard to the impact of those differences on values assembled is no different from calculating an average value of prices in different currencies. Poore and Nemecek (2018) acknowledge this challenge and include criteria on methodological compatibility (or the possibility to recalculate results following a harmonized methodology) in the inclusion criteria mentioned above. Returning to the Clark and Tilman (2017) example and the studies used by them to represent recirculating aquaculture, those included apply profoundly different strategies for co-product allocation (e.g. Ayer and Tyedmers (2009) allocated using nutritional energy while Samuel-Fitwi et al. (2013) used system expansion), which renders their results clearly incomparable. Similarly, the studies from which Tilman and Clark (2014) drew GHG emission data for fisheries encompass numerous allocation methods: system expansion (Thrane, 2004), economic (Ziegler et al., 2003), mass (Vázquez-Rowe et al., 2011) and temporal (Ramos et al., 2011). The choice of allocation method can change impact assessment results of fishery (Thrane, 2006) and aquaculture (Parker, 2018) products dramatically.
When aggregating LCA data, care has to be taken to avoid using studies with questionable method choices which may further distort comparisons. For example, studies which treat carbon contained in feed as sequestered but exclude subsequent respiration of the same carbon (Kallitsis et al., 2020) are fundamentally at odds with standards and with how the majority of food system GHG emissions are modeled. Some studies may also exclude demonstrably critical elements of the product life cycle, such as modeling emissions of oyster production while leaving out fuel and material inputs, which are important GHG emission sources in bivalve LCAs (Runesson, 2020), to then conclude that GHG emissions from farmed oysters are low compared to other foods (Ray et al., 2019). Methods to account for emissions from land transformation have improved over time and standards today mandate inclusion (e.g. BSI, 2012; Zampori and Pant, 2019; ISO, 2020), which has resulted in substantial increases in emissions when crop inputs to feeds are grown on recently converted lands. For example, farmed salmon is increasingly fed soy to replace marine protein (Aas et al., 2019) and the need to account for emissions from land use change renders results of recent salmon LCA studies incompatible with earlier ones (Pelletier et al., 2009; Ziegler et al., 2013).
Many of the food LCA aggregation studies referred to earlier treat LCA studies employing widely different methodologies as fully comparable and simply extract their results and proceed with calculating group averages as if the underlying differences had no effect.
Best practices in this regard are to only aggregate results of studies that use similar methodologies, i.e. excluding those that apply different methods, as done by Poore and Nemecek (2018). This will inevitably reduce the data available for aggregation, which of course is undesirable but may be mitigated, in part, by defining larger groups on the basis of key drivers. Alternatively, if key inventory data on critical drivers of impacts and production data (i.e. input data on resource use) are reported in different studies, those data can be used to recalculate results applying one consistent methodological approach throughout. Philis and colleagues (2019) employed this approach to align allocation methods and functional units to better compare results of farmed salmon studies, while Bergman and colleagues (2020) used a similar approach to compare farmed salmon with tilapia. Runesson (2020) recalculated bivalve life cycle inventory data and Gephart et al. (2021) did this for global seafood production.
Table 2. Summary of best practices identified, steps required to follow them and risks if not followed.
Best practice | Required steps and considerations | Risk if not followed
… | … | Introduce potential bias through the comparison of systems with diverse methods; drawing conclusions regarding the relative impact of systems that may not reflect their relative performance had methods been aligned; "comparing prices in different currencies" or "comparing apples and oranges"
Reflect the representativeness and distribution of data points within each group: represent rates of production or consumption | Select appropriate weightings based on objective of the study (e.g. region-specific vs global, consumption vs production, etc.) | Estimates skewed towards less representative systems and systems that are over-studied relative to their consumption and/or production
Reflect the representativeness and distribution of data points within each group: reflect the distribution of results within groups | Where multiple values are available for a group after applying inclusion criteria, prefer to communicate the median rather than the mean, and interquartile range rather than complete range | Estimates are skewed, likely higher, by outliers; estimates not reflective of typical production or consumption
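Why allocation choices must be aligned before results are averaged can be shown with a small recalculation sketch. It assumes a hypothetical system with two co-products and a known total emission; all numbers, the dictionary keys and the function names (`allocate`, `intensity_per_kg`) are illustrative assumptions rather than data or methods from the studies cited above. The same inventory is recalculated under mass and economic allocation to show how the reported intensity of the main product shifts.

```python
def allocate(total_emissions, coproducts, basis):
    """Split total emissions across co-products in proportion to `basis`."""
    total_basis = sum(props[basis] for props in coproducts.values())
    return {
        name: total_emissions * props[basis] / total_basis
        for name, props in coproducts.items()
    }


def intensity_per_kg(total_emissions, coproducts, basis, product):
    """Emissions allocated to `product`, expressed per kg of that product."""
    share = allocate(total_emissions, coproducts, basis)[product]
    return share / coproducts[product]["mass_kg"]


# Hypothetical inventory: 2000 kg CO2e emitted to produce two co-products.
total_emissions = 2000.0
coproducts = {
    "main product": {"mass_kg": 800.0, "value": 4000.0},
    "by-product": {"mass_kg": 200.0, "value": 200.0},
}

mass_based = intensity_per_kg(total_emissions, coproducts, "mass_kg", "main product")
economic = intensity_per_kg(total_emissions, coproducts, "value", "main product")

print(f"Mass allocation:     {mass_based:.2f} kg CO2e/kg")
print(f"Economic allocation: {economic:.2f} kg CO2e/kg")
```

Because identical inventory data yield different intensities under the two allocation bases, results from studies using different bases should be recalculated to a single basis before they are averaged.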

Reflect the representativeness and distribution of data points within each group
The third best practice relates to the ways in which data from individual LCA studies are aggregated to represent a larger pattern of production or consumption, in a region or globally. Frequently, in aggregation studies, results of LCA case studies are treated as equally representative data points, despite different studies frequently representing very different scales of production. Our third best practice is therefore to use the data in a way that faithfully represents their contributions to the group or patterns of consumption or production studied.

Represent rates of production or consumption
Many LCA case studies functionally represent very small production volumes or characterize experimental or emergent techniques that almost by definition have not yet achieved commercial scale or efficiencies. This is because they are often undertaken to understand or compare specific products or production practices, or locally important sectors, rather than systematically representing regionally- or globally-important production systems. In contrast, when trying to characterize national, regional or global production or consumption of food, it is important to draw data from broad sectors of the industry that reflect conventional, commercial-scale practices. Consequently, results of many LCA studies are simply of little to no value when aggregation studies set out to characterize patterns of large-scale production or consumption. Even when commercial-scale operations have been characterized in a set of LCA studies relevant to the aggregate system being assessed, the production volumes represented by different studies may vary widely, with some practices reflecting large portions of global production while others produce relatively small volumes. For example, the number of LCA studies of land-based recirculating farmed salmon culture systems is larger than that of studies of marine net-pen systems (Philis et al., 2019), despite the latter producing virtually all farmed salmon available globally. If LCA case studies are used as if the systems they represent are equally representative, as occurs when a simple average is calculated (as e.g. Tilman and Clark, 2014, Clark and Tilman, 2017 and others routinely do), resulting emissions can be grossly overestimated and misrepresent the products being characterized in an aggregation (Table 1).
Best practices to avoid this are to either exclude marginal systems (emerging or niche systems) altogether through inclusion criteria (as done by Poore and Nemecek, 2018; Hilborn et al., 2018), or apply an appropriate weighting factor to each data source that reflects its relative contribution to the phenomena being characterized in the aggregation effort (e.g. Poore and Nemecek, 2018; Hilborn et al., 2018; Hallström et al., 2019). As evident in Table 1, simple averages and production-weighted averages can lead to very different results.
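The difference between the two aggregation approaches can be made concrete with a minimal sketch. The intensities and production volumes below are hypothetical and are not the values reported in Table 1; the function names are likewise assumptions of this example. A high-emission system with a small production volume dominates the simple mean but barely moves the production-weighted mean.

```python
def simple_mean(cases):
    """Unweighted arithmetic mean of emission intensities."""
    return sum(intensity for _, intensity, _ in cases) / len(cases)


def weighted_mean(cases):
    """Mean of emission intensities weighted by production volume."""
    total_volume = sum(volume for _, _, volume in cases)
    return sum(intensity * volume for _, intensity, volume in cases) / total_volume


# Hypothetical cases: (system, kg CO2e per kg product, tonnes produced per year).
cases = [
    ("marine net-pen", 3.0, 990_000),
    ("land-based recirculating", 9.0, 10_000),
]

unweighted = simple_mean(cases)
weighted = weighted_mean(cases)

print(f"Simple mean:              {unweighted:.2f} kg CO2e/kg")
print(f"Production-weighted mean: {weighted:.2f} kg CO2e/kg")
```

In this illustration the simple mean is roughly double the production-weighted mean, because the niche system counts as half of the data points while supplying only 1% of the volume.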

Reflect the distribution of results within groups
Despite multiple reasons presented to exclude data points from aggregation efforts, many data points are likely to remain available in appropriately defined groups. Plotting the distribution of emission estimates within groups, we often observe a positively skewed distribution of values (Nijdam et al., 2012;Poore and Nemecek, 2018;Parker et al., 2018). When such a positively skewed distribution of representative values occurs, characterizing their central tendency using the mean of the included studies will result in an overestimate of impacts and providing a max-min range without indicating the distribution of values within that range will similarly suggest an overestimated rate of emissions. We suggest that, in cases where multiple studies are being drawn upon to directly calculate an estimate for any group of products, median values and interquartile ranges are more appropriate representations of the findings of the studies available. Medians and interquartile ranges have been applied successfully, for example, in emissions synthesis work by Hilborn et al. (2018) and Poore and Nemecek (2018).
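A short sketch shows how the summary statistic changes the picture for a positively skewed group. The values are hypothetical and chosen only to include one high outlier; the `method="inclusive"` setting of `statistics.quantiles` is one of several reasonable quartile conventions and is an assumption of this example, not a recommendation from the studies cited.

```python
from statistics import mean, median, quantiles

# Hypothetical emission intensities (kg CO2e per kg) for one product group,
# positively skewed by a single high value.
intensities = [1.0, 1.2, 1.5, 1.8, 2.0, 2.5, 12.0]

group_mean = mean(intensities)
group_median = median(intensities)

# Quartiles; the "inclusive" method treats the data as the full population
# of available studies rather than a sample from a larger one.
q1, q2, q3 = quantiles(intensities, n=4, method="inclusive")

print(f"Mean:            {group_mean:.2f}")
print(f"Median:          {group_median:.2f}")
print(f"Interquartile:   {q1:.2f} to {q3:.2f}")
print(f"Min to max:      {min(intensities):.2f} to {max(intensities):.2f}")
```

Here the mean is pulled well above the median by the single outlier, and the min-max range suggests far more spread than the interquartile range, which is why the median and interquartile range give a fairer summary of the typical system.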

Implementation and policy advice
Quantitative assessment of the sustainability of food production systems is essential given the urgent need to rapidly transform food systems to limit most global-scale resource depletion and environmental crises. The rapidly growing body of food LCA studies and related research provides a wealth of knowledge and data from which to estimate aggregate or average impacts across food items (e.g. diets), methods of production (e.g. organic vs. conventional) etc. However, if these and other food impact aggregation efforts are to robustly represent what they set out to understand, they must be based on sound and transparent methods. Above, and summarized in Table 2, we identify major ways in which current food impact aggregation efforts frequently fall short, and describe a set of best practices that should be adopted by researchers when undertaking food impact aggregation studies regardless of setting or scale. Adoption of these best practices would also provide a more robust and data-driven basis for directing future data gap-filling efforts to where they are most needed, increasing the overall efficiency and effectiveness of our collective food system assessment efforts. Readers and users of food LCA aggregation studies, including public- or private-sector decision-makers, can use the questions listed in Box 1 to check if a study follows these best practices or not.

Box 1 Checklist for food LCA aggregation readers/users
✓ Are product groups based on key drivers of the impact of interest (e.g. GHGs)?
✓ Is the basis for the grouping clearly presented and explained?
✓ Is the literature search method described transparently (with regard to search engine, databases, search terms and date and number of studies found)?
✓ Is a clear basis for inclusion/exclusion of studies presented?
✓ Do the inclusion/exclusion criteria address key LCA method choices (e.g. allocation, system boundaries, type of LCA, land use change modelling)?
✓ If data from studies with differing methods are used, have new analyses been conducted so that key method choices are aligned?
✓ Were experimental/emerging production systems weighted according to their relative contribution to the activity being represented, or excluded through exclusion criteria?

The broad-scale patterns identified by aggregation studies that successfully implement a more rigorous methodological approach can provide a powerful basis for public policy and private decision-making, transforming food systems towards producing and consuming lower impact foods both within and between food products and systems. All methods provide answers but without requiring careful adherence to robust, scientifically-sound practices, the answers provided may not only limit our understanding of reality but guide us away from the more sustainable future we need to achieve.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.