COSPAR Sample Safety Assessment Framework (SSAF)

The Committee on Space Research (COSPAR) Sample Safety Assessment Framework (SSAF) has been developed by a COSPAR appointed Working Group. The objective of the sample safety assessment would be to evaluate whether samples returned from Mars could be harmful for Earth’s systems (e.g., environment, biosphere, geochemical cycles). During the Working Group’s deliberations, it became clear that a comprehensive assessment to predict the effects of introducing life in new environments or ecologies is difficult and practically impossible, even for terrestrial life and certainly more so for unknown extraterrestrial life. To manage expectations, the scope of the SSAF was adjusted to evaluate only whether the presence of martian life can be excluded in samples returned from Mars. If the presence of martian life cannot be excluded, a Hold & Critical Review must be established to evaluate the risk management measures and decide on the next steps. The SSAF

European Space Agency, Mars Exploration Group, Noordwijk, The Netherlands. NASA Headquarters, Office of Planetary Protection, Washington, DC, USA. Goethe University, Department of Geoscience, Frankfurt, Germany. UK Health Security Agency, Rare & Imported Pathogens Laboratory, Salisbury, UK. NASA Johnson Space Center, Astromaterials Research and Exploration Science Division, Houston, Texas, USA. Clarkson University, Department of Mechanical and Aeronautical Engineering, Potsdam, New York, USA. NASA Goddard Space Flight Center, Solar System Exploration Division, Greenbelt, Maryland, USA. Security Programs, Engineering Biology Research Consortium, Emeryville, USA. Rutgers University, Department of Earth and Environmental Sciences, Newark, New Jersey, USA. The Open University, Faculty of Science, Technology, Engineering & Mathematics, Milton Keynes, UK. NASA Goddard Space Flight Center, Astrochemistry Laboratory, Greenbelt, Maryland, USA. Japan Aerospace Exploration Agency (JAXA), Institute of Space and Astronautical Science (ISAS), Chofu, Tokyo, Japan. New Mexico Institute of Mining and Technology, Biology Department, Socorro, New Mexico, USA. Erasmus University Medical Centre, Department of Viroscience, Rotterdam, The Netherlands. NASA Headquarters, Planetary Science Division, Washington, DC, USA. Centre National d’Études Spatiales (CNES), Nancy, France. Princeton University, Department of Geosciences, Princeton, New Jersey, USA. London School of Hygiene & Tropical Medicine, Department of Medical Statistics, London, UK. Indiana University Bloomington, Earth and Atmospheric Sciences, Emeritus, Bloomington, Indiana, USA. Imperial College London, Department of Earth Science & Engineering, London, UK. RISE, Research Institutes of Sweden, Department of Methodology, Textiles and Medical Technology, Stockholm, Sweden. Japan Aerospace Exploration Agency (JAXA), Institute of Space and Astronautical Science, Sagamihara Kanagawa, Japan. University of Tokyo, Graduate School of Science, Tokyo, Japan. Université de Paris, Institut de Physique du Globe de Paris, Paris, France. European Institute for Marine Studies (IUEM), CNRS-UMR6538 Laboratoire Geo-Ocean, Plouzané, France. Conseiller Scientifique, Innovaxiom, France.

Gerhard Kminek et al., 2022; Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ASTROBIOLOGY Volume 22, Supplement 1, 2022. Mary Ann Liebert, Inc. DOI: 10.1089/ast.2022.0017


Introduction
Analyzing martian samples in terrestrial laboratories would advance our understanding of Mars in multiple ways that are impossible when using in situ missions or martian meteorites alone. Most recently, the Mars Sample Return (MSR) Science Planning Group 2 (MSPG2) produced an up-to-date status of MSR science planning (Meyer et al., 2022).
With the expected benefits of MSR, however, come responsibilities. If life is present on Mars, then samples from Mars could be a source of extraterrestrial biological contamination for Earth. In line with Article IX of the United Nations Space Treaty (UN Space Treaty, 1966), a range of measures that are described by the Committee on Space Research (COSPAR) Policy on Planetary Protection would have to be employed to prevent undesirable consequences for Earth's systems (DeVincenzi et al., 1998; COSPAR, 2021). One of these measures is to conduct a timely safety assessment of any unsterilized material from Mars. The first step toward developing such a safety assessment began in 2000 under the leadership of the National Aeronautics and Space Administration (NASA), with contributions from the Centre National d'Études Spatiales (CNES), in the form of a series of five workshops that led to a Draft Test Protocol (Rummel et al., 2002). An important recommendation of this earlier work was to periodically review and update the Draft Test Protocol by taking into account new scientific findings and advances in instrumentation. As an intermediate step and response to this recommendation, NASA and the European Space Agency (ESA), in coordination with COSPAR, organized a life detection conference and workshop in 2012 to discuss the latest concepts and methods to search for life and identify relevant elements for a safety assessment (Allwood et al., 2013; Kminek et al., 2014).
With an increased interest in a joint NASA-ESA MSR Campaign and associated planning activities underway, the need to produce an updated version of the safety assessment became evident. This is reflected in one of the recommendations of the International Mars Architecture for the Return of Samples (iMars) Phase II Working Group (Haltigin et al., 2016): "A Planetary Protection Protocol should be produced as soon as it is feasible by an international working group under the authority of COSPAR or another international body." This need is also described with additional contextual information in the work of Rummel and Kminek (2018).
COSPAR swiftly reacted and established a Sample Safety Assessment Protocol (SSAP) Working Group in 2018. This Working Group had the mandate to review existing literature and the planned MSR Campaign architecture to produce a sample safety assessment protocol. The mandate for the SSAP Working Group specifically excluded biosafety control and management aspects, that is, sterilization of material from Mars, environmental and health monitoring, containment elements, and contingency planning. The Working Group had members covering the relevant expertise in life detection, public health, infectious diseases, physical and chemical composition of expected material from Mars, extraterrestrial sample analysis, sample curation, and statistical analysis. Additional experts were invited to participate in specific meetings. In particular, a team from the US Centers for Disease Control and Prevention (CDC) was invited to comment on the draft and final versions of this report. Collectively, this external input substantially benefited the SSAP Working Group's deliberations.
Toward the end of the Working Group's term, the name for our product was reconsidered. It was felt that it would be more appropriate to call this a framework rather than a protocol (or a draft protocol) to better represent the content. A detailed (or even draft) protocol will need to be developed once a number of open issues addressed throughout this report and summarized in Section 5 are resolved and more information about the samples is available.
Some general remarks to better understand this Sample Safety Assessment Framework (SSAF) include the following: For the purpose of formulating the SSAF, we considered the NASA-ESA MSR Campaign. We are using the term sterilization in a generic way to include both overkill (i.e., a process with substantial margin that does not require viability testing after application but would typically render samples useless for further biochemical investigations) and inactivation (i.e., a process with less margin that requires viability testing after application and would likely allow for conducting certain biochemical investigations after it has been applied).
The following sections describe the scope, structure, and content of the SSAF. Section 5 describes the key elements of the SSAF. These key elements are not independent and must be taken together with the remainder of the report for context and for additional and essential information. For elements of the SSAF that the Working Group considered mandatory or very important, we use the term "must", while for elements that only have an indirect effect (e.g., making the assessment faster or using less material) we use a conditional form.

Objective and Scope of the SSAF
The objective of the safety assessment is to assess whether martian life is present that would pose a risk for Earth's systems (e.g., environment, biosphere, geochemical cycles) in samples intentionally returned from Mars. Traditionally, risk is defined in terms of probability of occurrence and consequences. In our case, we do not know and could only speculate about the consequences of releasing potential martian biology on Earth. For the purpose of the SSAF, we use the term risk not in relation to consequences but to the release of active martian biology exclusively. The associated risk mitigation is based on two pillars: performing a safety assessment and/or sterilizing the material from Mars.
One of the assumptions we use in terms of potential martian biology is that it is based on carbon chemistry. The likelihood that extraterrestrial life is carbon based has been suggested and discussed in various publications with arguments focused on the versatility and abundance of carbon in our Solar System and beyond (e.g., NRC, 2002; Allwood et al., 2012; Craven et al., 2021). It is worth noting that even theoretical concepts of silicon-based life still employ organic moieties (Petkowski et al., 2020). Another assumption used is that any potential life on Mars utilizes soluble organic compounds. Organic molecules used by terrestrial life are soluble in either polar or nonpolar solvents. Any potential life based on insoluble organic molecules would be unlikely to cause harm to terrestrial systems (Dirk and Irwin, 2005). In addition, solid-solid reactions are very slow compared to those in solution. An intractable solid may be hazardous, though without the capability of interacting in fluids it would not be able to replicate on timescales that compete with those of terrestrial biological systems.
We agree with the National Research Council (NRC) Committee on the Review of Planetary Protection Requirements for Mars Sample Return Missions that "the potential risks of large-scale effects arising from the intentional return of martian materials to Earth are primarily those associated with replicating biological entities, rather than toxic effects attributed to microbes, their cellular structures, or extracellular products" (NRC, 2009). In addition to replicating biological entities, we consider it prudent to include biologically active molecules in the sample safety assessment (ESF, 2012; Craven et al., 2021). This expansion of the SSAF covers the incorporation of potential martian non-self-replicating biological agents that could lead to a redirection of life processes on Earth (i.e., virus-like, stray RNA or DNA-like, and prion-like entities) and even theoretical concepts of propagating catalytic reactions that may directly precede de novo life. Although it might be easier to find life that is the producer or host for such agents, the Mars returned sample safety assessment must have the capability to detect biologically active molecules independently as well. Throughout the SSAF, we use the term martian life to include both de facto martian life and biologically active molecules produced by martian life. There is also a very real possibility that there are toxic compounds present in the samples, for example, inorganic species such as perchlorates. Toxic effects that originate from the samples are not covered in the SSAF because they are limited to an occupational hazard and can be managed accordingly.
In the case that martian life is found in samples returned from Mars, large-scale negative effects on Earth's systems are not expected (NRC, 1997; NRC, 2009; ESF, 2012). However, it is impossible to exclude such consequences absolutely. Thus, a prudent and conservative approach is the most appropriate response: be ready for the unexpected (NRC, 1997; NRC, 2009; ESF, 2012).
There are many ways an alien life form could be harmful to Earth's systems. The possible interactions could include not only direct effects on humans, animals, and plants or their associated beneficial microbes, but also indirect effects of a competitive interaction with various terrestrial species. Examples abound of the detrimental effects that result from terrestrial invasive species being introduced into new environments (e.g., van der Putten et al., 2007; Litchman, 2010; Randolph and Rogers, 2010). There are also more subtle effects that can be imagined. Some microorganisms, while not obviously beneficial to humans, are actually keystone species, the loss of which could cause irrevocable harm to an ecosystem (e.g., Mills et al., 1993) or disrupt essential biochemical cycles (e.g., Jardillier et al., 2010). Unfortunately, we have only a limited ability to predict the effects of terrestrial invasive species, emerging pathogens, and uncultivated microbes on Earth's ecosystems and environments. This is true even for cultured and fully genome-sequenced terrestrial organisms and more so for potential extraterrestrial life. Thus, conducting a comprehensive sample safety assessment with the required rigor to predict harmful or harmless consequences of potential martian life for Earth is currently not feasible. This situation is not likely to change substantially within the next decade. On the contrary, the increased knowledge accumulated over the last decades has shown many more unexpected effects and dependencies in the various ecosystems of Earth (e.g., Pejchar and Mooney, 2009). Therefore, the scope of the SSAF is limited to evaluating whether the presence of martian life can be excluded in the samples without pretending to assess the potentially hazardous nature of the samples, except that if there is no life, there is no biological hazard.
This position is in line with the NRC Committee on Mars Sample Return Issues and Recommendations: "Evaluation of the sample for potential hazards should focus exclusively, then, on searching for evidence of living organisms, their resting states (e.g., spores or cysts), or their remains in the sample" (NRC, 1997).
Although this approach might lead to an impression that the SSAF is essentially a life detection framework, this impression would be incorrect. There are very important and clear distinctions between the general search for martian life in returned samples for purely scientific purposes and the assessment to exclude the presence of martian life in them. The SSAF starts from the positive hypothesis: "there is martian life in the samples." Testing this hypothesis, that is, excluding the presence of martian life, is complementary to the scientific objective to search for martian life. Science investigations and the sample safety assessment use the same scientific methodologies, though the purpose and associated burden of proof are reversed. Disproving either the positive (safety) or null (science) hypothesis to a certain level of confidence can only be accomplished by collecting sufficient statistical data. Meeting the objective to disprove the null scientific hypothesis is typically constrained by the available resources in terms of budget and time. The constraint on disproving the positive safety-relevant hypothesis is much less dependent on resources and more dependent on the acceptable risk or the acceptable level of assurance that a risk will be avoided. As a consequence, the search for martian life science objective will benefit from the increased rigor required by the safety assessment, given that it will utilize the same scientific methods and tests required to address the search for martian life objectives. Thus, all samples used for the safety assessment and all tests done on these samples will have a scientific value.
To emphasize again, the SSAF is not a life detection framework. There are life detection frameworks in discussion and under development (e.g., Green et al., 2021; Graham et al., 2021). These are timely efforts to assess the validity and confidence for evidence of extraterrestrial life and ways to communicate this information effectively. Finding evidence for life typically follows an incremental path until definitive evidence is reached by a consensus in the scientific community (Green et al., 2021). Due to the reversed burden of proof for the safety assessment, any ambiguous results (e.g., maybe abiotic, maybe terrestrial contamination, maybe masking martian life) would not disprove the positive hypothesis until a clear root cause is identified and confirmed. Any step toward an agreed-upon framework for life detection established by the science community would certainly help to reduce some uncertainties in the safety assessment and is therefore encouraged.
The following principles, derived from a Life Detection Conference & Workshop (Allwood et al., 2013; Kminek et al., 2014), reflect the interplay of science and sample safety assessments and provide the basis for the SSAF:
1. Use of a hypothesis-driven approach in the development of life detection investigation strategies and measurements for science (null hypothesis) and sample safety assessment (positive hypothesis).
2. The same types of scientific measurements inform the scientific understanding of the samples and their safety assessment.
3. A sample safety assessment must be data-driven, i.e., responsive to the results of individual or combined investigations.
4. The distinction between the scientific objective to search for martian life and the sample safety assessment is mainly the degree of rigor and supervision applied, which is described in this framework.
Unlike a scientific objective to search for life on Mars, the scope of the SSAF is limited to excluding the presence of martian life in the samples from Mars. Taking into account the diversity of samples and the microscopic distribution of potential life in macroscopic samples, every sample tube is considered a separate sample. A negative result (i.e., no martian life) associated with samples from a sample tube would provide a certain pre-defined level of assurance that there is no life and therefore no hazard for Earth in that sample tube. Such a determination cannot be extrapolated to other sample tubes nor can it be extrapolated to the planet Mars. A positive result for one or more samples would not necessarily mean they are hazardous for Earth. Any positive result would lead to a Hold & Critical Review (see Section 3.4). A deeper understanding of how any newly discovered biology works and what kind of capabilities it has would require detailed understanding of the metabolism, informational macromolecules, and replication of this extraterrestrial life. As on Earth, it is unlikely that life is represented only by one of its members, that is, if we discover a single martian life form, we would possibly discover more than one member of a martian biology. This, together with the fact that we do not even know how to cultivate most terrestrial microorganisms, makes it essential to manage expectations in terms of the possibility to conduct a proper hazard assessment. This aspect is further detailed in the implementation part of the SSAF (Section 4).
There is one open parameter that must be introduced to the sample safety assessment: the level of assurance required to declare a sample safe. This parameter would describe the stopping threshold, that is, the level of confidence in the statement "the presence of martian life is excluded in this sample." Setting such a level is important to avoid open-ended discussions and better estimate the efforts and resources necessary to conduct the sample safety assessment. For the purpose of running simulations and test cases, we have taken a value of "1 in a million chance of failing to detect life if it is there." For details on the background of this canonical value, the reader is advised to consult the ESF Study Group Report on MSR Planetary Protection Requirements (ESF, 2012).

Elements of the SSAF
There are four elements in the SSAF. Each is necessary, though on its own not sufficient to qualify for a safety assessment. The four elements are (Fig. 1): of techniques or instruments that could provide the information required by the SSAF. This list of candidate instruments is not a set of required or endorsed instruments but has been established for planning purposes.

Bayesian statistics
Bayesian reasoning and methods of statistical analysis are standard approaches with which to address complex statistical issues (Greenland, 2021) and are widely used in medical decision making (Hunink et al., 2014). Bayesian statistics can accommodate various forms of information and help to optimize limited resources, like sample material or time. Therefore, Bayesian statistics is considered an appropriate tool for the SSAF.
When little prior information is available, Bayesian and frequentist statistics will generally yield very similar results (Rothman and Lash, 2021). When prior knowledge can be incorporated, whether for decision making in medicine or assessments of Mars samples, and a series of tests are to be used with the results being updated after each test, Bayesian statistics is more applicable and appropriate (Hunink et al., 2014; Greenland, 2021). In our case, it is necessary to specify an a priori probability that there is martian life in a sample tube. The information acquired by the NASA Mars 2020 mission (Farley et al., 2020) can be used to make an informed judgement about the a priori probability of finding martian life in a sample tube before actually starting any testing. This informed judgement must reflect the conservative posture of a positive hypothesis. The results of applying tests on one sample tube can also inform, together with the other Mars 2020 information, the a priori probability of finding martian life in subsequent sample tubes. Recall, however, that a sample safety determination cannot be directly extrapolated from one sample tube to another one.
3.1.1. Sensitivity and specificity. In addition to establishing a pre-test probability (a priori), the other quantities that need to be estimated before Bayesian statistics can be applied are the sensitivity (Sn) and specificity (Sp) of the test. One complication is that terrestrial biological contamination would impact the specificity of the test, that is, it could lead to false positives. There is also another complication: even if there is life somewhere in the sample tube, there is no guarantee that there will be life in the subsamples that are examined. Thus, the sensitivity of the test for the sample of a specific sample tube depends on both the sensitivity of the test and the capture rate, that is, the probability of finding martian life in a subsample if there is in fact martian life somewhere in the sample material inside the sample tube. The capture rate is certainly less than 1.0 (i.e., less than 100%). The effective sensitivity (ESn) is the product of the sensitivity and the capture rate.
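The quantities introduced above can be written compactly. The block below is a notational sketch only: the symbol $C$ for the capture rate is shorthand assumed here for illustration, and the two likelihood ratios are the standard diagnostic-test definitions that match the values used in the worked example of Section 3.1.2.

```latex
\begin{align}
\mathrm{ESn} &= \mathrm{Sn} \times C, \qquad 0 \le C < 1 \\
\mathrm{PLR} &= \frac{\mathrm{ESn}}{1 - \mathrm{Sp}} \\
\mathrm{NLR} &= \frac{1 - \mathrm{ESn}}{\mathrm{Sp}}
\end{align}
```

Because $C < 1$, the effective sensitivity is always lower than the sensitivity of the test itself, which raises the NLR and weakens the evidence carried by each negative result.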
3.1.2. Driving factors for the safety assessment. If the presence of martian life in one of the subsamples cannot be excluded, then in terms of the safety assessment, we need to assume that there is a high probability that life is present.
FIG. 1. The four elements of the SSAF. There are multiple interdependencies between the various elements. The major external input parameter required is the level of assurance that something is safe. Some of the parameters need informed judgements based on Mars 2020 in situ data and tailored analogue test programs.
Note that this conservative approach, taking the precautionary principle into account (Pearce, 2004), assumes that the specificity of the test is 1 (i.e., 100%), that is, that a false positive cannot occur. Although the test would certainly be chosen to minimize the chance of a false positive, there is still the issue of terrestrial biological contamination that could at least bias the results (see Section 4.3 for more details). With this background information, how many negative test results are actually required before it can be concluded that the positive hypothesis (i.e., that there is martian life in the sample tube) has been "refuted" (i.e., the probability of life being present is less than a pre-defined level of assurance)? A theoretical example can illustrate this, including the dependency of the required number of negative tests on the various parameters. More information about the relationship of samples and subsamples is described in Section 3.2.
Using the following assumptions:
Pre-test probability = 0.50
Sensitivity = 0.99
Capture rate = 0.75
Specificity = 0.99

The following can be derived:
Effective sensitivity = $0.99 \times 0.75 = 0.7425$
Pre-test odds = $0.50/(1 - 0.50) = 1.00$
Positive Likelihood Ratio (PLR) = $0.7425/0.01 = 74.25$
Negative Likelihood Ratio (NLR) = $(1 - 0.7425)/0.99 = 0.26$

If the test is then applied to the first subsample and the result is negative, the post-test probability can be calculated as follows:
Post-negative-test odds = pre-test odds $\times$ NLR = $1.00 \times 0.26 = 0.26$
Post-negative-test probability = $0.26/(1 + 0.26) = 0.206$

Having one negative test reduces the probability that there is life in the sample from 0.50 (pre-test) to 0.206 (post-test); this value is now used as the pre-test probability for a second test; a second negative test reduces the probability further to 0.063, etc. Table 1 shows the number of sequential negative tests required for the post-test probability to become less than $1 \times 10^{-6}$ (i.e., 1 in a million), under a variety of assumptions.
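The sequential update in this example can be reproduced with a few lines of Python. This is a minimal sketch under the assumptions stated in the example (pre-test probability 0.50, sensitivity 0.99, capture rate 0.75, specificity 0.99); the function name and structure are illustrative and are not part of the SSAF.

```python
def post_negative_probability(pre_test_prob, sensitivity, capture_rate, specificity):
    """Return the probability that life is present after one negative test."""
    # Effective sensitivity combines the test sensitivity and the capture rate.
    effective_sensitivity = sensitivity * capture_rate
    # Negative likelihood ratio: how strongly a negative result shifts the odds.
    nlr = (1.0 - effective_sensitivity) / specificity
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * nlr
    return post_odds / (1.0 + post_odds)


# Apply two sequential negative tests, feeding each result back in as the
# pre-test probability of the next test.
probability = 0.50
history = []
for _ in range(2):
    probability = post_negative_probability(probability, 0.99, 0.75, 0.99)
    history.append(probability)

print([round(p, 3) for p in history])  # [0.206, 0.063]
```

Each pass through the loop multiplies the odds by the same NLR, which is why the required number of negative tests grows quickly when the effective sensitivity is low.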
This exercise shows two important results (Table 1):
The capture rate is crucial to this process. With a capture rate of only 0.25 (25%), a pre-test probability of 0.95 (95%) and both sensitivity and specificity at 0.95 (95%), it would require at least 77 negative tests (and no positive test) before one could conclude that the post-test probability is less than $1 \times 10^{-6}$. On the other hand, if the capture rate is 0.75 (75%), then only 15 negative tests (and no positive test) are required. For an unrealistic capture rate of 1 (100%), only 6 tests would be required.
The sensitivity and specificity of the overall test sequence are important as well but to a lesser extent than the capture rate. Only when values go much below 0.9 (e.g., 0.7) would this markedly increase the number of negative tests required before one could conclude that the post-test probability is less than $1 \times 10^{-6}$.
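The dependence on the capture rate can be checked numerically. The sketch below counts sequential negative tests until the post-test probability falls below the level of assurance, using the same odds-based update as the worked example. The function name, argument names, and the `max_tests` guard are assumptions introduced for illustration; this is not the procedure used to generate Table 1.

```python
def negative_tests_required(pre_test_prob, sensitivity, specificity,
                            capture_rate, assurance=1e-6, max_tests=10_000):
    """Count sequential negative tests until post-test probability < assurance."""
    effective_sensitivity = sensitivity * capture_rate
    nlr = (1.0 - effective_sensitivity) / specificity
    if nlr >= 1.0:
        # A negative result would not lower the odds, so no number of tests helps.
        raise ValueError("negative results carry no evidence (NLR >= 1)")
    odds = pre_test_prob / (1.0 - pre_test_prob)
    for n in range(1, max_tests + 1):
        odds *= nlr  # each negative test multiplies the odds by the NLR
        if odds / (1.0 + odds) < assurance:
            return n
    raise RuntimeError("assurance level not reached within max_tests")


# Pre-test probability, sensitivity, and specificity all at 0.95, as in the text.
for capture_rate in (0.25, 0.75, 1.0):
    n = negative_tests_required(0.95, 0.95, 0.95, capture_rate)
    print(f"capture rate {capture_rate:.2f}: {n} negative tests")
# capture rate 0.25: 77 negative tests
# capture rate 0.75: 15 negative tests
# capture rate 1.00: 6 negative tests
```

Varying `sensitivity` and `specificity` in the same function gives a quick way to explore the second result, namely that these parameters matter less than the capture rate until they fall well below 0.9.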
The capture rate for a natural sample and the sensitivity and specificity of a real test sequence can only be estimated by an informed judgement. The elements that are necessary to enable such an informed judgement are described in Sections 3.2 and 4.2.

Subsampling strategy
As described in the previous section, the capture rate has a major impact on the number of negative tests required before the samples from a sample tube can be declared safe with a pre-defined level of assurance. The number of negative tests (as defined in Section 3.3) required is equivalent to the number of subsamples of a sample in a sample tube that need to be tested. As some parts of natural samples are more likely to contain life than the rest (e.g., Onstott et al., 2019), it is typically not appropriate to apply random sampling. To maximize the probability that a subsample will contain martian life (i.e., increase the capture rate) if there is martian life somewhere in the sample of a sample tube, informed, targeted sampling is required. Random sampling or a poorly informed targeted sampling will reduce the effective sensitivity and thus would lead to a substantially higher number of negative tests (i.e., number of subsamples) required before a sample from a sample tube could be declared safe with a pre-defined level of assurance. This is illustrated in Table 1: a poor capture rate of 0.25 (25%) compared to a good one of 0.75 (75%) would require more than 60 additional subsamples to be processed (each with a negative result) before reaching the same level of assurance. Thus, an informed targeted sampling strategy needs to be applied to reach capture rate levels, ideally, above 0.5 (50%). Such a strategy requires a focus on the areas, characteristics, and features of the samples that are likely to contain martian life, taking into account the type of sample and the expected distribution and patchiness of life in the samples associated with fractures, veins, and general interconnected spaces as well as chemical interfaces and boundaries (e.g., Gorbushina, 2007; Cockell et al., 2019; Onstott et al., 2019; Brady et al., 2020; Suzuki et al., 2020).
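The link between capture rate and the number of subsamples can also be expressed in closed form. The derivation below is a sketch that assumes every test is negative and that the negative likelihood ratio (NLR) is the same for each subsample; the symbols $p_0$ (pre-test probability), $O_0$ (pre-test odds), $O_n$ (odds after $n$ negative tests), and $\alpha$ (level of assurance) are notation introduced here for illustration.

```latex
\begin{align}
O_n &= O_0 \,\mathrm{NLR}^{\,n}, \qquad O_0 = \frac{p_0}{1 - p_0} \\
\frac{O_n}{1 + O_n} < \alpha
  &\iff n > \frac{\ln\!\big(\alpha / [(1 - \alpha)\, O_0]\big)}{\ln \mathrm{NLR}}
  \qquad (\mathrm{NLR} < 1)
\end{align}
```

With $p_0 = 0.95$, sensitivity and specificity both 0.95, and $\alpha = 1 \times 10^{-6}$, a capture rate of 0.25 gives $\mathrm{NLR} \approx 0.803$ and $n > 76.2$, so 77 subsamples, whereas a capture rate of 0.75 gives $\mathrm{NLR} \approx 0.303$ and $n > 14.02$, so 15 subsamples. The difference of 62 is consistent with the "more than 60 additional subsamples" stated above.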
Many sample tubes will undoubtedly contain samples with diverse features that, depending on their categorization, could number from a few to a large number of distinct sections of each sample. However, it should be noted that such targeted sampling needs to be balanced, for example, by containing an appropriate mix of high-probability and medium-probability sites, rather than solely sampling from those sites with features that are considered to have the highest probabilities of containing life. The first step in this process is to obtain information about the 3-dimensional (3-D) morphological characteristics of the external and internal structures of the sample at a micrometer scale (i.e., $1 \times 10^{-6}$ meter) resolution. Though this spatial resolution is not necessarily sufficient to find morphological evidence of life, it is sufficient to image physical features that could contain such information (e.g., fractures, veins, and general interconnected spaces). To select optimal targets and establish priorities for subsampling, spatially correlated chemical and mineralogical information is required as well (e.g., Onstott et al., 2019).
Airfall or windblown dust is a special case in this sampling context. Although dust might be sorted to a certain degree during sample acquisition and transport, random sampling for dust samples is an adequate approach as long as the dust samples are homogenized (i.e., mixed) before subsamples are taken. It is worth noting that the serendipitous dust on the sample tubes is likely not of sufficient quantity to perform a sample safety assessment. This would be the case as well for any dust components inside the various sample tubes; hence, it is questionable whether small quantities of dust can be declared safe (i.e., free of martian life) based on a sample safety assessment. Clays are another special case. There are clays formed by local aqueous alteration (e.g., smectite coatings on weathered feldspars), and there may be clay-rich mudrocks that are typically homogenous in terms of the distribution of the clays. For determining the right subsampling approach, informed targeted subsampling is appropriate for clays formed by local alteration associated with distinct features (e.g., fractures) within lithified clay rocks. By contrast, clay-rich "muds" would be more suited for random subsampling. This informed approach can be generalized for various types of fine-grained minerals, that is, targeted subsampling for localized fine-grained alteration products or localized features, such as fractures in lithified fine-grained rocks, and random subsampling for unconsolidated fine-grained sediments. These approaches must be tested and confirmed by using terrestrial analogue material (see Section 4.2).
A further consideration in selecting subsamples is that each subsample must be independent (conditional on the targeted sampling strategy), that is, each subsample needs to be from different parts of the sample. If this is not done, for example, if all subsamples are selected from the same section of the same crack, the subsamples would not be independent, and the assumptions of the Bayesian analysis would no longer be valid (and neither would the assumptions of standard "frequentist" statistics). The independence of sampling is not an issue for dust samples since these are assumed to be homogenous.
The information about the sample, however, is only one element in developing a credible and robust targeted sampling strategy. The specific martian sample information must be linked with a knowledge base, that is, experience with similar terrestrial sample types, including dust samples. To establish such a knowledge base requires an analogue test program tailored to the expected types of samples from Mars (i.e., information from Mars 2020) and the kind of measurements that will be used to establish the information for deriving the capture rate (i.e., 3-D structural information and the spatially associated chemistry and mineralogy). For more information see Section 4.2.
Bayesian statistics provide an estimate of the number of subsamples necessary to reach a pre-defined level of assurance that the sample in a sample tube is safe. This is a very important aspect of the sample safety assessment because it facilitates planning with regard to the resources (e.g., time, number of subsamples) required for individual sample tubes. It also helps to establish a strategy that optimizes the sequence of investigations required for analyzing the available sample tubes. The amount of material for each subsample depends on the sensitivity of the test in relation to the resolution required for the sample safety assessment. Thus, any available technique that has been properly vetted and meets or exceeds the measurement requirements should be considered for use.
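As a rough illustration of how such an estimate could be obtained, the sketch below applies Bayes' rule under a simplified model that is an assumption of this example, not the formulation used by the SSAF: subsamples are independent, and if life is present in the tube, each subsample contains it with a probability equal to the capture rate. All function names, parameter names, and numerical inputs are hypothetical.

```python
def posterior_after_negatives(prior, capture_rate, sensitivity, specificity, n):
    """Posterior probability that a tube contains life after n negative results.

    Assumed model (not from the SSAF): independent subsamples; if life is in
    the tube, a subsample contains it with probability capture_rate.
    """
    # Negative result although life is in the tube: life was captured but
    # missed by the test, or life was not captured and the test is negative.
    p_neg_life = capture_rate * (1.0 - sensitivity) + (1.0 - capture_rate) * specificity
    # Negative result when the tube contains no life at all.
    p_neg_no_life = specificity
    weight_life = prior * p_neg_life ** n
    weight_no_life = (1.0 - prior) * p_neg_no_life ** n
    return weight_life / (weight_life + weight_no_life)


def negatives_needed(prior, capture_rate, sensitivity, specificity, target, max_n=10_000):
    """Smallest number of negative subsamples that brings the posterior to target or below."""
    for n in range(max_n + 1):
        if posterior_after_negatives(prior, capture_rate, sensitivity, specificity, n) <= target:
            return n
    raise ValueError("target assurance not reachable within max_n subsamples")
```

With the purely illustrative inputs prior = 0.1, capture rate = 0.5, sensitivity = 0.9, specificity = 1.0, and a target posterior of 1e-6, this simplified model calls for 20 negative subsamples. The example also shows why a higher capture rate from targeted subsampling reduces the number of subsamples required.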

Test sequence
In the previous sections about Bayesian statistics and the subsampling strategy, the term "test" is used in a very generic form. Unfortunately, there is no single "test" that can be applied to acquire all of the information necessary to perform a sample safety assessment. What is actually required is a set of investigations in a specific logical order that will inform the sample safety assessment. This set of investigations, referred to specifically as the "test sequence," is focused on the type of information necessary for sample safety assessment rather than the specific techniques and instruments required to gather it. Any statements made about a "test" in the previous sections (e.g., number of negative tests, dependencies of tests) are actually applicable to the test sequence. For example, the number of negative tests required to reach a pre-defined level of assurance that there is no martian life in a sample tube is actually the number of negative test sequences applied on the equivalent number of subsamples, that is, 3 negative tests means that the test sequence was applied on 3 different subsamples, each time with a negative overall result. It should be emphasized that all of these tests will produce scientifically important data.
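The bookkeeping implied here can be sketched as follows. The example assumes, for illustration only, that a test sequence counts as negative overall when none of its steps flags an indication of life; the text does not spell out how step results are combined. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SequenceResult:
    subsample_id: str
    # Step number -> True if that step gave an indication of life.
    step_positive: Dict[int, bool]

    @property
    def negative(self) -> bool:
        # Assumption: overall negative only if every step is negative.
        return not any(self.step_positive.values())


def count_negative_sequences(results: List[SequenceResult]) -> int:
    """Count negative test sequences, each applied to a different subsample."""
    ids = [r.subsample_id for r in results]
    if len(set(ids)) != len(ids):
        raise ValueError("each test sequence must be applied to a different subsample")
    return sum(1 for r in results if r.negative)
```

In this representation, "3 negative tests" corresponds to `count_negative_sequences` returning 3 for three distinct subsamples.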
3.3.1. Test sequence steps. The first investigations in the test sequence are derived directly from the need to support the targeted subsampling (see Section 3.2).
Three-dimensional structural information - Step 1. This investigation is focused on the 3-dimensional (3-D) morphological characteristics of the external and internal structural features of interest at micrometer-level spatial resolution while the samples are still in the sealed sample tubes. The importance and usefulness of this kind of analysis have been previously described by McLennon et al. (2011) and Tait et al. (2021). Together with the information provided by the Mars 2020 mission (Farley et al., 2020), this information will enable the assignment of a risk level (i.e., chance to find martian life, pre-test probability) to each sample tube. The use of Bayesian statistics and considerations of the dependencies of the various parameters (pre-test probability, capture rate, sensitivity, specificity) makes it obvious that the samples with the highest risk index (i.e., highest chance to find martian life) should be opened and analyzed first because these samples will provide the highest confidence in the results, whether positive or negative. They would also produce the most robust pre-test probability for any subsequent samples and altogether optimize the use of sample material. After having selected a specific sample tube for further analyses, the 3-D structural information forms the basis of a risk-informed subsampling strategy.
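The ordering principle described above, opening the tubes with the highest risk index first, amounts to a sort on the pre-test probability. The sketch below is illustrative only; the tube identifiers and probability values used with it would be hypothetical.

```python
from typing import Dict, List


def opening_order(pretest_probability: Dict[str, float]) -> List[str]:
    """Return tube identifiers sorted from highest to lowest pre-test probability."""
    for tube, p in pretest_probability.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"pre-test probability for {tube} must lie in [0, 1]")
    return sorted(pretest_probability, key=lambda t: pretest_probability[t], reverse=True)
```

For example, `opening_order({"tube_A": 0.02, "tube_B": 0.10, "tube_C": 0.05})` places `tube_B` first. In practice the pre-test probabilities of the remaining tubes would be updated after each tube is analyzed, which this sketch does not model.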
Gas analysis - Step 2. There are many scientific reasons for analyzing the headspace gas inside a sample tube (Swindle et al., 2022; Tosca et al., 2022; Velbel et al., 2022). In fact, analysis of the sample tube headspace gas is likely to be one of the first direct measurements of martian material beyond the indirect investigation described in Step 1. At the most basic level, analysis of the headspace gas might inform whether a tube has leaked (Parai et al., 2021). Having been sealed on Mars, the tubes will be at negative pressure relative to ambient Earth pressure, so immediate leakage will be of terrestrial atmosphere into the sample. If terrestrial atmosphere has leaked into the tubes, then it would have occurred during atmospheric entry or when the capsule was on the ground awaiting recovery. In either case, bacteria, dust, or other air-borne particulate matter may have been carried into the sample tubes as well, depending on the nature of the leak. Deposition of such matter on the martian samples has the potential to create false positives (i.e., assumed-to-be martian species) or overprint a true positive signal of martian life (Milam et al., 2021). The gas analysis would be important for planning the sequence of operations for opening the individual tubes, as well as for the interpretation of the data to know early on which tubes might be compromised by terrestrial contamination. Knowing which tubes are compromised will also be a key factor in determining the extent to which contamination knowledge samples will be required to deconvolve any terrestrial life signals in a sample from any potential martian signals that are also present (refer to details in Section 4.3).
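One conceivable way to turn a headspace measurement into a leak indicator is a two-endmember mixing estimate. This approach is an assumption of this example and is not prescribed by the SSAF. The oxygen mole fractions below are approximate values (about 21% for Earth air, on the order of 0.16% for the martian atmosphere) used only for illustration, and the calculation ignores any gas exchange between the sample and the headspace.

```python
# Approximate O2 mole fractions, used purely for illustration.
EARTH_O2 = 0.2095
MARS_O2 = 0.0016


def terrestrial_fraction(measured: float, mars: float = MARS_O2, earth: float = EARTH_O2) -> float:
    """Estimate the fraction of terrestrial air in the headspace by linear mixing.

    Assumes the measured mole fraction is a simple mixture of the two
    endmember compositions; the result is clamped to the range [0, 1].
    """
    if earth == mars:
        raise ValueError("endmember compositions must differ")
    fraction = (measured - mars) / (earth - mars)
    return min(1.0, max(0.0, fraction))
```

A value well above zero would mark a tube as potentially compromised; where to set that threshold is left open here because it depends on measurement uncertainty.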
Chemistry and mineralogy associated with the 3-D structural information - Step 3. These investigations focus on the acquisition of information about the chemistry and mineralogy associated with 3-D structural features of interest in a sample (e.g., fractures, veins, and general interconnected spaces). This could be done at the same time that 3-D structural information is acquired on samples while still in their respective sealed sample tubes (Step 1) or, subsequently, once sample material is removed from the sample tubes. The benefit of the latter approach is that the quality and spatial resolution of the chemical information acquired on sample material removed from the sample tubes might be better. Such chemical and mineralogical information is essential to refine the subsampling strategy (Tait et al., 2022; Carrier et al., 2022), which is based on the 3-D structural information, and of particular importance to optimize the subsequent use of sample material.
To put these first investigations in the proper context, it is useful to describe briefly the expected initial sample characterization steps in the frame of the sample curation activities (Tait et al., 2022). The initial sample characterization covers three distinct phases: Pre-Basic Characterization (Pre-BC), Basic Characterization (BC), and Preliminary Examination (PE) (Fig. 2). The first investigation required in the SSAF, 3-D morphological characterization of the external and internal structures at micrometer-level spatial resolution (Step 1) while the samples are still in sealed sample tubes, overlaps with the Pre-BC investigations.
Step 3 (chemical and mineralogical information associated with features of interest in the sample structure) overlaps with the BC and PE investigations (Fig. 2). These overlaps are beneficial because the set of investigations serves three functions: curation, science, and sample safety assessment.
Steps 1 and 3 of the test sequence provide information about how many and which subsamples to take from the sample of a sample tube. Products and effects of life in a host rock are generally volumetrically more significant than life itself (Onstott et al., 2019). Therefore, it is possible that the results of these first steps would provide initial indications of life, in addition to refining the targeted subsampling. Morphological indications consistent with life are a special case in this context. Independent of the analytical process used, morphology alone can be misleading. There is a long history of incorrect interpretations of cell-like morphologies as evidence of fossilized life (see Section 3.3.2). What actions follow, in particular for Step 4, depend upon the associated chemistry and whether any morphological feature of interest is unique and an isolated observation or a common constituent of a sample. Unlike the scientifically relevant null hypothesis, the sample safety assessment is focused on the positive. Therefore, targeted investigations for Step 4 require morphological and chemical information that exclude a martian biological origin of common and unique features in the samples rather than just attempt to confirm their potential biological origin.
Organic molecules - Step 4. This step initiates the search for molecular evidence of martian life in targeted subsamples. The search strategy is based on the assumption that potential martian life is based on carbon chemistry. Therefore, the subsequent investigations (Steps 4, 5, and 6) must include a focus on locating, identifying, and characterizing organic compounds in the subsamples. Steps 1 and 3 of the test sequence are critical in the search for any organic molecules that might be associated with life because, like on Earth, it is expected that life is spatially clustered and not homogenously distributed in the host rock, and that the bulk organic content of the host rock is not necessarily correlated with the presence or absence of life (e.g., Onstott et al., 2019; Suzuki et al., 2020).
For the purpose of the SSAF, organic molecules are defined as a group of covalently bonded molecules that contain carbon and at least one other element. We exclude CO, CO₂, CO₃²⁻, carbides, graphite, and steel from this functional definition of organic. Insoluble organic matter (IOM), as delivered by meteorites, and kerogen that originated from extinct life are also excluded because such substances consist of molecular compounds that are not soluble in polar or non-polar solvents. Examples of included organic compounds are mellitic acid, urea, CS₂, CCl₄, methane, carbon suboxide, Prussian blue, polycyclic aromatic hydrocarbons, and obviously organic species like lipids, amino acids, aldehydes, etc. To properly characterize any specific organic compounds in returned samples from Mars, it is necessary to use destructive techniques. The decision about whether to apply in situ techniques or bulk extraction-based techniques will require information from the previous investigations (i.e., Steps 1 and 3). In situ based techniques are less likely to conclusively identify any specific organic molecule because they rely on identification of only one type of information: either mass (e.g., Matrix Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI-MS), Time of Flight Secondary Ion Mass Spectrometry (ToF-SIMS)) or functional group (e.g., Raman spectroscopy, infrared spectroscopy, deep ultraviolet fluorescence). For some techniques, a substantial interference from the mineral matrix is expected. The advantage of in situ based techniques is that a result can be spatially associated with observed features. In situ analysis is the preferred approach in those cases where compelling morphological or chemical evidence of life is detected.
Bulk extraction-based techniques (e.g., Liquid Chromatography Mass Spectrometry (LC-MS), Gas Chromatography Mass Spectrometry (GC-MS)) provide two types of information for identification of organic molecules: the time it takes for a compound to pass through a chromatographic column (retention time), and the mass and fragmentation behavior of the molecule as measured by the mass spectrometer. Although these two types of information improve the reliability of identifying specific organic molecules, extraction-based techniques eliminate the direct spatial association with structural features of the sample and could dilute a localized low biomass signal (i.e., reduce the sensitivity). However, in those cases when the evidence for possible life is more widespread in a sample, extraction-based techniques could also increase the sensitivity because they typically sample a larger volume. It is acknowledged that organic molecules occupy a wide range of polarity space, and thus no single solvent will extract all compounds. This affects the amount of sample material that needs to be used for bulk analysis of each subsample. Further, any binding of life and organic compounds to mineral surfaces will require additional steps (such as hydrolysis) to release them (Mitra, 2004). Sample extracts from a specific subsample could be split for analyses by multiple complementary techniques. It is very important that all blanks be processed in the same manner as a sample of interest.
Molecular patterns - Step 5. If regions of organic-rich material are identified in a subsample, it is necessary to characterize the molecules present. Compound-specific measurements are required to search for molecular patterns. The targets of interest are small organic compounds, such as those found in biological monomers or biochemical intermediates, while the characterization of larger molecules is covered in Step 6. Molecular patterns are defined as a limited suite of organic compound abundances distinct from what would be produced abiotically with respect to structural diversity, chirality, and stable isotopes. For example, abiotic reactions tend to produce organic compounds at decreasing abundance with increasing molecular weight and show a lack of chemical specificity (e.g., biological vs. meteoritic amino acid abundances or Fischer-Tropsch hydrocarbons vs. even-numbered biological fatty acids). With the exception of certain meteoritic compounds, molecules produced from abiotic reactions show no chiral preference (Glavin et al., 2019). Glavin et al. (2019) provided a framework for using structural diversity, chirality, and stable isotopes together to evaluate possible biological origins of a compound, and they cautioned that any one of these indicators would be insufficient to indicate biology. Though this framework is science driven, it should be acknowledged that, for sample safety assessment purposes, the aim is to exclude biological origin. It is difficult to generate a predetermined life detection or life exclusion test from molecular patterns because of the likely co-existence of mixtures of several end-member organic compounds that include those from active and prolific biology, degraded biological compounds, degraded abiotic organic compounds, and abiotic chemistry. As in the previous step, sample extracts could be split for analysis by multiple complementary techniques, and blanks must also be analyzed in parallel.
Molecules most likely to be detected in this step are amino acids, nucleobases, sugars, lipids, and pigments.
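Chiral preference, one of the three indicators named in this step, is commonly expressed as an enantiomeric excess. The sketch below uses the standard definition, ee = 100 × (L − D) / (L + D); the function name and the example abundances in the usage note are illustrative and do not come from the SSAF. As the text cautions, a nonzero value on its own would not indicate biology.

```python
def enantiomeric_excess_percent(l_abundance: float, d_abundance: float) -> float:
    """L-enantiomeric excess in percent; 0 means racemic, +100 means pure L."""
    if l_abundance < 0 or d_abundance < 0:
        raise ValueError("abundances must be non-negative")
    total = l_abundance + d_abundance
    if total == 0:
        raise ValueError("no signal: both abundances are zero")
    return 100.0 * (l_abundance - d_abundance) / total
```

For instance, a compound measured at 75 units of the L form and 25 units of the D form has an L-excess of 50%, whereas equal abundances give 0%.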
Macromolecules - Step 6. The next step in the SSAF is designed to target polymeric or other large molecules to search for patterns in order to differentiate abiotic macromolecules, such as meteoritic insoluble organic material, from biological molecules, including, but not limited to, deoxyribonucleic acid (DNA), ribonucleic acid (RNA), proteins, and polysaccharides. For the purpose of the SSAF, a macromolecule is defined as an organic compound with molecular weight greater than 2500 Da (Dalton). This limit is derived from taking one half of the mass of the smallest known functional macromolecules in terrestrial biology. For example, the smallest prion is 300 kDa (Silveira et al., 2005), the smallest enzyme is 66 residues, or 6811 Da (Chen et al., 1992), and the smallest ribozyme is 16 nucleotides, or 5233 Da (Scott et al., 1995). This limit is also smaller than the smallest of the well-studied RNA in vitro replicating systems with a mass of about 15 kDa (Oehlschläger and Eigen, 1997). It should be noted that the smallest amyloid is an 8-residue domain, or 800 Da (Gazit, 2007; Sabate et al., 2015), and that amyloids, transmissible epigenetic regions in a larger protein, need to be in high enough concentrations to form fibrils (Sabate et al., 2015). Such a concentration of peptides would be strong evidence for life, but defining a macromolecule so broadly is likely to generate more false-positive detections than is useful. Metabolic-only hypercycle-like life (Eigen and Schuster, 1997) would lack informational macromolecules but is unlikely to be able to outcompete terrestrial biology and pose a threat. Nevertheless, such a biological system would show a strong positive signal for the previous investigations but fail the current investigation step, and must be investigated further for the potential for life. Similar to the previous steps, sample extracts could be split for analysis by multiple complementary techniques, and blanks must also be analyzed in parallel.
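The arithmetic behind the 2500 Da limit can be checked directly from the masses quoted above. In the sketch below, the constant and function names are hypothetical; the masses are those given in the text.

```python
MACROMOLECULE_LIMIT_DA = 2500.0

# Smallest known functional macromolecules, with masses as quoted in the text (Da).
SMALLEST_FUNCTIONAL_DA = {
    "prion": 300_000.0,
    "enzyme": 6811.0,
    "ribozyme": 5233.0,
}


def half_smallest_mass() -> float:
    """Half the mass of the smallest functional macromolecule listed."""
    return min(SMALLEST_FUNCTIONAL_DA.values()) / 2.0


def is_macromolecule(mass_da: float) -> bool:
    """Apply the SSAF working definition: molecular weight greater than 2500 Da."""
    return mass_da > MACROMOLECULE_LIMIT_DA
```

Half of the ribozyme mass is 2616.5 Da, which lies just above the stated limit of 2500 Da; the text does not say how the figure was rounded. Under this definition the 800 Da amyloid domain is not classed as a macromolecule, consistent with the reasoning about false positives.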
Molecules likely to be detected in this step include proteins.
Life as we know it - Step 7. Hallmarks of terrestrial life include ATGC-based DNA, AUGC-based RNA, proteins comprising 20 L-amino acids, lipids (i.e., fatty acids, phospholipids, etc.), and glycopeptides, such as peptidoglycan and polysaccharides (e.g., cellulose). Detecting life as we know it assumes that, if there is a living organism, it relies on the same chemical processes as terrestrial organisms and thus differs from the agnostic approach described in Step 8. To improve the sensitivity in what is expected to be a low biomass scenario requires the use of amplification steps (see Section 3.4). There are two types of life detection techniques that amplify specific targets of interest: cultivation-dependent and cultivation-independent, both with varying degrees of sensitivity and specificity (Table 2). The most useful techniques would be highly sensitive and have low specificity. Cultivation techniques theoretically have extremely high sensitivity in that one can grow a culture from a single cell, but the narrow bandwidth of any one combination of culture medium and growth condition makes culture-based approaches impractical (see also Section 3.3.2). Alternatively, one can apply cultivation-independent techniques with amplification steps, like the polymerase chain reaction (PCR) for the amplification of nucleic acids. PCR sensitivity is high because billions of copies of a gene of interest can be derived from as little as one template copy. The usual target gene encodes small-subunit ribosomal RNA (SSU rRNA), which is a component of all terrestrial cells. PCR can also be very non-specific in that primers for SSU rRNA genes have been designed to have homology to all, or nearly all, members of each of the three evolutionary domains of life. In fact, these universal primers are routinely used to characterize microbial communities on Earth, including those in extreme environments.
Sequencing of PCR-amplified SSU rRNA genes has revealed many new phyla (i.e., taxonomic rank in biology) of previously unknown life as we know it (e.g., Lloyd et al., 2018).
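The statement that billions of copies can be derived from a single template follows from the exponential nature of PCR. The sketch below uses the standard idealized model, copies = templates × (1 + efficiency)^cycles, in which an efficiency of 1.0 means perfect doubling per cycle. Real reactions plateau and run at lower efficiency, so this is an upper bound; the function names are hypothetical.

```python
def pcr_copies(templates: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized amplicon count after a number of cycles (no plateau modeled)."""
    if templates < 0 or cycles < 0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("invalid PCR parameters")
    return templates * (1.0 + efficiency) ** cycles


def cycles_to_reach(templates: int, target_copies: float, efficiency: float = 1.0) -> int:
    """Smallest cycle count at which the idealized copy number reaches the target."""
    if templates <= 0 or efficiency <= 0.0:
        raise ValueError("need at least one template and a positive efficiency")
    cycles = 0
    while pcr_copies(templates, cycles, efficiency) < target_copies:
        cycles += 1
    return cycles
```

Under perfect doubling, one template exceeds one billion copies after 30 cycles (2^30 is about 1.07 × 10^9).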
If life as we know it is detected in samples from Mars, the most likely explanation for this would be contamination from a terrestrial source. Contamination can occur during the assembly of the spacecraft and proceed all the way to analyses of returned material (McCubbin et al., 2019; Chan et al., 2020). Sequences of any PCR-amplified SSU rRNA genes derived from such samples could easily be compared to those from known spacecraft and spacecraft assembly facility contaminants (e.g., La Duc et al., 2014; Moissl-Eichinger et al., 2015; Koskinen et al., 2017; Regberg et al., 2020). Slim as the possibility is, there could exist life as we know it that is otherwise very different from known life on Earth, for example, life that evinces nucleic acid sequences and protein sequences that are so distinct from those in existing databases that one might conclude that they represent a life form that evolved on Mars rather than Earth. Such a conclusion would have to be made with the utmost care and with the hindsight that we are continually discovering novel terrestrial life forms and an increasing body of unannotated sequences in metagenomic datasets (i.e., microbial dark matter, Rinke et al., 2013). This has occurred partly through the development of new tools. Sequencing approaches revealed the existence of a third branch of life, the Archaea, only about 40 years ago; high-throughput DNA sequencing continues to unearth new microbial phyla. Also, the exploration of new, extreme habitats such as the deep ocean and the continental subsurface has greatly expanded our datasets. In other words, life as we know it is much more diverse than we knew just a few decades ago and may encompass even more forms by the time extraterrestrial samples are examined on Earth. Beyond self-replicating life, there are new viruses discovered on a monthly basis.
This includes some very different classes of viruses, such as the giant viruses found across widespread habitats and ecological systems (e.g., Brandes and Linial, 2019) and newly confirmed bacteriophages that employ an alternate nucleobase (2-aminoadenine) in the genome (Zhou et al., 2021; Sleiman et al., 2021).
Another consideration if life as we know it is detected will be to ask whether it is alive. This is especially important for the sample safety assessment but also impacts the science. A range of analytical methods is available for determining microbial viability, each with its own sensitivity and specificity (e.g., Emerson et al., 2017). Each method uses a single criterion for determining life vs. death along what is actually a continuum, given that cells proceed from active to inactive and subsequently senesce and eventually disintegrate. Besides cultivation, currently available viability assessments are made on the basis of metabolic activity, positive energy status, and the detection and abundance of ribosomes, RNA transcripts, or intact membranes. Viruses and other infectious nucleic acids do not have any universal genes, and hence, they have no non-specific PCR primers or genetic probes. PCR of the main functional motifs (e.g., polymerases, helicases, receptor binding domains) that are most conserved among virus families could be used to look for viral signatures. Virus-like particles can also be stained with general nucleic acid stains and viewed by epifluorescence microscopy (Suttle and Fuhrman, 2010), though results can be ambiguous. Prions would likely not be distinguished by mass spectrometric analyses since they are misfolded versions of naturally occurring host proteins, yet a sensitive cyclic amplification of a protein folding assay that tests for protein misfolding of common proteins (e.g., Saborio et al., 2001) could be applied.
The search for life as we know it is facilitated by a vast knowledge of terrestrial life and the development of powerful tools for life detection and characterization. All techniques require an extraction step (e.g., solvents and/or physical agitation) to release the target of interest (i.e., life-form) from the mineral matrix and are destructive for the potential life-form under investigation (except for successful cultivation). Compatibility of using aliquots of one extract for more than one technique might only be possible for a few cases.
Life as we know it is quite varied, and the full range of possible lifeforms and their structure or the range of conditions within which they can survive remains unknown. In assessing the possibility of life on another planet, it is necessary to take into account the possibility that alternative nucleic acids, amino acids, electron transfer systems, and high energy bonds for driving metabolic activity could exist. Investigating such additional considerations is described in the next step.
Agnostic life detection - Step 8. Analytical methods that do not presuppose knowledge of the chemistry of a target life form (agnostic approaches) are especially useful for analyzing samples that contain unanticipated complexity. There are different metrics for complexity in chemistry that are typically associated with specific analytical techniques. Detection of complexity that is not seen in controls or not anticipated by statistical models developed for agnostic analytical methods is interpreted as an indication of potential life that requires further study.
Earlier steps in the test sequence, especially Step 5, include analytical techniques that may inform an agnostic approach with regard to such features as particular classes of molecules, patterns within the molecular weights, or even intrinsic molecular complexity. There is a distinct need for novel techniques specialized for biochemical systems that do not share a chemical heritage with life on Earth. An expanded agnostic search for life could include molecules that are sufficiently complex but not associated with life on Earth (e.g., Marshall et al., 2021), discrete metastable accumulations of elements or isotopes that are not typical of abiotic geological or mineralogical process (e.g., Kempes et al., 2021), and disequilibrium redox chemistries that are not consistent with abiotic redox reactions (e.g., Frank et al., 2013).
To cast the widest possible net for life detection, the range of allowable interpretations for life must broaden. In addition to the expanded interpretive frameworks for typical methods, we present two concepts for agnostic life detection (see Section 3.3.3). Both require amplification and sequencing and explore the possibility for novel metabolisms that would not be detectable by typical biological methods (i.e., Step 7) yet also identify particles with surface chemistry characteristics typical of living organisms. These concepts could be used to recognize organic or inorganic evidence of life. Any concepts to be used, like those presented here, must address different forms of complexity (e.g., molecular vs surface binding complexities) and use orthogonal techniques in a sense that they use different interactions of analytical technique and sample. The logical consequence of this is also that one agnostic life detection methodology is not sufficient.

Diagnostic elements not explicitly used in the test sequence
Carbon. Life on Earth is based on carbon, which is present as a mixture of simple and complex organic molecules. As a guide to the search for life on Mars, it was assumed that carbon plays a similarly significant role. So, the search for life (extinct or extant) on Mars could be cast as the search for carbon. The rationale for searching for organic molecules is described in the work of Neveu et al. (2018). This search should be performed at the detection limits of available instrumentation, though it is acknowledged that the organic compounds released from a single cell in a given sample tube would be below the limit of quantitation (i.e., as required by Neveu et al., 2018) of the most likely instrumentation and that, in many cases, the detection of compounds of interest also means destroying them and disrupting any life present. A cell contains about 40 fg (femtogram, 10⁻¹⁵ gram) of organic molecules (Braun et al., 2016), and even the most sensitive technique likely requires at least hundreds to thousands of cells in the sampled volume in order to be detected (e.g., Summons et al., 2014; Bhartia et al., 2010; Braun et al., 1999). Thus, the corollary, that if no carbon is detected there is no life, does not hold true. Hence, no lower limit for carbon detection is set for the test sequence in the SSAF.
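The scale of the problem can be made concrete with the figure of about 40 fg of organic molecules per cell quoted above. In the sketch below, the function names are hypothetical and the detection limit passed to `cells_needed` is an arbitrary illustrative input, not a property of any particular instrument.

```python
import math

ORGANIC_FG_PER_CELL = 40.0  # approximate value quoted in the text


def organic_mass_fg(cell_count: int) -> float:
    """Total organic mass, in femtograms, contributed by a number of cells."""
    if cell_count < 0:
        raise ValueError("cell count must be non-negative")
    return ORGANIC_FG_PER_CELL * cell_count


def cells_needed(detection_limit_fg: float) -> int:
    """Minimum number of cells whose combined organic mass reaches a detection limit."""
    if detection_limit_fg < 0:
        raise ValueError("detection limit must be non-negative")
    return math.ceil(detection_limit_fg / ORGANIC_FG_PER_CELL)
```

On these numbers, 100 cells correspond to 4 pg and 1000 cells to 40 pg of organic material in the sampled volume, which illustrates why a non-detection of carbon cannot be read as the absence of life.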
Stable isotopes. The stable carbon isotopic compositions of living organisms on Earth are determined by the metabolic pathways that operate in the organisms. However, there is such a wide diversity of carbon isotopic compositions, and no single diagnostic composition or defined fractionation between nutrients and organisms, that the use of carbon isotopic composition as a diagnostic tool for life is substantially compromised. Combinations of isotopic compositions, for example, carbon, nitrogen, and sulfur, might help improve these limitations, though without knowing all the abiotic sources, sinks, and fractionation processes possible, this approach is still considered a weak diagnostic tool. Given that there are distinct isotopic differences between martian geological materials and geological materials from Earth (e.g., Barnes et al., 2020; Franchi et al., 1999; Franz et al., 2017; Füri and Marty, 2015; Shaheen et al., 2015), it is logical to presume that similar isotopic differences might persist between possible martian organisms and terrestrial organisms. Furthermore, it may be tempting to use isotopic composition as a means to differentiate between terrestrial and martian organisms. However, organisms often acquire the isotopic composition of their primary energy sources (i.e., their food) (e.g., Berry et al., 2015; Boschker and Middelburg, 2002; Jennings et al., 2017; Tykot, 2003); so terrestrial organisms that have subsisted on the elements in martian rock would likely inherit an isotopic composition like that of their environmental components. For these reasons, it is not advisable to rely primarily on the isotopic composition of potential biological material to identify martian life or differentiate whether any given life form discovered had a martian vs a terrestrial origin.
Solubility. Solubility is an important aspect in evaluating the potential harmful consequence of martian life on Earth. Terrestrial biology is solution based. Biochemical intermediates and macromolecules are soluble (or can be dispersed) in water or lipids. Cells and viruses can also be dispersed in water. Exceptions are various types of (naked) viruses or biological systems that could have minerals that cover the outside of a cell. With regard to the sample safety assessment, therefore, the concern is whether martian life is, and martian organic molecules are, soluble under physiological conditions. What is important for the test sequence is whether there are soluble organic molecules in extracts that can be detected and characterized by the analytical tools to discern what they are (i.e., via Steps 4-7). A separate investigation to assess the solubility of (organic) material in a sample is not considered required and would also unnecessarily consume sample material. For these reasons, solubility is not considered a standalone diagnostic tool but is indirectly addressed by the extraction processes for some chemical and biological analyses.
Metals. Living systems on Earth interact with a range of metals, including those that act as cofactors with enzymes. Roughly one-third to one-half of all known enzymes depend upon metal ions (e.g., Mounicou et al., 2009; Banci and Bertini, 2013). The most common metallic cofactors are Mg, Ca, Zn, Mn, and Ni; Fe and Cu are commonly redox-active; and Co and Mo interact with coenzymes (Banci and Bertini, 2013; Madigan et al., 2019). The concentration of metals in cells and their association with cellular organic compounds suggest that metal profiles might be useful biosignatures. Indeed, the systematic biological study of these metal profiles has been termed "metallomics," and the suite of metals associated with a cell is known as the "metallome." Problems with relying on metal data for a sample safety assessment include the following: metals can occur in concentrated form due to purely abiotic processes; cellular metal proportions vary among phylogenetically diverse microbial cells and in response to environmental parameters; and life on Earth evolved to use particular metal ions based, in part, on their availability, which suggests that life on Mars could potentially select for utilization of an entirely different set of metal ions than its counterparts on Earth.
For these reasons, metallomics is not considered a strong diagnostic tool for the SSAF.
Morphology. Morphological evidence of life can compound the challenges of life detection as cell-like forms can easily be produced by non-biological processes. The self-arrangement of lipid molecules with hydrophilic heads and hydrophobic tails in water is an example of how cell-like morphologies can form abiotically (e.g., Dworkin et al., 2001; Jordan et al., 2019). The chemical behavior and relative size of the hydrophilic heads of lipid molecules cause them to pack into a cell-like arrangement called a micelle, in which the hydrophilic heads face outwards toward the water and the hydrophobic tails are positioned toward the center of the 3-D micellar structure. Similar structures can be generated by polymers in water where a dense phase forms droplets within a more dilute phase and the droplets represent cell-like compartments. These entities, known as coacervates, were implicated in early origin of life models proposed by Alexander Oparin, who hypothesized that coacervates could have operated as protocells. Spontaneously formed cell-like structures can also leave residues that can be misinterpreted as life, what J.D. Bernal called "jokes of nature" (Urey, 1962). The early 1960s saw reports of "organized elements" in carbonaceous meteorites derived from asteroids. Claus and Nagy (1961) believed that these entities could be microfossils indigenous to the meteorite. Subsequent studies revealed that these entities were either exogenous materials, such as pollen and fungal spores that had contaminated the sample, or endogenous materials such as olivine crystals (Fitch and Anders, 1963). Observations of cell-like morphologies have also been used to suggest evidence of life in meteorites from Mars.
Scanning electron microscope (SEM) images of ALH84001 revealed segmented tubular structures that were interpreted as fossil nanobacteria (McKay et al., 1996), though later work implied such features were related to crystalline pyroxene and carbonate growth steps (Bradley et al., 1997). Cell-like morphologies have also led to misinterpretations of evidence for early life on Earth. The 3.5 Ga Apex Chert in Western Australia contains filament structures that were once interpreted as oxygen-producing cyanobacteria (Schopf, 1993), yet modern interpretations of the host rocks suggest that the structures originated in a hydrothermal vent rather than the originally proposed shallow sea floor setting (Brasier et al., 2002). The Apex Chert filament morphologies that were assigned to a biological origin have also been reinterpreted as carbon that may be organic compounds generated by Fischer-Tropsch-type reactions during hydrothermal serpentinization of ultramafic rocks (Brasier et al., 2002) and as organic molecules that adsorbed onto self-organized crystal aggregate biomorphs (Garcia-Ruiz et al., 2003) or exfoliated phyllosilicates (Wacey et al., 2015). The filamentous morphologies have also been reinterpreted as aggregates of hematite microcrystals (Marshall et al., 2011). It is worth noting that a biologic origin of the filamentous microstructures has not been demonstrably excluded, since they could represent remnant chemolithoautotrophs that lived in a hydrothermal setting (Schopf et al., 2018). In general, cell-like morphologies remain controversial because there are many processes in nature that generate life-like microscale objects that include tubular, filamentous, framboidal, and dendritic structures (e.g., Cosmidis and Templeton, 2016; Garcia-Ruiz et al., 2009; Kotopoulou et al., 2020; Muscente et al., 2018; Rouillard et al., 2018; McMahon et al., 2021).
Given the extensive history of incorrect interpretations for life based on morphological evidence alone, morphology is not considered a reliable stand-alone criterion for or against life, though it may be useful when associated with chemical information or to inform subsequent steps in the test sequence (e.g., Step 4).
Cultivation. The SSAF is in agreement with the position of the NRC Committee on Mars Sample Return Issues and Recommendations that "Attempts to cultivate putative organisms, or to challenge plant and animal species or tissues, are not likely to be productive" (NRC, 1997). The major limitations of this approach are that cultivation is not even possible for most terrestrial organisms and challenge tests are typically tailored to one or a few targets of interest. In addition, it is not considered advisable to multiply viable organisms that could have unknown and potentially harmful consequences. Therefore, cultivation is not considered a diagnostic tool used by the SSAF. As an indirect consequence and due to the limited diagnostic scope that covers the potential avenues of causing harm, animal and plant inoculation are ruled out as well.
3.3.3. Integrated test sequence and candidate instruments. Figure 3 describes the test sequence, and Fig. 4 explains the nomenclature used in the context of the test sequence. Rather than applying a scattergun approach (i.e., using all techniques available) or a piecemeal approach (i.e., focusing on individual steps or using a particular technique), it is critical to establish an ensemble of techniques and instruments capable of producing the information required for the safety assessment. Table 3 includes a number of techniques and instruments that could provide this information. The list of analytical instrumentation draws heavily on the list prepared by the MSPG2. Some techniques are complementary and overlap with other techniques, which, from a science point of view, is advantageous. From a safety assessment point of view, complementary or overlapping information acquired with different levels of sensitivity and specificity could lead to challenges in its interpretation if this is not considered in advance. In this context it is considered essential that, regardless of the instrumentation or techniques that are ultimately selected, their limitations are well-understood and their performance is known and dependable (see chapter 4.2 for a way to address these issues).
FIG. 3. Overview of integrated test-sequence. The test-sequence is a set of sequential investigations (i.e., steps), each one responsive to the previous steps. There is only one real gate, Step 8, in terms of stopping any further investigations and declaring a sample tube safe within the pre-defined level of assurance. Step 9 establishes a Hold & Critical Review for any sample investigations and executes a set of activities to evaluate all relevant data and the risk management measures, before deciding on the next steps.
FIG. 4. Nomenclature used in the context of a test-sequence. The elements of the test-sequence (i.e., investigations) address individual questions. Each investigation includes typically more than one measurement technique or instrument. The measurements provide the data that are discussed at the level of investigations. The safety assessment for one sample tube is based on the scientific assessments of the individual investigations carried out on the subsamples.
Steps 1 to 3 of the test sequence are concerned with identification of features that are not associated specifically with living entities, although they may have been formed by life. Measurements focused on analyses of textures, mineralogy, chemistry, and gases are the same types of analyses that are currently used to identify possible biological and biogenic features in geological materials. These first 3 steps employ routinely tested and well-understood analytical techniques, such as microscopy, spectroscopy, and chromatography. For some steps, only one technique might be applicable, though it is inevitable that, for many of the steps, several different instruments could deliver the required results. For example, given appropriate calibrations, the mineralogy of a sample could be determined by optical or electron microscopy, IR spectroscopy, Raman spectroscopy, or X-ray diffraction. It is also the case that the same instrument could deliver required information for several steps. For example, Raman spectroscopy can identify the mineralogy of a specimen (Step 3) and the types of organic molecules (Step 4) that it contains.
Steps 4 to 6 of the test sequence cover analysis of organic material, including organics that are not necessarily of biological origin. There is a wide variety of techniques and instrumentation available for the required analyses.
Step 4 is a measurement of the presence or absence of organic compounds. Moving from Step 4 to Step 6 employs techniques of increasing specificity to enable acquisition of the required information: if organic material is present, what are its characteristics? The information includes recognition of molecular patterns and isomeric variations associated with individual species (e.g., amino acids, lipids) as well as the presence of macromolecules (which may, or may not, be polymeric). When using the techniques available at the time of this writing, progression to Step 6 requires increasing invasion of the selected subsample through treatment with a sequence of solvents (i.e., polar, non-polar, acidic, alkaline) to produce solutions for introduction of the processed sample into appropriate analytical instrumentation. The main technique for analysis of organic species is mass spectrometry, though the differing chemistries and molecular masses of the components require specific methods to introduce samples into the analyzer. Examples in use currently include Gas Chromatography (GC), Capillary Electrophoresis (CE), and High Performance Liquid Chromatography (HPLC). Alternatively, high molecular mass compounds can be analyzed by imaging mass spectrometry techniques (e.g., MALDI-MS, Laser Desorption/Ionization Mass Spectrometry (LDI-MS), Desorption Electrospray Ionization Mass Spectrometry (DESI-MS), nano-DESI-MS and ToF-SIMS). These techniques enable in situ molecular analysis at high spatial resolution when coupled to optical and electron microscopy and constitute an area of research that is rapidly developing. Depending on the ionization method, the techniques can analyze a wide mass range (1-100,000 Da) with a spatial resolution down to less than 1 micrometer with minimal sample preparation (Watrous et al., 2011; Heeren, 2015; Bodzon-Kulakowska, 2016).
At Step 7, the question changes from how best to identify the characteristics of organic material to whether the material has come from a living (or dormant) biological form of life. The equipment proposed for Step 7 assumes that any organisms present have characteristics that produce analogous signals to those that we observe on Earth, and hence, they can be detected by the same instruments used for determination of terrestrial evidence of life.
Step 7, then, is looking to sequencing techniques for amplification of genetic material. Nucleic acids are relatively easy to detect, and moreover their genetic sequences can reveal a vast amount of information about the life forms that synthesized them. Variations of PCR can provide further information. For example, qPCR can quantify gene copy number (and thus estimate cell number), and reverse transcriptase PCR (RT-PCR) can bias the assay in favor of active, rRNA-rich cells. High throughput metagenomic and transcriptomic sequencing is increasingly being used to more fully characterize microbes and their activities, which requires greater amounts of nucleic acids for analysis since there is no amplification step. The technology has now advanced such that as few as 50 cells may be fully characterized (Minich et al., 2018). Additional analytical methods include single cell genomics (Woyke et al., 2017) and a mini-metagenomics approach, which can characterize the genomic features of 5-10 cells (Yu et al., 2017). Both single-cell genomics and mini-metagenomics require the amplification of DNA from cell(s) and are designed for samples with low cell abundance. These detection and characterization techniques are relatively mature, such that while we anticipate incremental improvements in the coming decade, the fundamental principles will likely still apply.
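As a minimal sketch of how qPCR yields a gene copy number and, from it, a rough cell count, the snippet below inverts a linear standard curve of the form Ct = slope × log10(copies) + intercept. The slope, intercept, and gene copies per cell are hypothetical calibration values chosen for illustration and are not taken from the SSAF; the copies-per-cell figure in particular varies among terrestrial organisms and would be unknown for unfamiliar biology.

```python
def copies_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert a qPCR standard curve Ct = slope * log10(copies) + intercept."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    return 10 ** ((ct - intercept) / slope)


def estimate_cells(copies: float, gene_copies_per_cell: float) -> float:
    """Convert a gene copy number into an approximate cell count."""
    if gene_copies_per_cell <= 0:
        raise ValueError("gene_copies_per_cell must be positive")
    return copies / gene_copies_per_cell


# Hypothetical calibration values (assumptions, not SSAF parameters).
# A slope near -3.32 corresponds to roughly 100% amplification efficiency.
SLOPE, INTERCEPT = -3.32, 38.0

copies = copies_from_ct(ct=28.04, slope=SLOPE, intercept=INTERCEPT)
cells = estimate_cells(copies, gene_copies_per_cell=4.0)
print(f"{copies:.0f} gene copies, about {cells:.0f} cells")
```

The estimate is only as good as the standard curve and the assumed copies per cell, which is one reason such a count is treated as an estimate of cell number and not a direct measurement.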
Step 8 is going beyond the familiar, that is, agnostic life detection with an amplification step and minimal assumptions. From a sample safety assessment framework point of view, this is the most important element in the test sequence and the only one with a clear gate. At the same time, it is the least defined step in terms of techniques and robustness that can only be addressed by targeted developments. Two concepts are described that could benefit from such targeted developments. The first concept is focused on the identification of non-canonical informational polymers. Current nanopore-based sequencing technology is well suited to expanding the search for informational molecule patterns beyond the specific amino acids and nucleotides conserved in contemporary extant life on Earth. This technique is amenable to analysis of the diversity of informational polymers that might have been common in a pre-RNA or RNA world, before the diversification of life and dominance of DNA- and protein-based life (Joyce, 2002). Polymerase evolution and design experiments have found six additional possible RNA alternatives and precursors, such as threose nucleic acid (TNA), hexose nucleic acid (HNA), and other xenonucleic acids (XNAs, which are nucleic acids not found in nature), all of which can store and transmit genetic information (Pinheiro et al., 2012). Strands of XNAs can also bind to target ligands with high affinity and specificity, which demonstrates the capacity for preferential folding that is associated with Darwinian evolution. While this study is speculative about the nature of RNA alternatives and precursors, there are many examples of DNA and RNA alternatives used in nature that include methylated forms of DNA (Moore et al., 2013), a 2,6-diaminopurine found in the DNA of bacteriophages (Sleiman et al., 2021), and over 120 modified forms of RNA found in ribosomal and transfer RNAs (Schaefer et al., 2017).
These exceptions to the highly conserved structures of DNA and RNA only strengthen the need for a capability that extends beyond characterization of the standard forms of DNA, RNA, and proteins when searching for unfamiliar life. Life as we know it is generally based on multiple classes of polymers with conserved sets of monomer units. The differences in number and sequence order of these monomers (their informational content) are what distinguish the structure and function of these types of polymers. Repetitive polymers are not necessarily informational or biological molecules, however; and abiotic polymers of carboxylic acids and amines (e.g., nylon or polyester) represent a case where neither is true. Biology with a unique origin may capitalize on the informational capability of unique semi-repetitive polymers based on alternative genetic alphabets (monomer chemical structures), which would require analysis of any polymer that contains a set of semi-repeating monomers. This type of sequencing is possible with nanopore (electrochemical) devices that can detect a broad range of water-soluble, charged molecules (nucleic acids, proteins, polyions, etc.) with simpler and faster sample preparation than required by other commercially available sequencing platforms. Nanopore sensing is "agnostic" in that it analyzes any linear polymer that enters the pore. Nanopore devices can distinguish between monomers with slight differences in shape, volume, or polarity and only require a template to tune for the voltage-driven translocation rate for identification (Branton et al., 2008). Nanopore analyses have been used to sequence RNA (Garalde et al., 2018), inosine-bearing oligonucleotides (Carr et al., 2017), methylated nucleobases (Rand et al., 2017; Simpson et al., 2017), and even proteins (Ouldahli et al., 2020).
The proposed concept can be used to interrogate returned samples for non-canonical polymers that could indicate a novel informational or catalytic polymer distinctive from those used by biology on Earth.
The second concept is focused on randomly generated oligonucleotides to build an informatics fingerprint that represents the binding complexity of a particle surface. The patterns of nucleic acid binding to surfaces, independent of their biological function, can be used to probe and report on any chemical environment, which opens up a new way to detect evidence of life. This concept (Johnson et al., 2018) targets the secondary and tertiary structures that oligonucleotides naturally form that can have affinity and specificity for a variety of molecules, from specialized biomolecules, such as peptides and proteins (e.g., Jayasena, 1999), to non-linear polymers, and even to inorganic substrates such as mineral and metal surfaces (Cleaves et al., 2011; Ye et al., 2012). Short DNA sequences (~15 nucleotides) or "aptamers" will bind to all types of chemical structures in complex samples, similarly to how antibodies bind to analytes. Unlike antibodies, however, aptamers are agnostic in that they comprise a nearly unlimited variety of binding specificities, whereas antibodies have been selected for recognition of limited types of biomolecules. Aptamer binding is driven by the surface chemistry of the analyte and limited only by chemical characteristics that discourage DNA binding, such as occurs in those regions of strong negative charge or when there is a deficit of aromatic or hydrophilic moieties. By accumulating large numbers of binding sequences that reflect different compounds in a mixture, statistical data analyses of aptamer motifs and sequence counts generate patterns associated with increasing levels of complexity that distinguish biological surfaces to be analyzed. This pattern recognition, known as "chemometrics," represents a set of protocols that can be applied to find patterns in chemical data sets (Nie et al., 2015), which in turn can be used to fingerprint agnostic evidence of life.
The statistically derived level of complexity in aptamer sequences can be analyzed to generate high-dimensionality chemometric score plots that reflect the complexity and assumed biogenicity of the resulting pattern.
To optimize the use of sample material, it may be possible to use a sample for more than one investigation or analysis. In the context of the SSAF, this approach would only be acceptable if it is shown that multiple uses of sample material cannot lead to an increased false-negative rate in the overall assessment. Figure 3 shows the entire test sequence. Investigations in Steps 1 and 3 inform two kinds of decisions: (1) the sequence of opening and investigating the individual sample tubes from Mars; and (2) the number, type, and locations for subsampling the sample in each sample tube.

Decision criteria
There are no yes/no criteria or specific threshold levels to reach a decision for these two steps. The decisions will need to be based on informed judgements. A positive test for organic compounds in Step 4 is suggestive of the potential for biology, although abiotic chemistry (e.g., that found in carbonaceous chondrite meteorites) or terrestrial contamination can result in the presence of organic compounds as well. A negative test for organic compounds in Step 4 does not necessarily indicate the complete absence of organic molecules. Rather, it would indicate that, if any molecular evidence of biology is present in the sample, it is in very small concentrations that are below the level of detection or is strongly bound to the substrate. A positive test for molecular patterns (Step 5) should be viewed as highly suggestive of the potential for active or recent biology. The abundance and the signal-to-noise of the patterns (for example, homochiral in all species vs. 20% enantiomeric excess in some species) must be compared to plausible abiotic formation and preservation processes for such compounds and the best current knowledge of the samples and martian environment. A negative test for molecular patterns with a positive test for organic compounds suggests that if biology is present, it is overwhelmed by organics or degraded organic material or that biology is absent. A positive test for macromolecular patterns should be viewed as highly suggestive of the potential for active or recent biology or terrestrial contamination. The nature of the macromolecules would need to be assessed in the next steps to determine whether they arise from terrestrial contamination or martian biology and if these macromolecules are suggestive of extant or preserved extinct biology.
A negative test for macromolecular patterns with a positive test for organic patterns suggests that, if biology is present, it is a metabolic hypercycle (Eigen and Schuster, 1997) or uses macromolecules that are resistant to analysis or that the life died and its macromolecules degraded before analysis. Best current knowledge of the samples, martian environment, and the environments the samples have experienced from collection to analysis must be used collectively to assess whether the molecular patterns observed could have originated from degraded biological macromolecules.
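The contrast drawn above between a homochiral signal and a 20% enantiomeric excess can be made concrete with the standard definition, ee = 100 × |L − D| / (L + D). The following is a minimal sketch; the function name and the input amounts are illustrative only, and interpreting any given value still requires the comparison with abiotic formation and preservation processes described in the text.

```python
def enantiomeric_excess(l_amount: float, d_amount: float) -> float:
    """Return the enantiomeric excess in percent: 100 * |L - D| / (L + D)."""
    total = l_amount + d_amount
    if l_amount < 0 or d_amount < 0 or total == 0:
        raise ValueError("amounts must be non-negative and not both zero")
    return 100.0 * abs(l_amount - d_amount) / total


# Illustrative amounts in arbitrary but matching units (assumed values).
homochiral = enantiomeric_excess(1.0, 0.0)   # only one enantiomer present
partial = enantiomeric_excess(60.0, 40.0)    # a 60:40 mixture
racemic = enantiomeric_excess(50.0, 50.0)    # equal amounts of both

print(homochiral, partial, racemic)
```

A 60:40 mixture corresponds to the 20% excess mentioned in the example, which shows how modest a departure from a racemic mixture that figure represents compared with a fully homochiral signal.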
Failure to detect organic compounds, molecular patterns, or macromolecules is not considered sufficient to declare a sample safe. Among other reasons for a negative detection (e.g., strong binding to the mineral matrix), the sensitivity of the available techniques could miss the equivalent organic molecules of hundreds to thousands of terrestrial cells (see Section 3.3.2). As a consequence, a negative detection in Steps 4-6 must be followed up with an amplification step (i.e., Steps 7 and 8).
Step 7 is important for two reasons: to detect any remnant terrestrial biological contamination in the samples and to detect evidence of martian life that is similar to terrestrial biology. It is expected that this step could lead to a number of positive events that are likely associated with terrestrial contamination. However, until any evidence for life can be clearly associated with terrestrial contamination, the conservative assumption (positive hypothesis) is that it could be martian biology. A negative detection in Step 7 would demonstrate that the samples are free from terrestrial biological contamination, within the detection limits of the analytical techniques. Even so, the potential for martian life to be present still cannot be excluded because this step is highly biased toward life as we know it. The only definitive gate is actually Step 8. If there is no evidence for the presence of martian life in the samples and there are no open, uncertain, or ambiguous issues remaining that could associate sample characteristics to martian biology, then the sample of a sample tube would be deemed safe within the pre-defined level of assurance.
In the case that potential evidence of extant martian life is detected, a Hold and Critical Review (HCR) must be initiated to evaluate the status quo before proceeding. This approach is similar to having a spacecraft enter a safe mode: until it is understood what triggered the safe mode and it has been concluded that it is safe to proceed, normal spacecraft operations would be suspended. Details of the HCR must be described in the Sample Safety Assessment Protocol (SSAP). The Critical Review must include a comprehensive and holistic evaluation of all relevant data acquired, the analytical techniques and specific instruments and equipment used, the methods and procedures used to control the safety of Earth (e.g., containment design and operations, sterilization procedures and criteria), and the overall risk assessment. Only then could it be decided as to whether the Hold would apply to investigations on subsamples from the one sample tube being analyzed, on samples in other sample tubes, and/or on samples already released from containment. Further investigations that are responsive to the data and the understanding at that time would likely be required to assess whether and how a hazard analysis could be executed. While not directly a concern for the safety of a specific sample, finding evidence of extinct martian life must also lead to an HCR. In such a case, the overall risk posture reflected in the level of assurance must be reviewed. Establishing the initial level of assurance typically follows a conservative approach. However, there is a significant difference between the a priori assumption that there is life on Mars and having evidence that life emerged on Mars. The need for this is further illustrated by samples from Earth that simultaneously contain evidence of both extinct and extant life (e.g., surface exposed rock on Earth that contains evidence of ancient fossils and viable microbial inhabitants).
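The gating logic described for Steps 7 and 8 and for the HCR can be sketched for a single subsample as follows. This is an illustrative reading of the text, not a protocol defined by the SSAF: the field names and outcome labels are hypothetical, and the NOT_CLEARED outcome for unresolved issues is an assumption, since the text states only that such a sample cannot be deemed safe.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SAFE_WITHIN_ASSURANCE = "safe within the pre-defined level of assurance"
    HOLD_AND_CRITICAL_REVIEW = "Hold and Critical Review"
    NOT_CLEARED = "open issues remain"  # assumed label, not from the SSAF


@dataclass
class Findings:
    step7_positive: bool          # signal of life similar to terrestrial biology
    step7_is_contamination: bool  # positive clearly traced to terrestrial contamination
    step8_positive: bool          # signal from agnostic life detection
    extinct_life_evidence: bool   # evidence of extinct martian life
    open_issues: bool             # open, uncertain, or ambiguous issues remain


def assess(f: Findings) -> Outcome:
    # Steps 4-6 are deliberately not inputs here: the text states that a
    # negative result in those steps is not sufficient to declare safety.
    unexplained_step7 = f.step7_positive and not f.step7_is_contamination
    if unexplained_step7 or f.step8_positive or f.extinct_life_evidence:
        return Outcome.HOLD_AND_CRITICAL_REVIEW
    if f.open_issues:
        return Outcome.NOT_CLEARED
    return Outcome.SAFE_WITHIN_ASSURANCE


example = Findings(step7_positive=True, step7_is_contamination=False,
                   step8_positive=False, extinct_life_evidence=False,
                   open_issues=False)
print(assess(example).value)
```

The sketch treats an unattributed Step 7 positive conservatively, consistent with the positive hypothesis that it could be martian biology until it is clearly associated with terrestrial contamination.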
The HCR approach would have to be reflected in agreements that cover the release of samples from the SRF and their subsequent use. An important aspect in terms of managing expectations is to acknowledge that an HCR might be a recurring event due to possible terrestrial biological contamination. Comprehensive contamination knowledge (CK) could expedite the HCR. The HCR and any decision associated with it must be performed by an independent team that has decision authority for executing the SSAP (see additional details in Section 4.1). Figure 5 describes some of the possible outcomes of going through the test sequence. The element of terrestrial biological contamination is specifically highlighted in several cases.

Implementation of the SSAF
The implementation of the SSAF is focused on the safety assessment of each individual sample tube. As already pointed out in Section 3.3.1, the most effective approach is to start with the sample tube(s) that have the highest pre-test probability with respect to finding martian life. In the case in which a dedicated dust sample is returned to Earth, this might be a good starting point. The result of such an assessment can inform the pre-test probability for other sample tubes.
FIG. 5. Eight generic cases for one sample tube are shown in this figure. This example is based on a situation where 14 negative results running the test-sequence on every single subsample are required to achieve a pre-defined level of assurance. Cases A and C are straightforward. Case B represents a situation where we have either terrestrial contamination or evidence for abiotic martian organics. Cases D and E represent a situation where we have evidence of martian life that is quite different from terrestrial life. Cases F, G, and H represent a situation where we either have terrestrial contamination, evidence of martian life that is quite similar to terrestrial life or a combination of both.
In estimating the amount of sample material needed to inform the safety assessment, it must be taken into consideration that the sample safety assessment and many mission science objectives are complementary, overlapping, and apply similar methods. None of the samples used to inform the sample safety assessment should be considered wasted, as scientifically useful data will be generated and will inform each step. The amount of sample material required is naturally inversely proportional to the amount of biological material present and strongly depends upon the extraction processes and analytical techniques available at the time. Considering collectively the current capabilities for measurements on natural samples, the number of subsamples per sample tube required to reach a certain level of assurance that no martian life is in a sample (Section 3.2), and the quality control necessary to achieve confident results (Section 4.1), it is estimated that hundreds of milligrams to a few grams of sample mass per sample tube are required to inform the safety assessment. It is expected that targeted developments in extraction processes and advances in analytical techniques would further reduce the required amount of sample material to be processed and inform the sample safety assessment.
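To illustrate how a pre-test probability, the sensitivity of the test sequence, and the number of negative subsamples could jointly determine a level of assurance, the sketch below applies Bayes' rule to repeated negative results. It is a simplified model and not the method of Section 3.2: it assumes that subsample results are independent, that life (if present in the tube) is equally detectable in every subsample, and that the numerical inputs shown are purely hypothetical.

```python
def posterior_after_negatives(pretest: float, sensitivity: float,
                              specificity: float, n: int) -> float:
    """P(life present | n independent negative results), by Bayes' rule."""
    if not (0.0 < pretest < 1.0):
        raise ValueError("pretest must lie strictly between 0 and 1")
    if not (0.0 < sensitivity <= 1.0 and 0.0 < specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in (0, 1]")
    missed = pretest * (1.0 - sensitivity) ** n         # life present, every test misses it
    true_negative = (1.0 - pretest) * specificity ** n  # life absent, every test negative
    return missed / (missed + true_negative)


def negatives_required(pretest: float, sensitivity: float, specificity: float,
                       target: float, max_n: int = 10_000) -> int:
    """Smallest number of negative results that brings the posterior to target or below."""
    for n in range(max_n + 1):
        if posterior_after_negatives(pretest, sensitivity, specificity, n) <= target:
            return n
    raise ValueError("target not reachable within max_n subsamples")


# Hypothetical inputs (assumptions for illustration only).
n_needed = negatives_required(pretest=0.5, sensitivity=0.5,
                              specificity=1.0, target=0.01)
print(n_needed)
```

Under this model, a lower per-subsample sensitivity or a higher pre-test probability increases the number of negative subsamples needed, and with it the sample mass consumed, which is consistent with the dependence on extraction processes and analytical techniques noted above.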
In addition to the four elements of the SSAF (Fig. 1), there are a number of implementation constraints that are part of the SSAF and are described in the following section.

Quality control
The consequence of an incorrect safety assessment could range from reduced sample access to harmful impacts on Earth's systems. To increase confidence in the results of the assessment, it is essential to have decision-critical investigations in the test sequence performed independently by more than one team. The detailed implementation of this approach will depend on the nature of the investigation and the associated measurements. There are three possible implementation approaches: (1) investigations where a single measurement (e.g., an XCT scan of the sample) is deemed to be determinative, and two different teams independently analyze and interpret the data; (2) investigations where two independent measurements utilize a single technique, for which significant expertise and experience are required to obtain reliable results (e.g., two GC teams analyze aliquots of an extract); and (3) investigations where two independent measurements made with complementary techniques are utilized to increase the predictive value of the results (two different techniques are used, e.g., spectroscopy and spectrometry, to analyze aliquots of an extract).
The use of complementary techniques increases the probability that a given result is true if the datasets agree and that they will trigger further investigations if disparate. Although use of the same technique more than once compensates for intra-technique variability, it is more critical to address the measurement accuracy in the safety assessment context (since precision is already accounted for in the selection and validation of the chosen techniques). The execution of the test sequence must follow an approach typically used in science and engineering when assessing public safety or environmental impacts, namely deploy two independent teams to perform the measurements or data analysis (see three cases above), with a third independent team responsible for decision making. This approach must be considered in the planning of opportunities for science teams that will cover the objective-driven science investigations on the samples, some of which will inform the safety assessment, and in the planning and operation of the associated infrastructure (e.g., SRF).
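The two-team arrangement can be sketched as a simple reconciliation step whose result is handed to the third, decision-making team. The names below are hypothetical and the reduction of each team's result to a single detect or non-detect flag is a simplifying assumption; real investigations would carry uncertainties and supporting data.

```python
from enum import Enum


class Reconciliation(Enum):
    CONCORDANT_NEGATIVE = "both teams report no detection"
    CONCORDANT_POSITIVE = "both teams report a detection"
    DISCORDANT = "teams disagree"


def reconcile(team_a_detects: bool, team_b_detects: bool) -> Reconciliation:
    """Compare the independent results of two teams for one investigation."""
    if team_a_detects != team_b_detects:
        return Reconciliation.DISCORDANT
    if team_a_detects:
        return Reconciliation.CONCORDANT_POSITIVE
    return Reconciliation.CONCORDANT_NEGATIVE


def needs_further_investigation(result: Reconciliation) -> bool:
    """Disparate datasets trigger further investigation before any decision."""
    return result is Reconciliation.DISCORDANT


result = reconcile(team_a_detects=True, team_b_detects=False)
print(result.value, needs_further_investigation(result))
```

Keeping the reconciliation separate from the decision mirrors the separation of roles in the text: the measuring teams report, and a third independent team decides.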
All analytical methods used for the sample safety assessment must be documented and independently reviewed in advance. Any variations that occur as a single incident or result in a change of the test must be assessed and their consequences recorded at the time. ISO 17025 (ISO 17025, 2017) and equivalent standards are the mark of a laboratory with good quality systems, record keeping, and general operation, including appropriate staff training. For laboratories that test samples from Mars, these standards are a sound foundation on which to build to ensure reliable results. To allow scientific inquiry to follow a thread that is informed by successive findings that may not have been foreseen, while at the same time maintaining a high standard of quality control, record keeping, data integrity, and data security, it is essential to apply methods of Good Laboratory Practice (GLP) (e.g., OECD, 1998) and Hazard Analysis and Critical Control Points (HACCP). These are routine methods used in the clinical setting and in industrial process quality control. HACCP was developed for the food industry, including to assure the safety of food products for the U.S. space program, but can be adapted for almost any complex operation in which safety risks and potential risks to a product are concerned (e.g., Hulebak and Schlosser, 2002). An HACCP-like process is a dynamic way to predict problems in advance and put in place risk-mitigating steps before any experiment or process is performed. In a general application, these risks may be anything from instrument failure and external contamination of a key sample or product to a human mistake, and can include factors that affect safety, quality or scientific output, and integrity. In any process there are steps where something could go wrong, and HACCP-like analysis concentrates on these points. Inevitably "stuff happens," and the lessons from these events are used to update the HACCP-like assessment and mitigations in a continuous fashion.
Traceable records that maintain a log of any alteration in procedures or risk assessment performed and by whom are kept throughout. The HACCP-like process can be applied theoretically while mapping the process. For the SSAF, this must be supplemented through full scale sample safety assessment simulations with analogue materials and implementation of the final test sequence. HAZOP (Hazard and Operability) is an example implementation of the basic HACCP process for general industrial use and is explained in detail in the IEC 61882 standard (IEC 61882, 2016). Figure 6 shows the basic steps involved in setting up a HACCP-like system that can be adapted to fit the workflows around handling samples from Mars. It shares some common features (e.g., documentation) with the ISO 17025 standard, and the two processes can be combined. Aspects of the HACCP-like process can be developed alongside the set up and calibration of the instrumentation that, under ISO 17025, must be performed at the actual site of use before any genuine part of the test sequence is undertaken.
Humans are a key factor in creating errors, most of which arise from a lack of training, a lack of experience in a particular situation, poor management practices that place workers in situations they cannot control and that can foster a fear of raising concerns, or a simple lack of coordination within the team. A related cause is ergonomic: poorly laid out displays, inaccessible controls, poor seating, or repetitive manual tasks predispose operators to errors and disasters. Aircraft flight crews and space crews are trained to work together, especially if something goes wrong, and are usually deployed in teams where such skills and leadership are essential. Human factors are a critical part of the SSAF and must take into account deliberate wrongdoing as part of a framework for detecting and mitigating an adverse event.
The sample safety assessment must always consider what will ultimately be done with a sample. Information gathered even after the samples are sent for further analysis (i.e., after they are declared safe) could indicate the need to update the safety assessment for all or specific samples. Therefore, if the sample handling and analysis protocols are modified from those in the submitted program of work, the sample safety assessment protocol will also need to be reviewed to ensure it is still valid for the new circumstances.

Analogue test program
Making an informed judgement for subsampling and optimizing the capture rate require high-resolution physical and chemical information about the samples, which is covered by Steps 1 and 3 of the test sequence. This information is a necessary, though not sufficient, part of the process of rendering a robust and credible informed subsampling decision. The second and essential part of the process requires correlation of this information with knowledge base information obtained from subsampling terrestrial samples. Establishing such a knowledge base requires an analogue test program tailored to the types of materials expected from Mars, analyzed with the types of instruments and the kinds of measurements that are planned to be used to establish the physical and chemical sample information from returned martian samples.
Steps 1 and 3 are only part of the test sequence. The overall test sequence is a series of individual tests of different types, each with its respective sensitivities and specificities. In terms of the Bayesian analysis, what matters is the overall sensitivity and specificity of the test sequence. This overall sensitivity and specificity might be derived from combining the sensitivity and specificity of the individual measurements of the test sequence. Although it is important to establish the sensitivity and specificity of the individual measurements to optimize the use of samples and maximize the incremental flow of information, it is not likely that a pure mathematical combination of these values would correctly reflect the sensitivity and specificity of the overall test sequence. By exercising the test sequence on terrestrial samples that represent the expected material from Mars using an analogue test program, it is possible to optimize details of the test sequence (i.e., by guiding selection of appropriate measurement techniques and instrument ensembles as well as the order of applying them on samples). General considerations for optimization of the test sequence are: (a) selecting instruments that will generate the highest true-to-false positive ratios; and (b) ordering the sequence of analyses to start with instruments with a higher sensitivity before moving to instruments with higher specificity.
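As a minimal sketch of the "pure mathematical combination" mentioned above, the following Python snippet combines per-test sensitivities and specificities under two illustrative assumptions that are not part of the SSAF: the tests are conditionally independent, and a sample is flagged if any single test is positive. The function name and the numerical values are hypothetical. As stated above, such a combination is unlikely to reflect the true performance of the overall test sequence, which is why an analogue test program is needed.

```python
from math import prod


def combine_any_positive(sensitivities, specificities):
    """Overall (sensitivity, specificity) when a sample is flagged if ANY test is positive.

    Assumes the individual tests are conditionally independent
    (an illustrative assumption, not a property of the real test sequence).
    """
    # A true positive is missed only if every test misses it.
    overall_sensitivity = 1.0 - prod(1.0 - s for s in sensitivities)
    # A true negative is cleared only if every test clears it.
    overall_specificity = prod(specificities)
    return overall_sensitivity, overall_specificity


# Hypothetical values for a three-test sequence.
sens, spec = combine_any_positive([0.80, 0.70, 0.90], [0.95, 0.98, 0.99])
print(f"overall sensitivity = {sens:.4f}, overall specificity = {spec:.4f}")
```

Under these assumptions, adding tests raises the overall sensitivity but lowers the overall specificity, which illustrates why instrument selection and ordering matter.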
FIG. 6. The example described here aligns with the ISO 17025 process in many areas, so that the same documentation can be shared between the two systems. The difference is that HACCP-like analysis covers risk in addition to defining a set of documented processes and allows that risk to be mitigated by forward planning and regular review. This approach is described for application on the detailed SSAP, derived from the SSAF. The term hazard in the context of the HACCP-like quality control measure proposed describes hazards to the SSAP process and not the potential biological hazard of material from Mars.
To illustrate the utility of this approach to optimize the test sequence, a common issue for almost all investigations that target the organic content of samples is the separation of organic molecules, cell debris, or whole life forms from a mineral matrix. The best sensitivity and specificity of the analytical techniques used in the test sequence can only be employed if the targets of interest (i.e., molecules, cell debris, cells) are presented in a useful form. A poor extraction efficiency increases the chance for false negatives, regardless of how good the analytical techniques are. Some mineral matrices are well known for their ability to retain organic compounds, owing to their surface properties and structure, for example, clay minerals. Other matrices are known for their propensity to attract organic compounds as a function of their chemical properties and structure, for example, macromolecular organic matter and carbonaceous materials. Mineral matrices that retain organic compounds are actually used on Earth in analytical and industrial processes as fluid filters to remove organic compounds, for example, clay and activated carbon gas filters for analytical chemistry. For organic compound extraction chemistry, the maxim is that "like dissolves like," that is, to isolate a compound it must be matched with a solvent of similar properties. However, due to the different polarities of organic molecules, for example, amino acids are relatively polar, while hydrocarbons are relatively non-polar, no single solvent system is able to extract all organic molecules in a sample (Mitra, 2004). The type of matrix (rock) that the organic molecules are trapped in also affects the solubility of the analyte (Mitra, 2004). It is possible to use mixtures of less and more polar solvents to extract organic compounds of different polarity or change the solubility of the analyte using ultrasonication of the solvent or to use supercritical fluids (Mitra, 2004). In general, extraction protocols need to consider the full range of polarities presented by the potential target materials. Streamlining the extraction protocols, in particular consideration of the solvent strength for sharing an extract for multiple different analyses, would be beneficial to limit the use of sample material and support the independent analysis approach required in the frame of the test sequence.
In this context, it is important to be aware that some solvents might interfere with other types of analysis (e.g., phenols used for certain omics investigations might interfere with other organic analyses) or exhibit inhibitory effects (e.g., denaturing). The complete extraction of organic compounds from highly retentive matrices may be unachievable, though achieving the most efficient extraction possible and knowing the extraction efficiencies are essential for the sample safety assessment. It is also necessary to exercise all four elements of the safety assessment (not only the test sequence) in end-to-end tests that utilize analogue samples. These end-to-end tests must include blind testing as an integral and essential part of the quality control measures. The added value of blind testing, however, can only be realized if the blind tests are well prepared and properly executed (e.g., Ginsburg, 1997; Casertano et al., 2008; Evans, 2014; van Driel et al., 2019). Such end-to-end tests can be used to optimize the sample flow and help to estimate the resources needed to perform the safety assessment. In addition, end-to-end tests serve to educate and train the personnel and test the various elements of the infrastructure, equipment, and instrumentation necessary to conduct the sample safety assessment.
In summary, there is a need for a tailored analogue test program that covers the following components to transition from the SSAF to an SSAP:
1. Assess and improve the capture rate and the associated subsampling strategy.
2. Optimize the selection of the instrument ensemble to be used for the test sequence and estimate the overall sensitivity and specificity, including the efficiency required to extract evidence of life from the host materials.
3. Exercise all four elements of the safety assessment, including blind testing, to optimize processes, test equipment and infrastructure, and train personnel and science teams.
The selection of the analogue samples to be tested needs to be based on the specific environment (e.g., Cockell et al., 2019) and the information obtained during sample collection on Mars. The analogue materials could include synthetically made samples, natural terrestrial analogues, and meteorites. Due to the special role of clays (see Section 3.2), the assumptions about random and targeted subsampling must be verified as part of the analogue test program. The analogue samples must include both negative and positive controls. These could include sterilized and/or organic-free analogue samples (negative controls) and samples doped with microbes and/or organic molecules or well-characterized natural terrestrial analogues known to contain life and/or organic molecules (positive controls).
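One way the outcomes for positive and negative controls could feed into performance estimates is sketched below in Python. The function name, the input layout (one boolean per control, True meaning the test sequence flagged it), and the example counts are hypothetical and serve only to illustrate how sensitivity and specificity could be estimated from a blind test with analogue samples.

```python
def control_performance(positive_results, negative_results):
    """Estimate (sensitivity, specificity) from control outcomes.

    positive_results: list of bools, True if a positive control was flagged.
    negative_results: list of bools, True if a negative control was flagged.
    """
    if not positive_results or not negative_results:
        raise ValueError("both positive and negative controls are required")
    true_positives = sum(positive_results)
    false_positives = sum(negative_results)
    sensitivity = true_positives / len(positive_results)
    specificity = (len(negative_results) - false_positives) / len(negative_results)
    return sensitivity, specificity


# Hypothetical blind-test outcome: 18 of 20 doped samples flagged,
# 1 of 25 sterilized samples flagged.
positives = [True] * 18 + [False] * 2
negatives = [True] * 1 + [False] * 24
sens, spec = control_performance(positives, negatives)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With small numbers of controls, such point estimates carry wide uncertainty, so a real program would presumably also report confidence intervals.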

Terrestrial biological contamination
Martian meteorites have been shown to be colonized by terrestrial organisms (Toporski and Steele, 2007). In the same way, terrestrial biological contamination of martian samples returned to Earth by the MSR Campaign would reduce the specificity of the overall safety assessment test sequence (see Section 3.1.1). It might also lead to a reoccurring Hold and Critical Review (HCR) of activities on the samples until the root cause of a detection can be clearly identified as terrestrial biological contamination (see Fig. 5). The contamination baseline for returned martian samples must be established from the CK obtained during the assembly of the various spacecraft that will fly as part of the MSR Campaign, along with blanks and witness samples returned with the martian samples. Of particular importance in this regard are the M2020 Witness Tube Assemblies (WTA), which are opened and sealed during different mission phases including pre-launch, launch, cruise, and Mars Entry Descent and Landing (EDL), and M2020 surface operations, as well as the M2020 drillable blank which can provide CK of the M2020 drilling operation. CK samples should also be collected during the construction of the SRF to establish a complete archive of potential contaminants, including biological contaminants that may come into contact with martian samples during sample analysis. Minimizing terrestrial biological contamination in the samples and a higher CK would reduce uncertainty in the scientific interpretation of the data and ease handling and treatment of the samples.
To differentiate between martian and terrestrial origin, the field of omics will play an essential role. The use of transcriptomics, proteomics, and metagenomics can provide a predictive comparison of material at mRNA, protein, and DNA levels, respectively. The biological CK samples (e.g., fallout coupons, spare hardware, microbial DNA and isolates collected from the assembly and test phases during pre-launch, etc.) can be used as a reference library of pre-flight conditions that can be directly compared to any signals from the potential biological material. The direct comparison may allow for assessments in expression profiles, unique or modified proteins, and changes in the DNA that occur during spaceflight. These advanced molecular techniques are commonly used to study both complex microbial communities and the evaluation of environmental stressors, such as the space environment and catabolism of pollutants in bioremediation (e.g., Biljani et al., 2021; Chandran et al., 2020; and Kumar, 2020).

Life detection and machine learning
The sample safety assessment defined by the SSAF depends upon the simultaneous interpretation of numerous variables and criteria as well as proper statistical treatment of large datasets to exclude false negatives and false positives. In several ways, this challenge is similar to that of biogenicity tests for putative traces of fossil life in the Earth's rock record. Classical tests of biogenicity in deep time involve the evaluation of multiple biosignature characteristics and context- and contamination-related criteria that need to be satisfied to substantiate a claim (e.g., Buick, 1990; Schopf et al., 2010; Brasier and Wacey, 2012; Neveu et al., 2018). The number and combinations of these characteristics and criteria, however, are subject to debate, since it is easy to include false positives or exclude false negatives. In reality, there are no clear yes/no answers in biogenicity tests since all biosignatures have a certain probability that life created them and a certain improbability that an abiotic process created them (Des Marais et al., 2008). The qualitative nature of many individual biosignatures (e.g., morphological characteristics) adds further ambiguity, as they are often not standardized and depend on the interpretation and experience of individual observers. To overcome this inherent uncertainty in life detection and during efforts to exclude a biological origin, several recent studies have expressed the specific need for standardized criteria and a more quantitative approach in data treatment (Chan et al., 2019; Neveu et al., 2018; Rouillard et al., 2020, 2021). The use of multiple well-defined and quantifiable variables for life detection could greatly benefit from recent advances in statistical methods and machine learning methods to find commonalities in large datasets.
While standard statistical methods passively draw inferences from a dataset, machine learning methods create mathematical models based on training data and use this "experience" to find predictive patterns in new datasets (Jordan and Mitchell, 2015; Hastie et al., 2017). Typical tasks carried out by machine learning include classification, regression, ranking, clustering, and pattern recognition. A so-called supervised machine learning algorithm builds a mathematical model based on a training dataset that contains both input and desired output. The program thus responds to feedback. In contrast, an unsupervised machine learning algorithm builds a mathematical model without any desired output. It finds patterns in a dataset, and then attempts to find similar ones in newly supplied datasets. With the ongoing increase in computing power and the development of artificial neural networks, these collective methods are rapidly improving. Machine learning is currently transforming the field of medical diagnostics (e.g., Aggarwal et al., 2021), and is widely applied for face recognition in forensic applications (e.g., Phillips et al., 2018), general image recognition (e.g., Krizhevsky et al., 2017), use of large datasets of space missions (e.g., Kronberg et al., 2021), and speech recognition (e.g., Hinton et al., 2012). Some machine learning methods have already found applications in the detection and classification of life. For instance, convolutional neural network classification models were developed and trained to perform visual palynological identification and taxonomic classification of fossil pollen (Romero et al., 2020). For the purpose of the SSAF, there are three categories to be considered: category 1 "no life," category 2 "life as we know it," and category 3 "life as we don't know it" (Buick, 1990; Neveu et al., 2018; Schopf et al., 2010; Brasier and Wacey, 2012; Rouillard et al., 2021a).
There is, however, a growing literature on "false biosignatures," i.e., physicochemical processes that lead to the formation of minerals or molecules with life-like features (Cosmidis and Templeton, 2016; Garcia-Ruiz et al., 2003; Jordan et al., 2019; Kotopoulou et al., 2021; Rouillard et al., 2018, 2021b; McMahon et al., 2021). It may thus prove difficult to define clearly the exact difference between category 1 "no life" and category 2 "life as we know it." An important future scientific challenge lies in properly defining this difference, and subsequently creating widely accepted training datasets for categories 1 and 2 that can be used for developing a machine learning protocol. This supervised machine learning protocol, however, would not work for category 3 "life as we don't know it," since training datasets, as well as a desired output, are fundamentally missing from Earth-based analogue samples. It may be possible, though, to identify this third category by exclusion of the first two categories. This effectively involves searching for levels of complexity that are incompatible with "life as we know it" (category 2) and with the absence of life (category 1). Thus, a complete machine learning protocol would start with a supervised algorithm 1 to find "no life." If it fails to find "no life," then it is possible that there is some form of life there. Supervised algorithm 2 would then be applied to find "life as we know it." If both algorithms fail, then the sample must fall into category 3 "life as we don't know it." An unsupervised machine learning algorithm can potentially be applied to find as of yet unidentified patterns that may be assigned as provisional biosignatures, which can then be searched for in other samples.
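The two-stage protocol by exclusion described above can be sketched in Python as follows. This is an illustration of the control flow only: the two classifiers are passed in as plain functions, and the toy threshold rules and feature layout are hypothetical stand-ins for supervised models that would have to be trained on accepted datasets for categories 1 and 2.

```python
from typing import Callable, Sequence

Features = Sequence[float]
Classifier = Callable[[Features], bool]

NO_LIFE = 1
LIFE_AS_WE_KNOW_IT = 2
LIFE_AS_WE_DONT_KNOW_IT = 3


def categorize(sample: Features, is_no_life: Classifier, is_known_life: Classifier) -> int:
    """Assign a category by exclusion, following the two-stage cascade."""
    if is_no_life(sample):        # supervised algorithm 1
        return NO_LIFE
    if is_known_life(sample):     # supervised algorithm 2
        return LIFE_AS_WE_KNOW_IT
    return LIFE_AS_WE_DONT_KNOW_IT  # neither algorithm matched


# Hypothetical stand-ins for trained models: feature 0 is a "complexity"
# score and feature 1 a "known biosignature" score, both in [0, 1].
def toy_no_life(sample: Features) -> bool:
    return sample[0] < 0.2


def toy_known_life(sample: Features) -> bool:
    return sample[1] > 0.8


for features in ([0.1, 0.0], [0.9, 0.9], [0.9, 0.1]):
    print(features, "-> category", categorize(features, toy_no_life, toy_known_life))
```

In practice the classifiers would more likely return probabilities than hard decisions, and any sample falling into category 3 would be a candidate for human review rather than an automatic conclusion.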
In general, it is not envisioned at this point that such work will be entirely dependent on these forms of artificial intelligence. At this time, an experienced human observer is superior to a set of algorithms. However, the use of these machine learning methods in the treatment of large datasets could assist in finding patterns and focus attention on specific features of astrobiological interest. For instance, in a large set of close-up images of martian sediments, it may be useful to have a supervised machine learning program with a category 1 algorithm for "no life" that has been trained in grouping crystal types, grain sizes, and distinguishing common sedimentary patterns, followed by a category 2 algorithm for "life as we know it" that checks for a list of known biosignatures. A human observer can then discard any "non-life" data and focus entirely on samples identified with "life as we know it" or define subsets of data that can be studied for "life as we don't know it." The most important application of machine learning for the sample safety assessment may be data reduction and generation of data subsets for subsequent study by the science and safety assessment teams.

Conclusions
The SSAF would be incomplete without pointing out the importance of transparent and professional risk communication (e.g., ESF, 2012). The information presented and the risk perception of the various stakeholders will evolve over time and might be influenced by events that have nothing to do with space exploration. It is therefore crucial to reevaluate assumptions and strategies described in this SSAF on a regular basis, and to communicate the results of this reevaluation process in a timely manner in order to build trust and preserve the sovereignty of information. A robust quality control program (see Section 4.1) is a fundamental prerequisite to achieve this aim.
The following is a summary that captures the major elements of the COSPAR Sample Safety Assessment Framework (SSAF):
The Sample Safety Assessment Framework (SSAF)
Safety
Approach
(a) Organic molecules are defined as a group of covalently bonded molecules that contain carbon and at least one more element.
(b) Macromolecules are defined as organic compounds greater than 2500 Daltons.
6. The investigations that are part of the SSAF must be able to detect evidence of self-replicating biological entities (e.g., cell-like), biological entities that are replicated by other life (e.g., virus-like), and biologically active molecules (e.g., prion-like, gene transfer agent (GTA)-like molecules).
(a) The SSAF must include two or more orthogonal agnostic life detection investigations, with amplification steps.
(b) Investigations that lead to safety-critical decisions must be carried out by two independent teams, after which decisions are made by a third independent decision-making group.
(c) The conduct of tests that are part of the safety assessment must comply with ISO 17025, or equivalent quality standards, and apply GLP and HACCP methods to demonstrate the required competence and quality control.
7. The test sequence, using a stepwise approach from more chemistry-based investigations (e.g., organic molecules, molecular patterns and macromolecules) to more biologically based investigations (e.g., life as we know it, life as we don't know it), must cover both common and unique features of the samples.
8. The level of assurance needed to declare a sample safe must be specified by the appropriate regulatory authority and incorporated into the SSAF.
(a) If evidence of extinct or extant martian life is detected, a Hold and Critical Review (HCR) must be established to evaluate the relevant data and the risk management measures before deciding on the next steps.
(b) No samples can be released from containment during the HCR and a procedure must be developed for samples already released from containment.
(c) The absence of detecting organic compounds, molecular patterns, and macromolecules is not sufficient to declare a sample safe (i.e., devoid of martian life).
(d) The positive hypothesis (i.e., there is martian life in the samples) can be rejected if there is no evidence for the presence of martian life in the samples and there are no open, uncertain or ambiguous issues remaining that could associate sample characteristics to martian biology. In such cases, the tested sample of a sample tube would be safe within the predefined level of assurance.
9. The sample safety assessment is not a one-time exercise but rather a dynamic process that must respond to the results of various investigations. It must be updated if a subsequent investigation on any of the martian samples invalidates the original sample safety assessment or any of the assumptions used.
Execution
10. Every sample tube is considered a separate sample.
11. Bayesian statistics together with the subsampling strategy and the sensitivity and specificity of the overall test sequence, allow estimating the number of subsamples necessary to reach a predefined level of assurance that a sample tube is safe.
12. The sample with the highest pre-test probability to contain martian life provides the best and most economic (time and material) starting point for executing the sample safety assessment.
13. A targeted subsampling strategy must be used to optimize the number of subsamples from one sample tube that need to be tested with the safety assessment. Three elements are required to develop such a strategy:
(a) Information about the 3-dimensional morphological characteristics of the external and internal structures of each sample at a micrometer-level spatial resolution, while still in the sealed sample tube, is the required basis for planning and executing the sample safety assessment in general, and the subsampling strategy in particular.
(b) Information about the chemistry and mineralogy associated with the 3-D structure to refine the targeted subsampling strategy.
(c) An analogue test program to correlate the specific martian sample information to a relevant terrestrial sample knowledge base.
14. Depending on the type of fine-grained minerals, targeted subsampling (e.g., for localized fine-grained alteration products or localized features, such as fractures in lithified fine-grained rocks) and random subsampling (for unconsolidated fine-grained sediments) are appropriate approaches.
15. Random sampling can be applied to dust samples though it is unlikely that any serendipitous dust would be of a sufficient quantity that it can be declared safe based on a safety assessment, except for a dedicated dust sample.
Development needs
16. An analogue test program is necessary to:
(a) Assess and improve the capture rate and the associated subsampling strategy.
(b) Optimize the selection of the instrument ensemble to be used for the test sequence and estimate the overall sensitivity and specificity of the test sequence, including efficiency to extract evidence of life from host materials.
(c) Exercise all elements of the sample safety assessment, including blind testing, to optimize processes, equipment and infrastructure, train personnel and science teams, and build confidence.
17. Contamination Knowledge (CK) covering all flight (Mars 2020, MSR program) and ground (SRF) elements is critically important to reduce uncertainty in the interpretation of the data and, as a consequence, avoid unnecessary rigor in handling and treating the samples.
18. Use of machine learning to support the sample safety assessment, in particular for data reduction and pattern recognition in large and diverse datasets, has the potential to improve the quality of the results and accelerate the process.
19. Once the MSR science investigations are selected, the appropriate regulatory authorities are in place, and any open development needs with respect to the overall sample safety assessment are addressed, this SSAF must be critically reviewed by the relevant stakeholders. The latest applicable version of the SSAF would be the basis for developing a detailed Sample Safety Assessment Protocol (SSAP).
20. A transparent risk communication throughout the development and execution of the SSAF and subsequent SSAP is essential to preserve the sovereignty of information.
There are a number of activities that need to feed into the SSAF and some consequences of the SSAF that would need to be reflected in future sample science plans (see Table 4).
The most important near-term Research and Development (R&D) activities to enable the preparation and execution of the SSAF are:
1. Establishing an analogue test program to inform and improve the capture rate, extraction efficiency, sensitivity and specificity of the overall test sequence, and exercise the entire sample safety assessment before it is used on samples returned from Mars.
2. Maturing agnostic life detection techniques.
Targeted investments in developing tailored machine learning capabilities to support the data reduction and data cross-correlations are considered beneficial to optimize the time economy once the samples start to enter the curation and science analysis stage. The development of such machine learning tools would need to be integrated in the analogue test program.
The only impact identified for the Mars 2020 mission and the MSR program is the need to provide Contamination Knowledge (CK) from all relevant mission phases (ground and flight) and mission elements with a potential to introduce terrestrial contamination to the samples during nominal and off-nominal events. This is considered critical for the interpretation of the data used for the safety assessment and any subsequent decisions. This CK is directly linked to the achievable specificity of the test sequence and to the ability to resolve events that would lead to a Hold & Critical Review. Therefore, the CK is an important element and driver of the sample analysis schedule, and insufficient CK could lead to unnecessary rigor in handling and treating the samples.
To optimize the use of precious martian samples and remain aligned with the stated goal to use the scientific investigations of competitively selected science teams to inform the sample safety assessment, a number of elements need to be considered for planning the future selection of science teams to cover the objective-driven science for MSR: (a) the investigations described in the test sequence; (b) the need for independent analyses for certain investigations (i.e., more than one science team working on certain investigations); and (c) optimizing the overall sensitivity and specificity of the test sequence (i.e., consideration of using a complementary instrument ensemble with known sensitivities and specificities).
If all elements cannot be satisfied in the course of the science team selection, then directed investigations to fill in the gaps would need to be considered by the MSR Campaign Partners.
For all practical purposes, the sample properties that need to be measured to inform the SSAF fall under the sterilization-sensitive and time-critical categories as defined by MSPG2 (Velbel et al., 2021;Tosca et al., 2021). This means that most of these investigations would need to be conducted within biological containment, i.e., a Sample Receiving Facility (SRF).
The Sample Safety Assessment Framework (SSAF) has been established with sufficient detail to allow for proper planning for a Sample Receiving Facility (SRF) and preparations for the scientific analysis of the samples. At the same time, the SSAF avoids being overly prescriptive. The SSAF uses an iterative approach to risk, combining multiple types of data and analyses, to derive an evidence-based safety assessment. As long as martian life is based on carbon chemistry, the SSAF and the subsequent SSAP would be able to identify it. The one parameter that must be set by the appropriate regulatory authorities is the level of assurance to exclude the presence of martian life. This would be the stopping threshold, i.e., level of confidence in the statement "there is no martian life in the sample." Setting such a level is important to avoid open-ended discussions and to better estimate the efforts and resources necessary to conduct the sample safety assessment.
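To illustrate how a predefined level of assurance can act as a stopping threshold, the Python sketch below applies Bayes' rule to a run of negative subsample results and returns the smallest number of subsamples needed to reach that level. The model is a simplifying assumption and not the SSAF's actual calculation: subsample results are treated as independent, a subsample from a tube containing life captures it with probability `capture_rate`, and the overall test sequence has a single sensitivity and specificity. All function names, parameter names, and input values are hypothetical.

```python
def posterior_life_after_negatives(prior, capture_rate, sensitivity, specificity, n):
    """Posterior probability that a tube contains life after n negative subsamples."""
    # P(negative | life in tube): life captured but missed by the tests,
    # or life not captured and the subsample correctly tested negative.
    p_neg_life = capture_rate * (1.0 - sensitivity) + (1.0 - capture_rate) * specificity
    # P(negative | no life in tube).
    p_neg_no_life = specificity
    numerator = prior * p_neg_life ** n
    denominator = numerator + (1.0 - prior) * p_neg_no_life ** n
    return numerator / denominator


def subsamples_needed(prior, capture_rate, sensitivity, specificity, assurance, max_n=10_000):
    """Smallest n with P(no life | n negatives) >= assurance, or None if not reached."""
    for n in range(max_n + 1):
        posterior = posterior_life_after_negatives(
            prior, capture_rate, sensitivity, specificity, n
        )
        if 1.0 - posterior >= assurance:
            return n
    return None


# Hypothetical inputs; the assurance level would be set by the regulatory authority.
n = subsamples_needed(
    prior=0.01, capture_rate=0.5, sensitivity=0.9, specificity=0.99, assurance=0.999
)
print("subsamples needed:", n)
```

The sketch shows why the assurance level, the capture rate, and the overall sensitivity and specificity together drive the sample consumption and therefore the resources needed.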
Once the MSR science investigations are selected, the appropriate regulatory authorities are in place, and any open development needs with respect to the overall safety assessment are addressed, this SSAF must be critically reviewed by the relevant stakeholders. The resulting updated version of the SSAF would be the basis for developing a detailed Sample Safety Assessment Protocol (SSAP). COSPAR would provide an appropriate international forum to review the SSAF and develop the SSAP.
The SSAF is developed specifically for assessing samples from Mars in the context of the currently planned NASA-ESA MSR Campaign (Meyer et al., 2021), though it can be used for any Mars Sample Return mission concept with only minor tailoring. This minor tailoring would be required for the following aspects of the SSAF: (a) representing the specificity of sample type, acquisition, and packaging, reflected in point 10 of the SSAF; and (b) representing the necessary CK of the applicable flight and ground elements, reflected in point 17 of the SSAF.
In addition, the SSAF is considered a sound basis for other COSPAR Planetary Protection Category V, restricted Earth return, mission concepts beyond Mars.