Defining the target population to make marine image-based biological data FAIR



Introduction
The way we observe and monitor the environment, and by extension conduct biological studies, is changing radically: the rise of remote sensing technologies is increasing the spatial and temporal scales of observation, and the quantity and resolution of biological data collected, with increased repeatability (Groom et al., 2006). These technological changes are driving and supporting a shift in the types of biological questions being posed (Hampton et al., 2013), towards examining long-term trends and broad-scale monitoring to address globally-relevant issues, such as decades-long ecosystem health (De'ath et al., 2012), the impacts of and organism responses to climate change (Zellweger et al., 2019) and macrosystems ecology (Soranno and Schimel, 2014; Waldrop, 2008). The opportunity to examine the state of the environment at large scales requires data to be combined and reused, such as between locations or across time points, with comparability being a key challenge (Soranno et al., 2015). Ecological informatics has developed alongside technologies to facilitate this data reuse, sharing and data-intensive science (Michener, 2006, 2015; Michener and Jones, 2012; Rüegg et al., 2014), including establishing FAIR (Findable, Accessible, Interoperable, Reusable) data principles (Wilkinson et al., 2016). We need to continue the development of ecological informatics, with new metadata tools tailored to remotely-sensed data and designed for integration with established systems.
The technological shift has been particularly transformative of biology in the marine realm, where remote sensing is opening the observation of new habitats, and at new spatial and temporal scales. Marine photography has become a popular remote sensing tool for studying marine biota and their habitats in situ, with the number of studies using this tool increasing rapidly (Durden et al., 2016c), and is an important method for monitoring change to ecosystems to inform environmental management and conservation decisions (Brown et al., 2004; Hill and Wilkinson, 2004). Both video and still images are used to study organisms in the water column and on the seabed, for both quantitative and qualitative assessments. Biological data collected from imagery include counts, identifications, and areal coverage, which may be used in spatial studies (Benoist et al., 2019b; Marshall and Pierce, 2012; Staudigel et al., 2006; Williams et al., 2019), analyses of distribution (Mitchell et al., 2020), and detection of habitats (Purser et al., 2009). Temporal observations include similar count or areal coverage data (Bett et al., 2001), or focus on aspects such as behaviour in time-lapse photography (Bett and Rice, 1993; Durden et al., 2015; Kahn et al., 2020; Lampitt and Paterson, 1987; Smith et al., 2005) or video (Priede et al., 2006; Thomas et al., 2017), specimen growth (Gooday et al., 1993), quantification of function (Durden et al., 2019; Smith et al., 1997), or interaction of organisms with their habitat (Rhoads and Cande, 1971). Video may enable near real time observation (Aguzzi et al., 2020). Photography may also be used in estimating sizes and biomass of organisms (Benoist et al., 2019a; Boutros et al., 2015; Dunlop et al., 2015; Durden et al., 2016a; Letessier et al., 2015), and in three-dimensional reconstructions (Johnson-Roberson et al., 2010; Price et al., 2021). Such variety in study scope is challenging the way we combine and reuse imagery, and the resulting extracted biological data. Aggregated imagery-derived data are key to large interdisciplinary studies, such as ocean observing networks (Levin et al., 2019), and in producing essential metrics for monitoring changes to global biodiversity and ecosystems (Miloslavich et al., 2018; Muller-Karger et al., 2018); many of these data are derived using Artificial Intelligence (Christin et al., 2019; Høye et al., 2021). These efforts require the aggregation of multiple sets of captured imagery and derived data, where knowledge of the biological scope is necessary for data integration and reuse; proceeding without an understanding of the biological assumptions of survey design can introduce bias to subsequent analyses and ecological conclusions (Foster et al., 2021). Recording the biological scope of imagery-based ecological studies is also critical to monitoring of anthropogenic change mandated by regulators at national (Noble-James et al., 2018), regional (Modica et al., 2016), and international (International Seabed Authority, 2011) scales, because it can be used to ensure the comparability of monitoring data needed to detect true biological change.
A key basis for scoping a biological study is the definition of the target population. We use 'target population' to refer to the defined pool of organisms within which observations will be made or from which samples will be selected. It is derived from the objectives of the study (Eberhardt and Thomas, 1991; Jeffers, 1979) and is defined in biological or ecological terms, where it may refer to an individual, population (of a single species), or a community (of multiple species). A target population is defined (explicitly or implicitly) whether the study is explorative, qualitative, or quantitative; even for serendipitous observations, the target population is likely constrained (if not often considered). When explicitly defined up front, the target population is valuable in survey design and planning (Jeffers, 1979; Krebs, 2014), and in quantitative work is the basis from which statistical inference is derived, where it may also be known as the 'statistical population' (Mathai and Rathie, 1977). Defining the target population from the outset assists in ensuring that data collection and analysis will be practically effective and efficient (Underwood and Chapman, 2013). Knowledge of the target population alongside the biological dataset is also vital to data analysis and interpretation, both during the study and by those accessing the results. The definition of the target population is also critical in facilitating the sharing and maximal reuse of collected data, to systematic reviews, and to reassessment for meta-analyses derived from compilations of studies, making it the ideal tool for ensuring interoperability and reuse of remotely-sensed biological datasets.
Defining the target population for marine imagery-based studies requires additional considerations over traditional terrestrial or quadrat methods, because imaging imposes some unique characteristics on the data (Durden et al., 2016c). Marine imagery often varies widely in the area or volume of habitat studied because of factors affecting the available light (e.g., water column clarity, changes in the position and distance of the camera to the subject), which impacts many aspects of the scope, including the target population. The marine imagery as captured provides a snapshot in time, facilitating multiple extractions or derivations of biological data, somewhat akin to 'freezing' the status of a biological sample or quadrat. The potential for reuse or resampling of imagery involving different target populations, and for refinement and/or iteration of the target population at different stages of processing and analysis of the imagery, points to the need for a tool to assist in defining the target population, and communicating and archiving this definition as part of the metadata.
Efforts to make marine imagery data FAIR are beginning, spurred on by data publication requirements of research funders and academic journals. Construction of infrastructure (both physical and conceptual) for archiving, maintaining, and sharing marine imagery and associated metadata is still in development. Repositories accepting marine imagery include those specialising in marine or environmental data (e.g., UK Marine Environmental Data and Information Network, medin.org.uk; Sea scientific open data publication, seanoe.org; PANGAEA, pangaea.de; Knowledge Network for Biocomplexity, knb.ecoinformatics.org), those specific to biological information on particular habitats or taxonomic groups (e.g., NOAA Deep-Sea Coral and Sponge Map Portal, ncei.noaa.gov/maps/deep-sea-corals/mapSites.htm), and those being developed by regulators of marine industries to hold and make available imagery from environmental monitoring efforts, such as for deep-sea mining by the International Seabed Authority (data.isa.org.jm/isa/map/). These repositories typically archive raw and/or processed photographs and video, capture information, image-derived data, and data collected synoptically with other sensors, which are archived at the dataset, image file, and/or biological specimen level, but are inconsistent in their requirements for structured metadata that would align with FAIR principles (Wilkinson et al., 2016). Imagery-derived data are also collected in repositories of biological data; for example, the Ocean Biodiversity Information System repository accepts information about biological specimens (Horton et al., 2021), while data on specimens are also incorporated into taxonomic and/or morphotype catalogues, with a potential framework for such use presented as SMarTaR-ID (Howell et al., 2019). These repositories may link to the archived source imagery data and require structured metadata; however, neither the imagery nor biological data repositories require or include the option to archive the target population definition, and the standard vocabularies do not include sufficient appropriate terms. The recent image FAIR Data Objects protocol (iFDO) (Schoening et al., 2022) provides metadata standards specific to the archiving of marine imagery, and is now being adopted by some imagery repositories. It includes standards for metadata relating to technical aspects of image capture (Schoening et al., 2018; Schoening et al., 2022) and some aspects of biological data capture, but lacks metadata important to the target population definition.
We present a set of properties as a tool for defining and archiving the target population applicable to all types of marine imaging-based biological studies. This group of attributes is designed to document the metadata that describe the biological and methodological aims and constraints that combine to define the target population, and could be implemented as an extension of the iFDO standard. We discuss the proposed metadata terms in the context of the developing best practices for marine imaging. We also discuss the value of defining the target population using this set of attributes for data reuse, bias reduction and biological interpretation of outcomes. Finally, we suggest best practices for defining the target population in biological marine imaging studies.

Contents
The proposed group of attributes for defining the target population is generalised for all types of in-situ marine biological imaging, which is presented in Table 1 with metadata fields grouped into themes that the

Table 1
Set of attributes for target population definition, incorporating some existing iFDO fields and proposing new ones (indicated by *). Mappings to DarwinCore terms ('MachineObservation' data type) and Ecological Metadata Language terms are indicated. Multiple fields map to the DarwinCore 'samplingProtocol' term, but may alternatively be stored using the ExtendedMeasurementOrFact approach; likewise, in Ecological Metadata Language many fields map to 'sampling'.

The first component is the definition of the biological aims and objectives (Field name: image-objective). The target population must be explicitly constrained in terms of space (image-spatial-constraints, image-set-lat-min, image-set-lat-max, image-set-long-min, image-set-long-max) and time (image-temporal-constraints, image-set-start-datetime, image-set-end-datetime, image-interval), in relation to the objectives set. It should be defined in biological terms, for example, to include or exclude certain taxa, functional groups, size classes of organisms, sex, or developmental stage of organisms (image-target-organisms, image-targets-WoRMS). The types of biological observations to be made about these organisms should also be defined (image-quantitative-or-qualitative, image-quantification-type, image-biological-metrics, image-biological-stats). Environmental and categorical constraints may also be imposed, such as limitations to certain habitats, seasons, time of day or behaviours. So far, these are common to definitions of the target population using other sampling methods.
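As a minimal sketch of how the spatial and temporal constraint fields could be used, the following Python snippet checks whether a single image falls inside the intended scope of an image set. The field names are taken from the text; the coordinate and date values, the `within_scope` helper, and the choice of a plain dictionary are assumptions made for illustration and are not part of the proposed standard.

```python
from datetime import datetime, timezone

# Hypothetical image-set-level constraints: field names follow the text,
# values are invented for illustration only.
target_population = {
    "image-set-lat-min": 48.8,
    "image-set-lat-max": 49.1,
    "image-set-long-min": -16.7,
    "image-set-long-max": -16.3,
    "image-set-start-datetime": datetime(2012, 7, 1, tzinfo=timezone.utc),
    "image-set-end-datetime": datetime(2012, 7, 31, tzinfo=timezone.utc),
}


def within_scope(image, scope):
    """Return True if an image's position and time lie inside the intended scope.

    Assumes the bounding box does not cross the antimeridian.
    """
    return (
        scope["image-set-lat-min"] <= image["image-latitude"] <= scope["image-set-lat-max"]
        and scope["image-set-long-min"] <= image["image-longitude"] <= scope["image-set-long-max"]
        and scope["image-set-start-datetime"] <= image["image-datetime"] <= scope["image-set-end-datetime"]
    )


# Hypothetical image-level capture data.
image = {
    "image-latitude": 48.95,
    "image-longitude": -16.5,
    "image-datetime": datetime(2012, 7, 10, 3, 0, tzinfo=timezone.utc),
}

print(within_scope(image, target_population))
```

A check of this kind only compares capture parameters with the stated scope; the biological and categorical constraints still need to be read alongside it.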
The definition of the target population also includes metadata terms for the marine imaging methodology, as the imaging approach used to investigate the target population impacts both the resulting imagery and derived biological data. In turn, this impacts ecological comparability, interoperability and reuse to a greater degree than physical sampling methods, for example, by resulting in the variable detectability of organisms. The definition of the target population includes terms for the type of imagery (image-acquisition, image-situ, image-camera-number, image-spectral-resolution, image-FOV-description, image-bait), constraints imposed by the imaging platform or camera (image-deployment, image-camera-orientation, image-platform, image-illumination, image-scale-reference, image-practical-constraints), image-level capture data (image-datetime, image-longitude, image-latitude, image-depth, image-meters-above-ground, image-acquisition-settings, image-pixel-per-millimeter, image-area-square-meter), and image and derived data curation (image-curation-protocol, image-annotation-QAQC). Finally, the target population definition should include information about any ancillary data captured (image-other-data). Metadata fields for documentation (image-documentation-capture, image-documentation-processing, image-documentation-versions, image-documentation-biological-data, image-documentation-publications) are also included to facilitate connections between multiple image sets (see 'Defining the target population at different phases of an imaging project', below).
The fields were designed to include machine-readable content and restricted values where possible to facilitate searchability and classification of images, and they significantly increase the number of machine-readable fields storing such metadata compared with other data standards (see below). Many aspects of defining the target population require rationale to be documented; this includes necessary textual interpretation of constraints and choices, which are more difficult to translate to machine readability. While the inclusion of text fields is not strictly FAIR, the set of attributes is an important step towards documentation and structured metadata for marine imaging in biological studies.
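The benefit of restricted values can be illustrated with a short Python sketch that flags fields whose content falls outside a controlled list, while leaving free-text fields alone. The allowed values shown here are assumptions for illustration (the text does not list them), as are the `RESTRICTED_VALUES` table and the `invalid_fields` helper.

```python
# Assumed controlled vocabularies, for illustration only; the permitted
# values for each field are not specified in the text.
RESTRICTED_VALUES = {
    "image-quantitative-or-qualitative": {"quantitative", "qualitative"},
    "image-acquisition": {"photo", "video", "slide"},
}


def invalid_fields(record):
    """Return the names of restricted fields whose value is not permitted.

    Fields without a controlled vocabulary (e.g. free-text rationale)
    are not checked.
    """
    return sorted(
        field
        for field, allowed in RESTRICTED_VALUES.items()
        if field in record and record[field] not in allowed
    )


# Hypothetical record mixing restricted and free-text fields.
record = {
    "image-quantitative-or-qualitative": "quantitative",
    "image-acquisition": "timelapse",
    "image-objective": "Estimate megafaunal density on an abyssal plain",
}

print(invalid_fields(record))
```

Free-text fields such as image-objective pass through unchecked, which mirrors the distinction drawn above between machine-readable fields and documented rationale.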

Extension of the iFDO standard
The group of attributes for defining the target population has been designed to allow it to be implemented as an extension to the iFDO standard. The iFDO standard includes three sections of terms: core, capture and content. Only the first of these is required. Many of the terms in the core and capture categories are included in this group of attributes for defining the target population, particularly those related to scoping, limitations, and capture methodology at the image set level (rather than for each image). While some of these terms are optional in the iFDO standard, they are required for the definition of the target population.

Mapping of terms to biodiversity data standards and environmental monitoring site catalogue
DarwinCore (Wieczorek et al., 2012) and the Ecological Metadata Language (Fegraus et al., 2005; Jones et al., 2019) are standards for biodiversity information and are used by biodiversity databases that archive biological information from images (and other data types, such as specimens), such as the Ocean Biodiversity Information System. The fields involved in defining the target population have been mapped to terms in these two standards (Table 1) to facilitate interoperability between database systems. Note that multiple fields in the target population definition map to single fields in each of DarwinCore and the Ecological Metadata Language. The fields in these two data standards are descriptive text, which do not have the FAIR features that defined fields provide. The Ocean Biodiversity Information System also incorporates the ExtendedMeasurementOrFact Extension (De Pooter et al., 2017), which may be useful in documenting some of the fields in the target population definition, but its use for documenting intended scope rather than only the resulting measurements is unclear. Mapping to these terms in the DarwinCore and Ecological Metadata Language standards facilitates the potential inclusion of marine imagery datasets in ecological community survey data harmonisation or integration designs, such as ecocomDP (O'Brien et al., 2021).
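The many-to-one mapping described above can be sketched as follows: several structured target population fields are collapsed into a single descriptive DarwinCore 'samplingProtocol' string. Which fields map to 'samplingProtocol' is given in Table 1 and is not reproduced here, so the selection in `SAMPLING_PROTOCOL_FIELDS`, the `to_darwin_core` helper, and the example values are assumptions for illustration.

```python
# Assumed subset of fields mapped to 'samplingProtocol'; the actual
# mapping is defined in Table 1.
SAMPLING_PROTOCOL_FIELDS = (
    "image-acquisition",
    "image-platform",
    "image-illumination",
)


def to_darwin_core(record):
    """Collapse several structured fields into one free-text DarwinCore term."""
    parts = [
        f"{field}={record[field]}"
        for field in SAMPLING_PROTOCOL_FIELDS
        if field in record
    ]
    return {
        "basisOfRecord": "MachineObservation",
        "samplingProtocol": "; ".join(parts),
    }


# Hypothetical target population fragment.
record = {
    "image-acquisition": "photo",
    "image-platform": "AUV",
    "image-illumination": "artificial light",
}

print(to_darwin_core(record)["samplingProtocol"])
```

The structured fields become a single text string on export, which illustrates why the descriptive fields in these standards are less machine-readable than the defined fields proposed here.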
Environmental monitoring and research sites where marine imagery is captured may be catalogued in The Dynamic Ecological Information Management System - Site and Dataset Registry (Wohner et al., 2022). Attributes that describe the biological aim and objective (image-objective) and the spatial constraints on scope (image-spatial-constraints, image-set-lat-min, image-set-lat-max, image-set-long-min, image-set-long-max) map to attributes in this catalogue.

Defining the target population at different phases of an imaging project
The target population may be constrained differently in each phase of a marine imaging project. Recording the definition of the target population at each project stage that produces a biological or imaging output is important to communicate the aims and results and guide their interpretation, and to document the provenance of image-derived data, a key component in making these data FAIR (Wilkinson et al., 2016). Such phases may be viewed as parts of the ecological / biodiversity data life cycle (Gadelha et al., 2020; Michener and Jones, 2012; Rüegg et al., 2014), in which data documentation is considered a critical element. The stages of a typical marine imaging project along with the refining of the target population definition and appropriate outputs are shown in Fig. 1.
The initial definition is based on the project plans, survey design and acquisition parameters. From this initial definition, the target population definition may be altered by practicalities of image capture and/or image processing, such as fieldwork conditions (e.g., visibility, sea state, current) and equipment function. The definition at image capture should be archived with the original (raw / as captured) imagery. Image capture can be thought of as similar to 'freezing' the status of a sample or quadrat, with many (often manual and non-standardised) steps involved in processing and curating images, and the detection and identification of organisms in the images (a process known as 'annotation') to derive biological data. These post-capture steps involve decision-making that influences the derived data, such as constraints related to the processes of data extraction ('annotation') or statistical analysis to address practicalities or ensure statistical assumptions are met based on the resulting data. The additional constraints imposed are often reductions to the list of species of interest (e.g., removing those not visible or in insufficient numbers for statistical comparison), or constraining the number of images or volume of video (e.g., removing those captured in adverse conditions or to reduce the time required for data extraction), but may also include constraints from image processing (e.g., removing overlapping images or cropping or merging images). The definition used to constrain biological data extraction from the imagery should be archived along with the refined imagery (subsetted / processed) used for biological data extraction, along with links to the resulting biological data and any resulting publications or reporting of analyses. Thus, a single imaging event is likely to result in multiple image sets, each with a unique target population definition.
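The refinement across project phases can be sketched in Python as a chain of definitions, each derived from the previous one without overwriting it. The `refine_definition` helper and the bookkeeping keys `phase` and `derived-from` are hypothetical names introduced only for this illustration; in the proposed set of attributes, links between image sets would be carried by the image-documentation fields. The taxa and identifiers are invented.

```python
import copy


def refine_definition(parent, phase, changes, parent_id):
    """Return a new definition for a later phase, leaving the parent intact.

    'phase' and 'derived-from' are hypothetical bookkeeping keys used here
    only to illustrate provenance between definitions.
    """
    child = copy.deepcopy(parent)
    child.update(changes)
    child["phase"] = phase
    child["derived-from"] = parent_id
    return child


# Definition archived with the raw imagery (values invented for illustration).
capture_definition = {
    "phase": "image capture",
    "image-target-organisms": ["Holothuroidea", "Ophiuroidea", "Actiniaria"],
    "image-practical-constraints": "Images taken above 4 m altitude excluded",
}

# Definition used for annotation: one taxon dropped because too few
# individuals were visible for statistical comparison.
annotation_definition = refine_definition(
    capture_definition,
    phase="annotation",
    changes={"image-target-organisms": ["Holothuroidea", "Ophiuroidea"]},
    parent_id="raw-image-set-001",  # hypothetical identifier
)

print(annotation_definition["image-target-organisms"])
print(capture_definition["image-target-organisms"])
```

Keeping each definition as a separate record, rather than editing one in place, reflects the point that a single imaging event yields multiple image sets with distinct target populations.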

The value of defining the target population of an imaging project
The definition of the target population has value in FAIR image and image-derived data management by increasing the operational utility of the data and the resulting research outputs. Many biologists see image and image-derived data reuse as important to conducting large-scale and long-term studies of the ocean (Durden et al., 2017a), but such reuse is precluded by the data not being findable, accessible and interoperable, and a lack of understanding of potential constraints and biases in the data. This set of attributes represents a first step in documenting the vital biological context to enable FAIRness and ensure appropriate reuse.
This value is illustrated with a few examples:
1) The increased use of marine imaging presents a potential discontinuity in sampling method in long-term ecosystem monitoring. Defining the target population facilitates harmonisation of data over long time periods including across previous sampling methods (Michener, 2006).
2) Analyses of bias and variance in method and robust survey design are facilitated by access to sufficient survey data with metadata on the survey designs and methods used (e.g., Curtis et al., 2024; Foster et al., 2021). Archived marine imagery datasets with the target population defined enable such analyses.
3) Environmental monitoring metrics, such as the Essential Ocean Variables and Essential Biodiversity Variables (Kissling et al., 2018; Muller-Karger et al., 2018), may be derived from imagery and contributed to repositories of ocean observing data and global data products. Archiving the defined target population with the imagery would provide important context, such as in the interpretation and confidence in taxonomic identifications, and the interoperability of data capture across observatories.
4) Artificial intelligence is increasingly being trialled and deployed to detect and identify/classify organisms in marine imagery. Images (and annotations) reused as training data for the development of artificial intelligence tools could introduce bias into the identifications, as performance is related to biological community structure (Durden et al., 2021; Orenstein et al., 2020), so the definition of the target population used in image capture could be used to identify and potentially mitigate against such biases.
et al., 2023; Tzachor et al., 2023). Such interoperability facilitates connection to other observational data and models, and is used in scenario testing. However, intelligent combining of imagery data requires knowledge of the target population, which is provided digitally by the set of attributes.
The definition of the target population is also valuable in the verification and interpretation of biological studies, as "population" has both biological and statistical meanings. The set of attributes will aid the explicit communication of the target population to a wide variety of marine image study users, including researchers and those outside the academic sphere, such as environmental consultants, stakeholders, policymakers and government officials. It also facilitates replication of biological studies to verify conclusions drawn, a feature important to environmental management. Use of the set of attributes to define the target population will contribute to efforts to be transparent in the dissemination of environmental monitoring information to the public, for example as required for deep-sea mining (Durden et al., 2017b; Lodge et al., 2014).

Implications for ecological informatics
The definition of the target population for marine imaging studies contributes to the wider development of ecological metadata and informatics. It brings relevance to more ecological data by imposing some standardisation on this heterogeneous data type and connecting it to DarwinCore and the Ecological Metadata Language. It addresses a gap in ecological informatics and is the latest development in ecological metadata required by technology-driven changes to survey equipment and methodology. It bridges disciplines, connecting informatics related to methods to ecological and biodiversity informatics for this type of sample-based or observational data. It could also improve ecological informatics for remotely-sensed data more generally, as it could be applied to similar applications, such as drone footage (Petso et al., 2021), thermal imaging (Pagacz and Witczuk, 2023) and satellite/aerial photography (Guirado et al., 2019).

Defining the target population as a best practice for marine imaging studies
Increased documentation of best practices also improves the interoperability of marine imaging data, and the set of attributes is an important addition to best practices for marine imaging studies. Existing technical best practices for marine imaging have focused on survey design (Foster et al., 2014; Lim et al., 2018; Perkins et al., 2019; Sward et al., 2019) and practices for extracting data about organisms from images (Durden et al., 2016a; Durden et al., 2016b; Roberts et al., 2016; Schoening et al., 2016). Best practices for the standardisation of organism identification in marine photography employ image-based species guides (Amon et al., 2017; Howell et al., 2019; Jacobsen Stout et al., 2015), naming of organisms (e.g. through morphotype-based naming systems (Althaus et al., 2015), or through the use of the World Register of Marine Species (Horton et al., 2022)), and developing nomenclature (Sigovini et al., 2016). Many of these existing approaches are improved with clear definition of target populations.

Best practices for defining the target population
Defining the target population confers many benefits to those involved in a biological study using marine imaging, and to future users of such data and results. We propose the following best practices for defining the target population, using the set of attributes:
1) Begin the target population definition during study design and refine during the imaging project.
2) Include the target population definition with archived images and image-derived data, using the set of attributes, with every biological or imaging output of a study.
3) Consult the definition of the target population before embarking on image or image-derived data reuse.
4) Environmental management and regulatory agencies should adopt the use of the set of attributes with the submission of monitoring data and results/reports.
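Consulting the target population before reuse can be partly automated. The sketch below compares two archived definitions and reports which target taxa they share and whether their spatial bounding boxes intersect. The `shared_scope` helper, the example datasets and their values are assumptions made for illustration; a real comparison would also need to consider temporal, methodological and categorical constraints.

```python
def shared_scope(a, b):
    """Summarise the overlap between two target population definitions.

    Returns the taxa targeted by both definitions and whether their
    latitude/longitude bounding boxes intersect (boxes are assumed not to
    cross the antimeridian).
    """
    taxa = sorted(set(a["image-target-organisms"]) & set(b["image-target-organisms"]))
    lat_overlap = (
        a["image-set-lat-min"] <= b["image-set-lat-max"]
        and b["image-set-lat-min"] <= a["image-set-lat-max"]
    )
    long_overlap = (
        a["image-set-long-min"] <= b["image-set-long-max"]
        and b["image-set-long-min"] <= a["image-set-long-max"]
    )
    return {"taxa": taxa, "spatial_overlap": lat_overlap and long_overlap}


# Two hypothetical archived definitions (values invented for illustration).
survey_a = {
    "image-target-organisms": ["Holothuroidea", "Ophiuroidea"],
    "image-set-lat-min": 48.8, "image-set-lat-max": 49.1,
    "image-set-long-min": -16.7, "image-set-long-max": -16.3,
}
survey_b = {
    "image-target-organisms": ["Ophiuroidea", "Porifera"],
    "image-set-lat-min": 49.0, "image-set-lat-max": 49.5,
    "image-set-long-min": -16.5, "image-set-long-max": -16.0,
}

print(shared_scope(survey_a, survey_b))
```

A taxon absent from the shared list was outside the intended scope of at least one study, so its absence from that study's data should not be read as absence from the habitat.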

Methods
Development of the set of attributes for defining the target population was an interdisciplinary effort. The set of attributes was created by adapting established prompts for defining the sample population for studies based on physical samples (Eberhardt and Thomas, 1991; Jeffers, 1979) for use with marine imagery studies. For this adaptation, the authors drew on their considerable experience in designing and conducting imagery studies in varied marine ecological settings using many different marine imagery systems (Durden et al., 2016c), contributions to long-term and multidisciplinary ocean monitoring (Best et al., 2016; Hartman et al., 2021; Levin et al., 2019), and expertise in image-derived data management (Schoening et al., 2009; Schoening et al., 2016; Schoening et al., 2022). The authors then tested the components iteratively on multiple current and published imagery-based studies to improve them from an ecological perspective. The attributes were then considered in relation to the established iFDO standard (Schoening et al., 2022), and additional attributes were specified from the perspective of environmental data managers.

Declaration of competing interest
None of the authors has a conflict of interest.

Fig. 1. Definition of the target population during the progression of a biological marine imaging project.

Table 1 (continued)
It is important for the definition of the target population that the constraints are described in terms of intended scope at the scale of the study, not simply stored as parameters obtained for a particular image upon capture, nor noting only what has been detected.