PhytoOracle: Scalable, modular phenomic data processing pipelines

Previous crop yield improvements have been largely due to the implementation of new management strategies, mechanization, and application of emerging technologies. While these approaches have led to stable, linear improvements, increases in crop yields are currently plateauing. The use and improvement of rapid, automated, and accurate phenomic selection methods leveraging high-resolution data collected throughout a growing season could help identify stress-adaptive traits to meet the growing global food demand. As the capacity of phenomics to generate larger and higher dimensional data sets improves, there is an urgent need to develop and implement robust and scalable data processing pipelines for rapid turnaround of processed results. Current phenomics processing pipelines lack modularity and the ability to exploit the distributed computational infrastructure required for machine learning (ML)-based workloads. To address these challenges, we developed PhytoOracle (PO), a suite of modular, scalable pipelines that aim to improve data processing efficiency for plant science research. PO integrates open-source frameworks for distributed task management on local, cloud, or high-performance computing (HPC) systems. Each pipeline component is available as a standalone container which can be independently deployed or linked into a pipeline. Additionally, researchers can swap between available containers or integrate new ones suited to their specific research. PO extracts phenotype trait values such as volume, height, canopy temperature, and maximum quantum efficiency (Fv/Fm) of photosystem II from data captured in field settings, enabling the study of phenotypic variation for elucidation of the genetic components of quantitative traits.


Computational Technologies
The phenotyping datasets of the future pose new processing, storage, and analysis bottlenecks, which can be addressed by leveraging both established and emerging technologies. Large-scale phenomic data must be processed in a reproducible and timely manner to provide actionable insights. To address these bottlenecks, PO leverages a variety of computational technologies and resources. For example, data management systems such as CyVerse's Data Store, a cloud-based data management system built on the Integrated Rule-Oriented Data System (iRODS), provide data storage and cross-platform access during data processing [1]. Container technologies, such as Docker and Singularity, provide stand-alone environments with required dependencies pre-installed. High-performance computing (HPC) systems provide powerful processors coupled with fast memory, disk storage, and networking to scale up processing tasks.
HPC systems coupled with container technology provide a reproducible, scalable environment [2]. Larger datasets require distributed frameworks that leverage thousands of computers to process data within reasonable timeframes. CCTools [3], a suite of computing tools for deploying scalable applications, includes Makeflow and Work Queue, which provide the workflow language and the computational resource management, respectively, required to scale tasks across local, cloud, or HPC computing environments. When coordinated, these computational resources can revolutionize the management and analysis of high-throughput phenotyping data, but significant software engineering is required.
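As a sketch of how Makeflow expresses such a workflow, the fragment below chains two containerized steps using Makeflow's Make-like rule syntax (each rule lists its outputs, its inputs, and the command that produces them). The container and file names here are hypothetical, not PO's actual components.

```makefile
# Hypothetical two-stage rule chain: detection, then trait extraction.
# Makeflow tracks the input/output files and dispatches each command
# to available workers (e.g., via Work Queue).
detections.json: plant_detect.sif raw_image.tif
	singularity run plant_detect.sif raw_image.tif > detections.json

traits.csv: trait_extract.sif detections.json raw_image.tif
	singularity run trait_extract.sif detections.json raw_image.tif > traits.csv
```

Because each rule declares its dependencies explicitly, Makeflow can run independent rules in parallel across whatever workers Work Queue provides.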

Supported Data
PO provides an orchestration framework for processing RGB, thermal, photosystem II chlorophyll fluorescence (PSII), and 3D point cloud data irrespective of phenotyping platform. PO was developed for processing the large amount of phenomic data collected by the University of Arizona's Lemnatec Field Scanalyzer (FS), but it can also process phenomic data collected with other phenotyping platforms (e.g., drones, carts, and mobile phones). This is possible due to the modular nature of PO, which allows for the removal, rearrangement, or standalone deployment of pipeline components. PO can scale computation over large datasets by leveraging many processing cores on local, cloud, or HPC clusters, allowing it to process among the largest phenomic datasets in the plant science research space. Comparisons of dataset counts and sizes between drone (DR) and FS platforms highlight the need for distributed computing pipelines; for example, RGB data collections over the same region resulted in raw datasets of 458-532 images.
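The modularity described above can be illustrated with a minimal Python sketch (hypothetical stage names, not PO's actual code): each pipeline component is an independent step over a shared record, so stages can be removed, reordered, or run on their own.

```python
# Toy model of a modular pipeline: every stage takes and returns a record
# dict, so any subset or ordering of stages forms a valid pipeline.

def geocorrect(record):
    # Placeholder geo-correction step: tag the record as corrected.
    record["geocorrected"] = True
    return record

def detect_plants(record):
    # Placeholder detection step: pretend two plants were found.
    record["detections"] = [{"plant_id": 1}, {"plant_id": 2}]
    return record

def run_pipeline(record, stages):
    # Stages run in order; swapping one implementation for another
    # only requires replacing an entry in the list.
    for stage in stages:
        record = stage(record)
    return record

result = run_pipeline({"image": "scan_0001.tif"}, [geocorrect, detect_plants])
print(len(result["detections"]))  # 2
```

Container-based deployment gives the same property at the infrastructure level: each stage is a standalone image with a well-defined input and output contract.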

Data Processing Pipelines
RGB and thermal pipelines leverage sensor-specific Faster R-CNN detection models for plant detection, allowing for the localization of individual lettuce plants representing many genotypes (Figure 1). Upon detection, individual plants are isolated and analyzed for canopy temperature in thermal images and bounding area in RGB images. The geographical coordinates of each plant are collected, allowing for localization and extraction of individual plants in co-registered point clouds.
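The per-plant extraction step above can be sketched as cropping a detected bounding box out of a thermal raster and summarizing it; this is a toy stand-in (pure Python over a small grid of temperatures), not PO's actual implementation.

```python
# Toy 4x6 thermal "image" (degrees C); the cooler interior pixels
# stand in for a plant canopy against warmer soil.
thermal = [
    [30.0, 30.0, 31.0, 31.0, 30.0, 30.0],
    [30.0, 24.0, 24.5, 25.0, 30.0, 30.0],
    [30.0, 24.5, 25.0, 25.5, 30.0, 30.0],
    [30.0, 30.0, 30.0, 30.0, 30.0, 30.0],
]

def mean_in_box(image, box):
    """Mean pixel value inside box = (x_min, y_min, x_max, y_max),
    with exclusive maxima, in pixel indices."""
    x0, y0, x1, y1 = box
    values = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)

# Mean canopy temperature inside the detected box:
print(round(mean_in_box(thermal, (1, 1, 4, 3)), 2))  # 24.75
```

In the real pipeline the box comes from the Faster R-CNN detector and the raster from a geo-corrected thermal image, but the crop-and-summarize pattern is the same.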

Model training and performance evaluation
For supervised ML training, RGB and thermal images were labeled using Labelbox, a web-based annotation tool. Faster R-CNN models were trained using the Detecto Python package and validated using labeled testing sets for each data type. The capture date of each image was included in the metadata to assess changes in model performance throughout the growing season. To quantify detection accuracy, the intersection over union (IoU), recall, precision, F1-score, and accuracy were calculated.
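The evaluation metrics named above follow their standard definitions; a minimal sketch (not PO's exact implementation) for axis-aligned boxes:

```python
# Boxes are (x_min, y_min, x_max, y_max).

def iou(a, b):
    # Intersection over union of two axis-aligned bounding boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def detection_metrics(tp, fp, fn, tn=0):
    # Standard precision / recall / F1 / accuracy from confusion counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Two unit-overlap boxes: intersection 1, union 4 + 4 - 1 = 7.
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429
```

A detection is typically counted as a true positive when its IoU with a labeled box exceeds a chosen threshold (commonly 0.5), which is how the box-level IoU feeds the count-level metrics.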

Benchmarking
CCTools Makeflow and Work Queue output processing logs, which provide information such as the total number of workers, tasks completed, and processing times. The relationship between processing time and number of processing cores was investigated using these data.
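One way to summarize that relationship from the logs is to compute speedup and parallel efficiency against a baseline worker count; the sketch below uses illustrative numbers, not the measurements reported in this study.

```python
# Illustrative (workers, total minutes) pairs as might be parsed from
# Makeflow / Work Queue logs.
runs = [(64, 1600.0), (256, 430.0), (1024, 120.0)]

base_workers, base_time = runs[0]
for workers, minutes in runs:
    # Speedup relative to the smallest run; efficiency is speedup
    # divided by the increase in worker count.
    speedup = base_time / minutes
    efficiency = speedup / (workers / base_workers)
    print(workers, round(speedup, 2), round(efficiency, 2))
```

Efficiency below 1.0 at high worker counts reflects the usual scheduling and data-transfer overheads of distributed execution.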

Model performance evaluation
RGB-FS achieved the greatest median intersection over union (IoU), with RGB-DR and Thermal-FS having comparable values (Table 1, Figure 2). IoU increased noticeably over the growing season across all data types. Thermal-FS models had the greatest overall accuracy at 0.984, followed by RGB-DR at 0.970 and RGB-FS at 0.957 (Table 1).

Benchmarking
At the maximum number of workers tested in this study (1,024), processing times were: 235 minutes for 9,270 RGB images (140.7 GB), 235 minutes for 9,270 thermal images (5.4 GB), and 13 minutes for 39,678 PSII images (86.2 GB). These processing times include the geo-correction and plant detection steps.
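The throughput implied by these figures can be computed directly from the reported counts, sizes, and times:

```python
# Benchmark figures reported above (1,024 workers).
benchmarks = {
    "RGB":     {"images": 9270,  "gb": 140.7, "minutes": 235},
    "Thermal": {"images": 9270,  "gb": 5.4,   "minutes": 235},
    "PSII":    {"images": 39678, "gb": 86.2,  "minutes": 13},
}

for name, b in benchmarks.items():
    img_rate = b["images"] / b["minutes"]
    gb_rate = b["gb"] / b["minutes"]
    print(name, round(img_rate, 1), "img/min,", round(gb_rate, 2), "GB/min")
```

The identical RGB and thermal times despite very different data volumes suggest throughput here is dominated by per-image task overhead rather than raw data size.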

FUTURE DIRECTIONS
PO implements distributed computing, geospatial methods, and proximal sensing technologies to track plot or individual plant phenotypic traits throughout the growing season. The phenotypic trait data extracted from 2D image and 3D point cloud data provide large morphological and physiological phenomic datasets allowing for QTL mapping and GWAS studies. Additionally, point clouds generated by PO fill a gap in 3D ML training datasets, which can be used to develop and validate ML models for plant phenotyping applications [4]. Future directions include the development of a publicly-accessible, open-source VR experience that allows users to interact with 3D point clouds and visualize time-series phenotypic trait data within a single environment.

DATA AVAILABILITY STATEMENT
Raw, intermediate, and processed data are available on the CyVerse Data Commons. The PhytoOracle workflow repository can be found at https://github.com/LyonsLab/PhytoOracle and processing containers at https://github.com/phytooracle.

RESPONSE TO REVIEWERS
Responses to reviewers will be published.