Latent Space Phenotyping: Automatic Image-Based Phenotyping for Treatment Studies

Association mapping studies have enabled researchers to identify candidate loci for many important environmental tolerance factors, including agronomically relevant tolerance traits in plants. However, traditional genome-by-environment studies such as these require a phenotyping pipeline which is capable of accurately measuring stress responses, typically in an automated high-throughput context using image processing. In this work, we present Latent Space Phenotyping (LSP), a novel phenotyping method which is able to automatically detect and quantify response-to-treatment directly from images. We demonstrate example applications using data from an interspecific cross of the model C4 grass Setaria, a diversity panel of sorghum (S. bicolor), and the founder panel for a nested association mapping population of canola (Brassica napus L.). Using two synthetically generated image datasets, we then show that LSP is able to successfully recover the simulated QTL in both simple and complex synthetic imagery. We propose LSP as an alternative to traditional image analysis methods for phenotyping, enabling the phenotyping of arbitrary and potentially complex response traits without the need for engineering-complicated image-processing pipelines.


Introduction
Developing crop varieties that maintain a consistent yield across different environmental conditions is an important target for plant breeding as weather patterns become more variable due to global changes in climate. Breeding for yield stability requires characterization of an individual plant's response to biotic and abiotic stress [1] relative to a breeding population. Treatment studies, where some individuals are subjected to different growing conditions than a control group, play an important role in uncovering the genetic potential for tolerance of stress. Such experiments include genotype-by-environment (G × E) studies, where the treatment is often abiotic stress, such as water-limited growing conditions, or genotype-by-management (G × M) studies, where the treatment is a different application of inputs, such as herbicide application to assess herbicide tolerance or nitrogen application to assess nitrogen use efficiency. A core challenge for this broad class of experiments is the ability to quantify and characterize the physical changes observed in the treated plant population relative to the control population, i.e., to phenotype a plant's response-to-treatment. A number of factors make response-to-treatment a difficult phenotype to quantify. In general, stress affects multiple plant traits simultaneously. Stressors can also have a substantially different type and magnitude of effect on different plant species and different cultivars within the same species. Finally, quantifying response and recovery to stress is sensitive to the timing of observations and often requires repeated observations over a plant's life cycle in order to capture important phenological features. An accurate and quantitative assessment of response-to-treatment is particularly important for genomic association studies.
The use of association mapping techniques, such as genome-wide association studies (GWAS), has yielded many candidate loci for agronomically important quantitative traits in plants [2]. For food crops, genome-wide analysis of susceptibility or tolerance to abiotic stress factors such as drought [3], nitrogen deficiency [4], salinity [5], or other factors leads to the discovery of genetic differences underlying these agronomically important characteristics. These treatment-based GWAS studies are capable of identifying tolerance alleles which could result in a tolerance to a wider variety of environmental conditions if, for example, introgressed into commercial cultivars. GWAS studies, however, require large datasets of phenotypic data in order to map associations with genomic data [6].
High-throughput phenotyping (HTP) technologies have advanced rapidly in the past five years to meet the demand for large phenotypic datasets. Recently, image-based HTP has gained popularity, because photographing plants in greenhouses or fields with robots and drones has allowed data collection at yet larger scales. The phenotyping bottleneck has shifted from collecting images, which can now be done routinely, to making sense of those images in order to extract phenotypic information. Although there is a wide selection of software tools available for extracting phenotype information from images [7,8], the design and implementation of specific phenotyping pipelines is often required for individual studies due to inconsistencies between datasets. This is true of both traditional image analysis where thresholds and parameters need to be adjusted and more recent machine learning techniques which require the timeconsuming manual annotation of training data. In addition, some phenotypes are difficult to measure from images, and ad hoc solutions tailored to a particular imaging modality or dataset are often required in place of more general ones.
To overcome the many challenges associated with imagebased phenotyping, we propose the Latent Space Phenotyping (LSP), a novel image analysis technique for automatically quantifying response to a treatment from sequences of images in a treatment study. LSP is related to a broad family of techniques known as latent variable models. These models have been previously used for modelling variation in image data via variational inference, using variational autoencoders (VAEs) [9]. LSP instead constructs a latent representation that best discriminates between image sequences of control and treated samples of a plant population and then measures differences among individuals within the latent space to quantify the temporal progression of the effect of the treatment. The key characteristic of LSP in comparison to existing image-based phenotyping methods is that the phenotype estimated using an image analysis pipeline is replaced with an abstract learned concept of the response-to-treatment, inferred automatically from the image data using deep learning techniques. In this way, any visually consistent response can be detected and differentiated, whether that response is a difference in size, shape, color, or morphology. By abstracting the visual response to the treatment, LSP is able to detect and quantify complex morphological changes and combined changes of multiple phenotypes which would not only be extremely difficult to quantify using an image processing pipeline, but may not even be apparent to a researcher as correlating with the treatment. In this study, we use a combination of natural and simulated datasets to demonstrate that LSP is effective across different plant species (Setaria, sorghum, Brassica napus L., and simulated Arabidopsis thaliana) and different types of treatment studies (drought stress, nitrogen deficiency, simulated changes in leaf elevation, and simulated changes in growth rate).

Materials and Methods
Latent Space Phenotyping consists of a three-stage process as illustrated in Figure 1. First, we train an embedding network to classify samples as either treated or control based on a sequence of images captured over their growth cycle (Figure 1(a)). During the training process, the embedding network learns image features that best capture how the plants in the dataset respond to the experimental treatment, e.g., drought stress or nitrogen deficiency. These embeddings form an n-dimensional latent space, where individual plant images are embedded as abstract n-dimensional points. Second, we train a decoding network to perform the reverse process of projecting embedded points from the latent space back to images, in order to obtain a meaningful representation of the latent space ( Figure 1(b)). Finally, we measure response-to-treatment for individual accessions by tracing their path through the latent space from the initial to final time points in their growth cycle. Treated and control replicates of the same accession are expected to have different paths through the latent space. For example, a droughtstressed individual may have a "shorter" path than its control counterpart, or the paths for a treated and control sample may start at similar locations in latent space but diverge by the final time point in their growth cycle due to visual differences caused by the treatment. Importantly, the embedded paths for control and treatment samples of the same accession are traced and measured in image space (by mapping the path through the decoder) so that these differences are physically meaningful (Figure 1(c)). The differences between treated/control samples represent a phenotype for responseto-treatment that is derived directly and automatically from the original image dataset for the experiment. The response-to-treatment phenotype can be used as a trait value with any existing genome-wide association software tool or interpreted as an objective response rating, e.g. herbicide tolerance, to inform plant breeding decisions. A complete implementation of LSP, called LSP-Lab, is provided at https://github.com/p2irc/lsplab.

Dataset Requirements.
Performing an LSP analysis requires an image dataset, comprised of images taken at an arbitrary number (U) of time points during cultivation for each individual in each of the treatment and control conditions. There should be no missing time points; otherwise, the entire sequence cannot be included. Although sequences of differing length within the same experiment could be used in principle, this has not yet been observed and so we do not support this in our implementation to avoid unspecified behavior. The initial time point should ideally be zero days after stress (DAS), in order to establish this as the baseline for determination of the effect of the treatment. The provided implementation is capable of splitting the analysis into sections of time, for multiphase experiments. Controlled imaging (using imaging booths, stages, or growth chambers) is recommended in order to maintain consistency in image characteristics such as distance from the camera and the position of the specimen in the frame. However, the method is robust to noise in the images (such as variations in lighting) as long as the noise is consistent between both treatment and control samples, not specific to one condition.

Embedding Network.
In order to measure a plant's response-to-treatment, it is first necessary to determine which visual characteristics in the images indicate the presence of this effect. To learn the visual features correlating with treatment, LSP utilizes a learned projection of images from a population into an n-dimensional latent space, a process known as embedding. The embedding is shaped by a supervised learning task, which trains a convolutional neural network (CNN) to extract visual features relevant to the discrimination of treatment and control samples.
Performing this embedding allows the method to learn the latent structure of the response and gives the method the ability to overlook any morphological or temporal characteristics that may be different between accessions but do not correspond to response to the treatment.
The process of training the embedding network requires only treatment/control labels for each sample. The input to the training process is a sequence of images taken for each individual in the treatment and control conditions. The genotypes are divided into training and validation sets with a random 80-20 split. Images are standardized by subtracting the mean pixel value and dividing by the standard deviation and then used as input to a CNN. For each time point image, the activations of the last fully connected layer in this CNN are used as the input to a Long-Short Term Memory (LSTM) network ( Figure S1).
We describe both CNNs and LSTMs briefly here but refer to the literature for more detailed summaries of deep learning in general and these network variants in particular [10][11][12]. A CNN can be used to learn local feature extractors from image data. The capability of a CNN to learn a complex representation of the data in this way allows the technique to perform well in many complicated image analysis tasks, such as image classification, object detection, semantic and instance segmentation, and many other application areas [10].
CNNs have been used extensively in the recent literature on image-based plant phenotyping, showing promise in several areas, including disease detection and organ counting [12][13][14][15][16]. For the process of learning an embedding, we implement a simple four-layer convolutional neural network as  The process consists of three phases, which take place in sequence. First, an embedding process projects images into latent space. Second, a decoder is trained to convert these embeddings back into the input space. Lastly, the decoder is used to calculate a geodesic path between the embeddings for the initial and final time points. described in Table S1. Larger architectures were tested and found to show no difference in the experiments reported in this study. Recurrent neural networks (RNNs) are an extension to neural networks which allows for the use of sequential data. RNNs are a popular tool for time series, video, and natural language problems, for which sequence is an important factor. Briefly, RNNs maintain an internal state which is updated through the sequence, allowing them to incorporate information about the past into the current time point. LSTMs are an extension to RNNs which incorporate a more complicated internal state which is capable of selectively retaining information about the past. LSTMs have also appeared in the plant phenotyping literature, demonstrating that they are able to successfully learn a model of temporal growth dynamics in an accession classification task [11]. LSTMs have also been used as a model of spatial attention in the segmentation of individual leaves from images of rosette plants [17].
The final time point of the LSTM feeds into a two-layer feed-forward neural network, the output of which uses the treatment/control labels of the training images as classification targets, using a standard sigmoid cross entropy loss for training. The loss on the validation set is monitored during training to detect whether the embedding network has learned a general concept of the response to the treatment, as opposed to simply overfitting the training data. For the purposes of our application, we prefer embeddings which create only the minimum variance in the latent space necessary for performing the supervised classification task. That is, we prefer embeddings for which variance in most dimensions is close to zero. This helps the subsequent phase of training (Section 2.3) to recover differences in the images which correspond to a generalized concept of response-to-treatment, instead of learning features which are specific to one sample or to a group of samples. To incentivize this, we include an additional loss term for the embedding process alongside the cross entropy loss and L 2 regularization loss, called the variance loss (L v ), where E is the mean-centered matrix of embeddings for a batch of m sequences of U images. The result is sometimes called the generalized variance [18]. In addition, we add a small constant λ v to the diagonal of C. This is for two reasons-first, it prevents the case where zero variance in a dimension causes C to be noninvertible, stopping training. Secondly, it stops the optimization from shrinking the variance in one dimension to an infinitesimally small value, effectively pushing the determinant to zero regardless of the variance in the other dimensions and allowing the optimization to ignore the L v term altogether. Ordinarily, we would find it necessary to restrict the multidimensional variance in the latent space by constraining the size of the latent space n to the minimum size necessary for convergence. We find that using the variance loss term allows us to use a standard latent space size of n = 16 for all experiments, and individual datasets will utilize as few of these available degrees of freedom as necessary as dictated by this term in the loss function. A value of λ v = 0:2 was used in all experiments. The Adam optimization method [19] is used for training with an initial learning rate of 1e − 3.
After the network has finished training, the images of the training and validation sets are then projected into latent space. This embedding is given by the activations of the final fully connected layer in the CNN. In this way, each of the images in each of the treatment and control sequences can be encoded as n-dimensional points in the latent space. The final result of the embedding step is all images projected into the same n-dimensional space, which can be visualized using a dimensionality reduction technique (here we use PCA). The embedding plot is used only for visualization purposes, since distances on the embedding plot do not correspond to semantic distance between samples, an issue discussed in Section 2.4. Creating an embedding plot with exact distances between accessions would require calculating on the order of ðUmÞ 2 pairwise paths in the latent space, which is intractable. However, generating the embedding plot using Euclidean distances between embeddings often illustrates stratification of samples in the latent space, albeit with approximate accuracy.

Decoding Network.
The second phase of the method involves training a decoder which performs the same function as the embedding process described in Section 2.2, but in reverse. The purpose of the decoder is to define the mapping from latent vectors to image space, discovering the latent structure in the image space, and allowing us to calculate paths in the latent space during the subsequent phase (Section 2.4). The structure of the decoder network consists of a series of convolutional layers followed by transposed convolution layers, which increase the spatial resolution of the input (Table S2). This architecture is similar to those used in other generative tasks, with the exception that there is no linear layer before the first convolutional layer, to prevent the decoder from overfitting. Samples in the training set are projected by the finalized embedding CNN into the latent space, and then the decoder projects these latent space vectors back into the input space ( Figure S2). A reconstruction loss function quantifies the difference between the original image and its reconstruction provided by the decoder in terms of mean squared error (MSE). Compared to training the embedding network, a lower learning rate of 1e − 4 is used for training the decoder. Since the embeddings are derived from the supervised classification task, the only features which are encoded in the latent representation are those which are correlated with the response to the treatment. For example, in Figure S2 (middle), the induced angle of the synthetic rosette (the plant leans slightly to the left) is not reflected in the decoder's prediction, since plant angle is not encoded in the latent space due to it not being correlated with the simulated response-to-treatment. The leaf elevation angle, however, does match between the real and predicted images. More examples of encoded and decoded images are shown in Figure S3. In practice, the decoder's output for an input with support in the latent space will tend towards the mean of all images which embed to a point near this location. This mean image should be free of the specific characteristics of any particular accession or individual. The use of MSE creates decoded images which appear blurry-this is an expected result and helps produce smooth interpolations in the image space when calculating paths in the latent space as described in Section 2.4.

Measuring
Response-to-Treatment Using the Latent Space. In the third and final part of the process, we seek to quantify the change in the decoder's output as we travel in the latent space over time, with respect to the embeddings of the images at each time point for a given individual. In other words, we are interested in characterizing the semantic distance between decoded images at the initial and final time points-that is, the distance between these images in terms of stress propagation. This characterization of semantic distance needs to be considered in terms of the geodesic path on the latent space manifold, rather than Euclidean distance in the latent space or in the image space. Figure S4 illustrates the difference between the Euclidean distance and the geodesic distance in a hypothetical latent space for a toy example.
In Section 2.3, we defined a decoder (or a generator function) g : Z ⟶ X where Z is the latent space and X is the input space of the CNN ( Figure S2). Since g is trivially nonlinear, this implies that Z is not a Euclidean space, but a (pseudo-) Riemann manifold [20]. The geodesic distance of a path γ on a latent space Riemann manifold mapped through a generator g in the continuous case is given by mapping the path through g via the Jacobian J γ [20]. Minimizing this path in the discrete case can be accomplished by optimizing the squared pairwise distance between a series of intermediate path vertices, minimizing arg min where h is a difference function [21]. Performing this optimization on the latent spaces generated by LSP is possible using a standard choice for h such as the L 2 distance in the image space, since this distance in the image space of the decoder is what we seek to optimize when determining paths through the latent space. Using the embeddings of the images for the initial and final time points provide the start and end points for a path through the latent space.  [21], we start from a linear interpolation between stationary vertices. This allows us to perform the optimization all at once, instead of dividing the task into multiple successive optimizations which is potentially more expensive. Calculating the total path length as in the sum in Equation (3) describes the individual in a single unitless scalar value, indicating the difference in semantic distance travelled over the course of the treatment. Intuitively, the process can be thought of as tracing a path through latent space from where the initial time point embeds to where the final time point embeds. In order to find this path, the current position in latent space is decoded into image space by the decoder. Then, the position is moved in the direction which creates the smallest change in this decoded image. As the path is traced in this way, watching the output of the decoder reveals a smooth "animation" where the number of animation frames corresponds to the number of path vertices. The trait value corresponds to how much change there is between each frame and the next, summed up over the entire path. It is important to note that the distance travelled in latent space is irrelevant-the measurement occurs in the output space of the decoder.
When tracing the geodesic path between the first and final time points as described, we refer to this as the longitudinal mode. However, it is also possible to perform a crosssectional analysis for populations where there is one treated and one control sample for each accession by tracing a path between the final time point for the treated sample and the final time point for the control sample. Results for both of the experiments on synthetic data presented here are performed in cross-sectional mode, although longitudinal mode also provides significant results on these datasets. The natural datasets are run in longitudinal mode to match the format of the original study design.

Results
We evaluated the proposed method using three natural datasets across different plant species and different treatment types, including a population of recombinant inbred lines of Foxtail grass (Setaria) treated with drought stress [3], a panel of sorghum treated with nitrogen deficiency [22], and the founders of a nested association mapping population of canola Brassica napus treated with drought stress. We performed two additional experiments using synthetic datasets, including a model of Arabidopsis thaliana, where ground truth candidate loci were verified.
3.1. Setaria RIL (S. italica x S. viridis). We used a published dataset of a recombinant inbred line (RIL) population of the C 4 model grass Setaria ( Figure S7) [3,23]. The dataset includes drought and well-watered conditions and has been used to detect QTL relevant to water use efficiency and drought tolerance [3,24].
The dataset was used as provided by the authors of the original study [3] with a few modifications. The image data was downsampled to 411 by 490 pixels, to allow for a more practical input size for the CNN. Since the camera varies levels of optical zoom over the course of the trial, it is also necessary to reverse the optical zoom by cropping and resizing images to a consistent pixel distance. In order to minimize the effect of perspective shift, the plants were cropped from the top of the pot to the top of the imaging booth, between the left and right scaffolding pieces. This effectively removes the background objects and isolates the plant on a white background. Removing the background is not necessary in the general case-that is, if the background does not change over time. However, since the optical zoom creates differences in background objects, it is practical to remove the background to remove this potential source of noise. The February 1 st time point was selected as the initial time point, since many of the earlier time points were taken before emergence. In total, 1,138 individuals representing 189 genotypes and six time points were used. The SNP calls were used as provided by the authors, resulting in a collection of 1,595 SNPs for this experiment. The latent distance values generated by the proposed method were used as trait values for the multiple QTL biparental linkage mapping pipeline provided by the authors of the dataset, in order to replicate the methodology used in the published results.
A histogram of latent distance values for individuals in each of the water-limited and well-watered conditions is shown in Figure 2(a). A total of four QTL were detected with respect to the ratio of the trait under the control condition to the trait under the treatment condition. However, we discard these QTL as potentially spurious under the guidance of the original paper, which found that most of the QTL found using the ratio of the trait values were not also recovered using the difference in trait values [3]. For the difference in trait values between conditions, we identified two QTL associated with drought tolerance in the Setaria RIL population (Figure 2(c)). These loci are reported by Feldman et al. as corresponding to plant size and water use efficiency ratio (5@15, within the 95% confidence interval of the reported peak of 5@13.7) and plant size and water use efficiency model fit (7@34). Although we were only successful in replicating two of the genotype-by-environment QTL from the published study, many of these previously reported QTL  correspond to a water use model incorporating evapotranspiration; not a single trait derived directly from the images such as vegetation area. The normality criterion for ANOVA is violated, and so we use a nonparametric Kruskal-Wallis test.
Running this test, we observe high significance for the effect of genotype on the trait (p = 3:77 −9 ). The same was found for the interaction (p = 2:2 −16 ).

Sorghum (S. bicolor).
For this experiment, we used an existing study of nitrogen deficiency in sorghum [22]. The authors of the dataset applied a nitrogen treatment to a panel of 30 different sorghum genotypes. Individuals were placed into the control condition with 100% ammonium and 100% nitrate (100/100), 50% ammonium and 10% nitrate condition (50/10), or 10% ammonium and 10% nitrate condition (10/10). Images were analyzed with respect to various shape and color features to detect the presence of a response to the treatment. No association mapping was performed in the published study, and so GWAS results are not provided here. Due to the small size of the dataset, we use data augmentation to help prevent overfitting. This involves introducing random horizontal flips, randomly adjusting brightness and contrast, and cropping to a random area of the image during the training of the embedding network. Images were downsampled to 245 by 205 pixels.
In the published study, the authors found that a PCA of 17 different shape features was able to distinguish the control condition (100/100), the high-intensity treatment (10/10), and the low-intensity treatment (50/10). A PCA of various hue and intensity features of segmented vegetation pixels was able to distinguish between the 100/100 and 10/10 treatments but unable to distinguish between the 50/10 and 10/10 conditions. An LSP analysis of the same dataset was able to distinguish between the 100/100 and 10/10 conditions (Figure 3) but failed to converge when tasked with differentiating between the 50/10 and 10/10 treatments. This implies that differences between the two lower nitrogen conditions were too subtle to be detected by either LSP or the collection of pixel intensity features. The LSP method could be adapted to analyze all three conditions in a single experiment by replacing the sigmoid cross entropy operation in the encoder with a softmax cross entropy operation. However, the analysis was split into pairs of conditions because the nonconvergence of the 50/10-10/10 pair also prevents convergence of the three-condition experiment.

Canola (Brassica napus).
Next, we performed validation on the founder panel of a nested association mapping population of B. napus ( Figure S8). In total, 50 genotypes were used in three replications in each of the treated and control conditions, for a total of 300 individuals. Images were taken daily during the early growth period and subsequently every other day and were downsampled to 245 by 205 pixels. As with the previous datasets, the plants were imaged in a LemnaTec indoor plant imaging system for a total of 40 days and treated individuals were subjected to a drought treatment. In contrast to the Setaria RIL experiment, the canola study involved three phases. First, an initial growth phase which lasted 14 days where no treatment was applied. Next, a 20-day drought phase was applied where watering was reduced from 100% to 40% field capacity, while the pots were imaged every two days. Lastly, a 6-day recovery phase took place where individuals were again watered uniformly across conditions. The results of the LSP analysis are shown in Figure 4. Because this experiment involved three distinct treatment phases, the analysis was performed on each of the three relevant portions of the latent space path in series. This gives a separate set of results for each phase, where the performance of individuals can be assessed within each phase. Interpolation was not used on this dataset as it already contains the target number of 30 path vertices. As with the sorghum trial, the NAM panel of B. napus is too underpowered to find QTL underlying tolerance to drought. However, the trait value output by the proposed method distinguishes between the two conditions in the treatment phase. Phenotypic response as measured by geodesic distance was also readily observed when inspecting additional factorial variables, such as flowering time ( Figure 4(b)). It has been established that drought affects productivity in canola differently depending on the developmental stage [25][26][27][28], In order to determine whether multiple replications of the same line were clustered together in the output, a onetailed F test was performed using the within-group variance for each of the 100 genotype-treatment pairs. For the pretreatment, treatment, and posttreatment stages, we found that 26, 34, and 29 of the 100 groups were significantly grouped in the output, respectively (p < 0:05). We also explored the effect of treatment by running two-way ANO-VAs for the interaction of genotype and treatment on the trait value (p = 6:9 −4 ) as well as flowering time group and treatment on the trait value (p = 2:29 −8 ).

Synthetic Arabidopsis thaliana Model.
Synthetic images of rosettes [29] and roots [30] have been used previously to train models for phenotyping tasks. Here we used synthetically generated image data as it allows us to introduce specific variance in the imagery based on a simulated casual SNP and then investigate the method's ability to recover that variance on the other end by running a GWAS on the simulated population. We use FaST-LMM [31] to perform this analysis and generate the Manhattan plots.
For this purpose, we used an existing L-system-based model of an A. thaliana rosette [29], based on observations and measurements of the development of real A. thaliana rosettes [32]. The model was run in the lpfg simulation program [33], which simulated the development of the plant over time, and rendered the resulting images. We selected seven of these images corresponding to different time points of the simulation for the LSP analysis.
To generate the synthetic A. thaliana genetic dataset, we begin from a real A. thaliana genotype database known as the A. thaliana polymorphism database [34]. This dataset includes 214,051 SNPs for 1,307 different accessions of A. thaliana. A single causal SNP was chosen at random, and we let that SNP represent a polymorphism which confers tolerance to a hypothetical treatment that affects the plant's leaf elevation angle. The elevation angle of the plant's leaves is sampled from a normal distribution which is parameterized according to whether the sample is untreated, treated-andresistant, or treated-and-not-resistant. Figure S5 shows the effect of the simulated treatment where the angle of the leaves on the treated plant is increased relative to the untreated sample. Other parameters in the model, such as growth rate, are normally sampled for each accession. It should be noted that, the growth rate of the simulated A. thaliana plant is completely uncorrelated from the treatment, as are multiple other model parameters. This means that, although the effect of the treatment is still visually apparent, the embedding network must learn a complex visual concept and cannot rely on measuring the number of plant pixels to discriminate between treated and untreated samples. Since the leaf elevation is modulated as a function of plant maturity, the effect of the treatment is not visible in plants with a low growth rate, adding considerable noise and further increasing the complexity of the task. Also note that performing phenotyping on this image dataset would be challenging, since estimating leaf angle from images is a nontrivial image processing task, especially in the absence of depth information [35,36]. The method is able to successfully determine the simulated causal locus on chromosome one with no false positives ( Figure 5). Figure S6 shows a comparison between the proposed method and a naive solution where the image distance (Euclidean distance in the pixel space) is calculated between each pair of images, with no embedding or decoding step. Such an approach would be successful on the simple synthetic imagery described in Section 3.5 but fails in this more complex case.
3.5. Synthetic Circle Model. Lastly, we performed an experiment with synthetic imagery intended to show how the LSP method performs under basic conditions and how its outputs relate to manually measured phenotypes in this case. For this purpose, we use a simple model of the A. thaliana rosette which depicts individuals as white circles on black backgrounds, with a hypothetical treatment causing a decreased growth rate of the circle over time in this simple model.
For each of the control and treatment conditions, a sequence of six time points is generated, with images representing a circle growing from an initial diameter (sampled from a normal distribution) to a final diameter. The growth rate of the diameter is drawn from a normal distribution, parameterized according to condition. Additionally, the growth rate under the treated condition is influenced by seven hypothetical QTL drawn from a Bernoulli distribution, as well as the presence of the minor allele at a randomly chosen locus in the SNP data. The ground truth growth rate values were used to generate the synthetic imagery. Performing an LSP analysis of this dataset allows us to forego phenotyping and use the synthetic image data directly. The embedding plot representing the learned embedding of the image data as well as the Manhattan plot is shown in Figure 6. LSP is able to recover the simulated causal locus with no false positives in this simple application. Relating LSP to the established method of using image processing to extract the growth rate phenotype, we examine the correlation between pairwise distances in the latent space and differences in measured phenotype between the same accessions. There is significant correlation between calculated geodesic distances in the latent space and the relative growth in the number of white pixels in the synthetic circle dataset (R = 0:93, p < 0:01).

Discussion
The Latent Space Phenotyping method as described has some limitations, including increased computational requirements compared to the majority of image-based phenotyping techniques. Since the method involves multiple deep neural networks, the use of GPUs is advisable to perform these optimizations in a tractable amount of time. The experiments presented here were performed on two NVIDIA Titan V GPUs, and the time required per experiment ranged from two to eight hours depending on the number of accessions and the number of time points in the dataset. Beyond computational requirements, another limitation of the method is a substantial difference in interpretability compared to GWAS using standard image-based phenotyping techniques. The traits measured with these standard techniques often have a direct and interpretable relationship with the response to the treatment-for example, it has been shown that the number of plant pixels in an image can be used as a proxy for bio-mass [37]. Therefore, the measured phenotype can be directly interpreted as the biomass of the sample and QTL can be found which correlate with the effect of the treatment on biomass. In the case of LSP, the individual's response to the treatment is abstracted and quantified only relative to other individuals in the dataset. Interpretability techniques such as saliency maps [38] ( Figure S9) can help to elucidate relevant regions in the images, but the measurements still lack a direct biological interpretation in the same way as measurements of biomass. Therefore, candidate loci obtained through LSP must be interpreted differently, and biological explanations must be inferred from the function of the detected loci. In addition, since the method is nondeterministic due to randomized initial weights and random minibatching (as with all deep learning methods), repeating the same experiment may output different results.
Although there is no guarantee that the trait values reported by the method will be consistent between runs, we found the reported QTL to be consistent across runs for both synthetic datasets. However, a repeat of the Setaria RIL experiment resulted in a similar histogram and a betweencondition p value on the same order as the results reported in Figure 2, but both previously detected QTL fell below the significance threshold. This is an inevitable consequence of using a nondeterministic method. However, it should be noted that deterministic methods are not inherently repeatable either-different thresholds, outlier detection methods, and data transformations affect the detected loci in these cases. It should also be noted that, although we have endeavoured to present a range of datasets in this work, we have still only scraped the surface of plant phenotyping image data. It remains to be seen how the method responds to other plants with significantly different architectures to those shown here. Of particular interest for future work are plant and root structures with highly branched and articulated forms. Also, although the method is theoretically designed to be robust to the visual differences between genotypes, it is unknown how the method responds if these differences are significantly larger than the differences due to treatment. The testing of LSP in less controlled imaging contexts such as in outdoor field conditions is an important direction for future work. LSP should also be validated in experiments with larger natural datasets, involving thousands of genotypes. Since the computational expense scales linearly with the number of individuals, such experiments should be feasible. This work is related to previously described methods which are capable of automatically quantifying differences in morphology between individuals, notably the persistenthomology (PH) method [39]. While PH is focused on automatic shape description, LSP instead learns temporal models of stress which can be dependent on, or independent from, shape. Additionally, PH techniques usually involve the design of a unique system for each shape description task; LSP aims to provide a completely general technique which is not tailored to any particular dataset.
Finally, performing the embedding step can be seen as a type of dimensionality reduction, from the highdimensional image space to a lower-dimensional embedding space. Doing dimensionality reduction in images has been performed before using techniques such as principal component analysis (PCA) or autoencoders. By performing dimensionality reduction on images, it is possible to recover factors which correspond to pixel variance in the images. For example, the Eigenface technique [40] uses PCA on small, greyscale images of faces to learn a series of principle components (PCs). A new image of a face can be encoded as a linear combination of these PCs, and this representation can be used to compare the new face against a database of examples to determine similarity. While methods such as Eigenfaces provide a feature vector describing the most major points of variance in the image space, LSP specifically avoids this approach. This is because the most major variations in the images are likely to be from sources completely unrelated to the treatment. For example, the emergence of a new organ creates significant variance in the images, even if the emergence of that organ is not due to the treatment. Using methods such as PCA or autoencoders results in an arbitrary number of features, some or none of which may be useful to the description of the effect of the treatment. Attempting to embed images using other manifold learning techniques such as Multidimensional Scaling (MDS) or Locally Linear Embedding (LLE) suffers from a similar problem-they will attempt to preserve likely meaningless pairwise relationships in the pixel space.
Let us imagine that one is able to accept the above shortcomings of dimensionality reduction methods such as PCA and autoencoders. Performing the analysis on the full dataset is likely to mostly identify features related to maturity, since this is often the largest cause of variance in the images (and using the full dataset with techniques such as PCA which do not use minibatching is likely intractable due to memory restrictions). To circumvent this, one could imagine taking the high-dimensional features provided by such methods in separate analyses at each time point. Although most of these features are likely irrelevant, one could use nonparametric significance testing to determine which of the features are correlated with the presence of the treatment. This was validated on both synthetic datasets and the B. napus dataset using a Mann-Whitney U test. Various PCs were shown to contain information relevant to the condition and even appeared as describing the most variance (PC1 and PC2) during most of the treatment phase of the B. napus trial. However, a more subtle problem with a simplistic latent model and lack of a temporal component is that there cannot be a measurement of the progress of a single, continuous, nonlinear process through time, especially if that process contains multiple different stages (such as wilting, followed by senescence). Both PCA and autoencoders are able to encode images and provide reconstructions, making it possible to determine difference in the pixel space given two encoded samples. However, differences in the pixel space can only be calculated between two individuals at discrete time points, and the evolution of these differences across time points cannot be assessed. LSP, on the other hand, integrates stress and maturity in a single continuous space which can be smoothly interpolated through. This space can represent complex, continuous, and nonlinear changes in different regions. Although PCA was able to detect the presence of the treatment in both the synthetic datasets and the B. napus experiment, the scores on these significant PCs predictably failed to discriminate between the treatment and control conditions across time.
The results of five experiments demonstrate the capability of LSP to automatically form accurate learned concepts of response-to-treatment from images and recover QTL with a very low false positive rate. As an automated system, the proposed method is exempt from the considerable challenges which arise in developing and deploying image analysis pipelines to first measure phenotypes from images. It is also free from a priori assumptions about which visually evident features are caused by the treatment, automatically detecting area, leaf angle, drought stress, and nitrogen stress in five different experiments. Replicating more candidate loci from existing studies will help continue to validate the technique and encourage further study on latent space methods in the biological sciences.