Crop Water Status Analysis from Complex Agricultural Data Using UMAP-Based Local Biplot

To optimize growth and management, precision agriculture relies on a deep understanding of agricultural dynamics, particularly crop water status analysis. Unmanned aerial vehicles equipped with remote sensors make it possible to efficiently acquire high-resolution spatiotemporal samples. However, non-linear relationships among data features, localized within specific subgroups, frequently emerge in agricultural data. Interpreting these complex patterns requires sophisticated analysis due to the presence of noise, high variability, and non-stationary behavior in the collected samples. Here, we introduce Local Biplot, a methodological framework tailored for discerning meaningful data patterns in non-stationary contexts for precision agriculture. Local Biplot relies on the well-known uniform manifold approximation and projection (UMAP) method and local affine transformations to codify non-stationary and non-linear data patterns while maintaining interpretability. This lets us find important clusters for transformation and projection within a single global axis pair. Hence, our framework encompasses variable and observational contributions within individual clusters. At the same time, we provide a relevance analysis strategy to help explain why those clusters exist, facilitating the understanding of data dynamics while favoring interpretability. We demonstrate our method's capabilities through experiments on both synthetic and real-world datasets, covering scenarios involving grass and rice crops. Moreover, we use random forest and linear regression models to predict water status variables from our Local Biplot-based feature ranking and clusters. Our findings reveal enhanced clustering and prediction capability while emphasizing the importance of input features in precision agriculture. As a result, Local Biplot is a useful tool to visualize, analyze, and compare the intricate underlying patterns and internal structures of complex agricultural datasets.


Introduction
The accurate assessment of crop water status, which refers to the level of hydration within a plant, is critical in precision agriculture (PA) for water-intensive crops. Furthermore, climate change necessitates optimizing water usage to meet increased drought threats [1,2]. By monitoring crop water status indicators, such as soil moisture and plant stress, and understanding crop responses, we can tailor irrigation practices [3,4]. When it comes to PA, understanding how temporal or conditional variations in different factors can significantly impact crop growth, productivity, and overall agricultural management is essential [5]. Still, dynamic changes in soil moisture due to spatial and temporal variability, life cycle patterns, plant water uptake, environmental aspects, and irrigation practices can exhibit non-stationary behaviors, meaning they do not follow a fixed distribution or consistent patterns over space and time. For example, in rice crops, temperature fluctuations, soil moisture levels, and day length can dynamically influence flowering time and other plant properties [6]. Thus, addressing non-linear and non-stationary patterns in agricultural data analysis is essential for accurately assessing water status, improving decision-making, and effective agricultural management [7,8].
Conventional methods like soil moisture sensors, leaf-level measurements, laboratory analysis, and manual field surveys are often time-consuming and labor-intensive. Recent developments in unmanned aerial vehicle (UAV)-based remote sensing (RS) techniques make data collection for crop characterization and monitoring more efficient, as they are non-invasive, non-destructive, accurate, and cost-effective [9]. By combining the different wavelengths of light that plants reflect and absorb, vegetation indices (VIs) provide valuable insights, such as canopy biomass and chlorophyll content [10]. Nonetheless, effectively extracting useful information from the large volumes of samples generated by integrating field data with high-resolution remote and proximal sensors can be cumbersome [11]. Additionally, noise, data source conflicts, and spatiotemporal UAV disparities caused by weather changes and sub-optimal sampling further complicate the training of accurate and reliable models [12,13].
As agricultural research evolves, different techniques have emerged to conveniently explore and organize data to extract valuable knowledge [14]. These approaches include descriptive and exploratory analysis [15,16], clustering [17], multivariate analysis for exploring inter-variable relationships [18], time series analysis for studying temporal patterns [19], and predictive modeling [20,21]. Visual representations, such as biplots [22,23], are typically the preferred method for achieving a 2D plot that is immediate, direct, and simple to comprehend for both input feature and sample relationships in a low-dimensional space. Biplots assist in the identification of critical variables, supporting tasks such as cluster visualization, correlation highlighting, and feature selection. While traditional biplots remain fundamental, advanced statistical tools have emerged to address some of their limitations, focusing on genotype-by-environment interactions to highlight superior crop varieties. Thus, their suitability depends on the specific research question and data characteristics [16,24]. Additionally, traditional statistical methods face significant challenges when dealing with the complexities inherent in high-dimensional agricultural datasets [25]. One of the primary constraints is their inability to accurately represent the true dynamics of agricultural processes due to their difficulty with non-linear relationships. Because the variables frequently interact in complex and non-linear ways, such methods tend to yield oversimplified models [26].
Here, we introduce the Local Biplot methodological framework, which uses 2D data visualization and input feature ranking within localized clusters to identify meaningful patterns, with a specific focus on water status analysis in multi-temporal agricultural data. Our Local Biplot employs a uniform manifold approximation and projection (UMAP)-based algorithm to embed the input data within a 2D feature space, dealing with non-linear and non-stationary agricultural data dynamics [27]. Then, the well-known K-means algorithm is used to cluster the samples from the UMAP 2D space. Further, to provide a complete picture of the local relationships between the variables and samples, a local affine transformation is applied to map the input feature variability-based rankings to the 2D low-dimensional space. Hence, this framework encompasses variable/observation contributions within individual clusters in the same figure, facilitating the understanding of data dynamics to overcome pressing agricultural challenges such as climate variability, unsustainable agricultural practices, and inefficient use of water resources. Local Biplot is tested on both synthetic and real-world datasets. In particular, forage grasses and rice crops are tested to highlight relevant agricultural patterns related to water status studies in PA. Moreover, to investigate the influence of non-stationary data dynamics and inter-cluster relationships in the assessment of water content-related variables, we conducted experiments using random forest (RF) and linear regression (LR) models to estimate crop water status variables, such as the breeding score for grass and canopy water content (CWC) for rice.
The remainder of this paper is organized as follows: Section 2 describes the materials and methods. Section 3 presents the experiments and results, and Section 4 discusses the results obtained. Finally, Section 5 outlines the conclusions and future work.

Biplot Fundamentals
Let $X \in \mathbb{R}^{N \times P}$ be an input matrix with centered and standardized $P$-dimensional features and $N$ samples represented by row vectors $x_n \in \mathbb{R}^{P}$. Thus, $X$ can be decomposed as $X = U S V^{\top}$, where $U \in \mathbb{R}^{N \times M}$ and $V \in \mathbb{R}^{P \times M}$ are orthonormal matrices, and $S \in \mathbb{R}^{M \times M}$ is diagonal with non-negative elements. This singular value decomposition (SVD) allows for a low-dimensional representation $\hat{X} = U_M S_M V_M^{\top}$, optimizing:

$$\hat{X} = \arg\min_{\operatorname{rank}(\hat{X}) \le M} \left\| X - \hat{X} \right\|_F^2 \tag{1}$$

Of note, the singular vectors for the $M$ highest singular values in $S$ are held by $U_M$ and $V_M$. In biplot analysis, $U_M S_M^{0.5}$ and $S_M^{0.5} V_M^{\top}$, with $M = 2$, are constructed to visualize relationships between samples and features, respectively. These matrices project data onto a 2D space, highlighting input data clusters and feature linear dependencies.
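As a hedged illustration of the construction above, the following dependency-free Python sketch computes the sample coordinates $U_M S_M^{0.5}$ and the feature coordinates $V_M S_M^{0.5}$ for a toy matrix. The helper names (`top_eigenpairs`, `biplot_coordinates`), the power-iteration solver, and the toy data (centered but not standardized, for brevity) are illustrative assumptions and not part of the original work; in practice a library SVD routine would normally be used instead.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def top_eigenpairs(C, m, iters=500):
    """Leading m eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    C = [row[:] for row in C]
    p = len(C)
    pairs = []
    for k in range(m):
        v = [1.0 / (i + 1 + k) for i in range(p)]  # arbitrary non-zero start vector
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(p)) for i in range(p))
        pairs.append((lam, v))
        for i in range(p):  # deflate: remove the component just found
            for j in range(p):
                C[i][j] -= lam * v[i] * v[j]
    return pairs

def biplot_coordinates(X, m=2):
    """Return (U_M S_M^0.5, V_M S_M^0.5) for a centered data matrix X."""
    pairs = top_eigenpairs(matmul(transpose(X), X), m)  # eigenvalues = squared singular values
    s = [math.sqrt(lam) for lam, _ in pairs]
    V = transpose([v for _, v in pairs])                 # P x m right singular vectors
    XV = matmul(X, V)                                    # equals U_M S_M
    samples = [[row[k] / math.sqrt(s[k]) for k in range(m)] for row in XV]
    features = [[row[k] * math.sqrt(s[k]) for k in range(m)] for row in V]
    return samples, features

# Toy centered matrix: the third feature is the sum of the first two (rank 2).
X = [[-2.0, 1.0, -1.0], [-1.0, -1.0, -2.0], [0.0, 2.0, 2.0],
     [0.0, -2.0, -2.0], [1.0, 1.0, 2.0], [2.0, -1.0, 1.0]]
samples, features = biplot_coordinates(X, m=2)
```

Because the toy matrix has rank two, multiplying the sample coordinates by the transposed feature coordinates recovers `X`, which is the property that lets a biplot show samples (points) and features (arrows) in the same 2D plane.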

Uniform Manifold Approximation and Projection (UMAP)
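Given the high-dimensional matrix $X$ and the Euclidean distance function $d(\cdot, \cdot) \in \mathbb{R}^{+}$, UMAP aims to find a low-dimensional embedding $Z \in \mathbb{R}^{N \times M}$ that preserves both global and local neighborhoods from $X$, promoting the main non-linear data relationships. First, a K-nearest neighbor (KNN)-based graph is built based on a local metric, yielding:

$$\theta_n = \min_{n' \in \Omega_n} d(x_n, x_{n'}) \tag{2}$$

where $\theta_n \in \mathbb{R}^{+}$ holds the minimum distance within the $n$-th neighborhood $\Omega_n$ with $K$ neighbors $x_{n'}$ centered on $x_n$. A localized entropy value $\sigma_n \in \mathbb{R}^{+}$ is computed by solving:

$$\sum_{n' \in \Omega_n} \exp\left( -\frac{\max\left(0, d(x_n, x_{n'}) - \theta_n\right)}{\sigma_n} \right) = \log_2(K) \tag{3}$$

Afterward, UMAP constructs a fuzzy simplicial complex, representing the high-dimensional graph $G = (X, W)$, where edges are defined by local connectivity through the weights in $W \in [0, 1]^{N \times N}$:

$$w_{nn'} = \exp\left( -\frac{\max\left(0, d(x_n, x_{n'}) - \theta_n\right)}{\sigma_n} \right), \quad n' \in \Omega_n \tag{4}$$

Likewise, a low-dimensional weight matrix $\tilde{W} \in [0, 1]^{N \times N}$ is computed as:

$$\tilde{w}_{nn'} = \left( 1 + \alpha \left\| z_n - z_{n'} \right\|_2^{2\iota} \right)^{-1} \tag{5}$$

where $z_n, z_{n'} \in Z$ and $\alpha, \iota \in \mathbb{R}^{+}$ adjust the preservation of local and global structures (typically set to 1). Therefore, we can formulate UMAP's optimization problem, based on the cross-entropy loss, as follows:

$$Z^{*} = \arg\min_{Z} \sum_{n \neq n'} w_{nn'} \log\left( \frac{w_{nn'}}{\tilde{w}_{nn'}(Z)} \right) + \left(1 - w_{nn'}\right) \log\left( \frac{1 - w_{nn'}}{1 - \tilde{w}_{nn'}(Z)} \right) \tag{6}$$

where the notation $\tilde{w}_{nn'}(Z)$ highlights the dependency between $Z$ and its graph weights in $\tilde{W}$. It is worth mentioning that the optimization problem in Equation (6) balances attraction (first term) and repulsion (second term) forces based on the discrepancies between probabilities, i.e., graph weights. Moreover, it can be solved through gradient-descent-based approaches. Algorithm 1 outlines the main UMAP stages.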
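Algorithm 1. Main UMAP stages.
Require: Input matrix X ∈ R N×P, number of neighbors K, embedding dimension M.
1: Build the KNN graph and compute θ n and σ n as in Equations (2) and (3).
2: Compute the high-dimensional graph weights W as in Equation (4).
3: Compute the low-dimensional graph weights W as in Equation (5).
4: Optimize the embedding space Z by solving Equation (6) through gradient descent.
5: return Low-dimensional feature space Z ∈ R N×M , M ≤ P.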
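The graph-weight stages of this section can be sketched in a few lines of plain Python. The block below is a minimal, hedged illustration under the standard UMAP formulation: it finds the nearest-neighbor distance θ n, solves for σ n by bisection so that the neighborhood weights sum to log2(K), and evaluates the low-dimensional weight and the per-pair cross-entropy. The function names, the bisection bounds, and the toy points are assumptions made for illustration; they are not taken from the authors' code, and the gradient-descent optimization of Z is omitted.

```python
import math

def neighbor_distances(X, n, K):
    """Sorted Euclidean distances from X[n] to its K nearest neighbors (self excluded)."""
    d = sorted(math.dist(X[n], x) for i, x in enumerate(X) if i != n)
    return d[:K]

def local_scale(dists, iters=64):
    """Return (theta_n, sigma_n) with sum_k exp(-max(0, d_k - theta_n) / sigma_n) = log2(K)."""
    theta = dists[0]                       # distance to the closest neighbor
    target = math.log2(len(dists))
    lo, hi = 1e-12, 1e6                    # assumed search bounds for sigma_n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(math.exp(-max(0.0, d - theta) / mid) for d in dists)
        if total > target:                 # the sum grows with sigma, so shrink it
            hi = mid
        else:
            lo = mid
    return theta, 0.5 * (lo + hi)

def high_dim_weights(dists, theta, sigma):
    """High-dimensional edge weights for one neighborhood."""
    return [math.exp(-max(0.0, d - theta) / sigma) for d in dists]

def low_dim_weight(z_a, z_b, alpha=1.0, iota=1.0):
    """Low-dimensional weight between two embedded points."""
    return 1.0 / (1.0 + alpha * math.dist(z_a, z_b) ** (2.0 * iota))

def pair_cross_entropy(w, w_low, eps=1e-12):
    """Attraction plus repulsion term of the cross-entropy loss for one pair."""
    w_low = min(max(w_low, eps), 1.0 - eps)
    attract = w * math.log(max(w, eps) / w_low)
    repel = (1.0 - w) * math.log(max(1.0 - w, eps) / (1.0 - w_low))
    return attract + repel

# Toy usage on five 2D points with K = 3 neighbors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
dists = neighbor_distances(points, 0, K=3)
theta, sigma = local_scale(dists)
weights = high_dim_weights(dists, theta, sigma)
```

The closest neighbor always receives weight one, which reflects the local-connectivity assumption, while the remaining weights decay with distance at a rate set by σ n.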

UMAP-Based Local Biplot
We propose an explicit mapping between linear and non-linear 2D spaces to extend the concept of the classical SVD-based biplot to the analysis of localities and explore the internal non-linear data relationships.In particular, we introduce a twofold UMAP-based Local Biplot.First, we compute a non-linear embedding based on UMAP and further sample clustering.Second, a local SVD computation on each data cluster and an affine transformation for 2D visualization on the UMAP feature space are calculated.
Thereby, given an input matrix $X$, a 2D low-dimensional space $Z$ is computed based on the UMAP algorithm (see Section 2.2). Then, instead of directly clustering the points in the original feature space, our approach clusters the latent feature space. This involves partitioning the data into $R$ disjoint sets $\{\tilde{Z}_r \in \mathbb{R}^{N_r \times 2}\}_{r=1}^{R}$, where each cluster is represented by the centroid $\mu_r \in \mathbb{R}^{2}$ and $\sum_{r=1}^{R} N_r = N$, $R \le N$. To this end, the well-known K-means clustering algorithm is applied by solving [28]:

$$\min_{\{\tilde{Z}_r, \mu_r\}_{r=1}^{R}} \sum_{r=1}^{R} \sum_{z_n \in \tilde{Z}_r} \left\| z_n - \mu_r \right\|_2^2 \tag{7}$$

Next, for a given cluster $\tilde{Z}_r$ and its corresponding high-dimensional samples in $X_r \in \mathbb{R}^{N_r \times P}$, a 2D SVD-based decomposition is carried out as $X_r = \tilde{U}_r \tilde{S}_r \tilde{V}_r^{\top}$, where $\tilde{U}_r \in \mathbb{R}^{N_r \times 2}$ and $\tilde{V}_r \in \mathbb{R}^{P \times 2}$ gather the left and right orthonormal bases associated with the two highest singular values in the diagonal matrix $\tilde{S}_r \in \mathbb{R}^{2 \times 2}$. Then, the linear projection for the $r$-th cluster is computed as $\hat{Z}_r = X_r B_r$, where $B_r = \tilde{V}_r \tilde{S}_r^{0.5} \in \mathbb{R}^{P \times 2}$. In turn, to build a unified visualization, we implement cluster-based affine transformations that align and jointly display both the non-linear data relationships from the UMAP embedding in $Z$ and the localized input feature-based basis in $B_r$. Namely, the matched basis matrix $\hat{B}_r \in \mathbb{R}^{P \times 2}$ is written as $\hat{B}_r = \gamma_r B_r + \nu_r$, where $\gamma_r, \nu_r \in \mathbb{R}$ encode a composition of rotation, dilation, shear, and translation-based linear functions [29] and are obtained by solving the optimization problem in Equation (8), in which $\hat{B}_r(\gamma_r, \nu_r)$ describes the dependency of $\hat{B}_r$ on the affine transformation parameters. A Nelder-Mead simplex algorithm can be applied to solve Equation (8). Lastly, a localized feature ranking vector $\lambda_r \in \mathbb{R}^{P}$ can be computed as in Equation (9), where $\mathbf{1}$ is an all-ones vector of proper size.
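To make the clustering step on the 2D embedding concrete, the following sketch runs plain Lloyd iterations, the usual heuristic for the K-means objective. It is a minimal illustration under stated assumptions: the function name `kmeans`, the random initialization from data points, and the toy embedding are hypothetical, and the sketch is not the implementation used in the experiments.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(Z, R, iters=100, seed=0):
    """Lloyd iterations on embedded points Z; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [tuple(z) for z in rng.sample(Z, R)]  # assumed init: R random data points
    labels = [0] * len(Z)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(R), key=lambda r: sq_dist(z, centroids[r])) for z in Z]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for r in range(R):
            members = [z for z, lab in zip(Z, labels) if lab == r]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[r])          # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(sq_dist(z, centroids[lab]) for z, lab in zip(Z, labels))
    return labels, centroids, inertia

# Toy 2D "embedding" with two well-separated groups.
Z = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, inertia = kmeans(Z, R=2)
```

Each resulting label set indexes the rows of X that form X r, on which the local 2D SVD and the affine matching are then computed.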

Tested Datasets
The tested datasets and critical experimental settings are detailed below.

Multivariate Gaussians
We generated a synthetic input feature matrix by randomly sampling three clouds, each containing 500 points (N = 1500) and five features (P = 5). Each cloud holds samples from a multivariate Gaussian, and each feature is within the range [0, 1].
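A dataset with this shape can be sketched as follows. The means, the random mixing matrices that induce a different covariance per cloud, the seed, and the function name `make_gaussian_clouds` are all assumptions made for illustration, since the exact sampling parameters are not stated here.

```python
import random

def make_gaussian_clouds(n_per_cloud=500, n_features=5, n_clouds=3, seed=0):
    """Sample Gaussian clouds and min-max scale every feature to [0, 1]."""
    rng = random.Random(seed)
    rows, labels = [], []
    for c in range(n_clouds):
        mean = [rng.uniform(-5.0, 5.0) for _ in range(n_features)]
        # Mixing matrix A: a sample is mean + A e with e ~ N(0, I), so its covariance is A A^T.
        A = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)] for _ in range(n_features)]
        for _ in range(n_per_cloud):
            e = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            rows.append([m + sum(a * x for a, x in zip(a_row, e))
                         for m, a_row in zip(mean, A)])
            labels.append(c)
    cols = list(zip(*rows))
    lo = [min(col) for col in cols]
    hi = [max(col) for col in cols]
    X = [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)] for row in rows]
    return X, labels

X, cloud = make_gaussian_clouds()
```

Giving each cloud its own mixing matrix means the feature correlations differ between clouds, which is the kind of localized structure a single global biplot tends to average out.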

Forage Grasses
The publicly available real-world dataset collected by Ghent University and the Research Institute for Agriculture, Fisheries, and Food (ILVO), Belgium, provided in [30], is used to evaluate our approach. This database comprises 35 distinct VIs from five color spaces: RGB, CIE 1976 L*a*b* (CIELab), CIE 1976 L*u*v* (CIELuv), hue-saturation-value (HSV), and hue-saturation-lightness (HSL), for three categories of forage grass: Festuca arundinacea (Fa), diploid Lolium perenne (Lp2n), and tetraploid Lolium perenne (Lp4n). The dataset aims to identify drought-tolerant genotypes; the VIs are listed in Table 1. From the thermal data, ∆T and the crop water stress index (CWSI) were calculated. Additionally, a breeder score is provided for three distinct dates designated as T2, T4, and T5. The score ranges from one to nine, based on both biomass quantity and the verdant hue of the plant. The surface temperature in °C was calculated per plant for each flight day [30]. In total, P = 37 features and N = 3174 samples are obtained.

Index | Description | Formula
GCC | Green Chromatic Coordinate Index [31] | G/(R+G+B)
BCC | Blue Chromatic Coordinate Index [31] | B/(R+G+B)
GBVI | Green Blue Vegetation Index [35,36] | (G−B)/(G+B)
BRVI | Blue Red Vegetation Index [30] |

RiceClimaRemote
The Tolima region of Colombia hosted the RiceClimaRemote research project. It was a collaboration between ILVO, the Universidad de Ibagué, and Agrosavia. The project focused on developing and implementing irrigation strategies for rice cultivation. Its goal was to identify methods that were best suited to the region's climate change conditions while still maintaining crop productivity. To achieve this, the project utilized technological tools and data analysis to monitor spatiotemporal variability at the sub-plot level. Field trials were conducted on a one-hectare plot cultivated with the Fedearroz 67 rice variety (Oryza sativa L.) at the Nataima Research Centre of Agrosavia. The research center is in the Espinal municipality of the Tolima region, Colombia (see Figure 2). Trials were conducted in two cycles, during the second semester of 2021 and the first semester of 2022. Three different irrigation techniques were established: multiple inlet rice irrigation (MIRI) [43], alternate wetting and drying (AWD) [44], and conventional flooded irrigation (CONTROL). The experimental area was divided into three strip plots, which enabled the analysis of each treatment. The multi-temporal image acquisition stage was conducted during sunny and cloudless weather conditions to monitor the crop status. During both the vegetative and reproductive stages, flights were executed biweekly, whereas during the ripening stage, flights took place weekly. RGB images were collected and aligned with multispectral images. Table 2 presents the multispectral and RGB indices obtained. In addition, to monitor the water status of the rice crop, various physiological parameters were measured. Gas exchange parameters, including stomatal conductance (Gs), net photosynthesis rate (Pn), intercellular CO2 concentration (Ci), and transpiration rate (E), were measured. Additionally, plant samples from a defined area were collected, and the leaf area index was indirectly determined by measuring the fresh and dry weight of a known leaf area. Furthermore, the equivalent water thickness (EWT) was calculated. Canopy water content (CWC) was then calculated using EWT and the leaf area index (LAI). Additionally, the photochemical reflectance index (PRI) was determined using a proximal sensor. In summary, an input feature matrix with P = 22 features and N = 768 samples is collected.

Training Details, Assessment, and Method Comparison
The baseline SVD-based biplot and our Local Biplot are tested to identify and visualize relevant variables and samples from input features in $X$. Moreover, we compute the Pearson correlation $\varrho_{pp'} \in [-1, 1]$ between features as follows:

$$\varrho_{pp'} = \frac{\xi_p^{\top} \xi_{p'}}{\left\| \xi_p \right\|_2 \left\| \xi_{p'} \right\|_2} \tag{10}$$

where $\xi_p \in \mathbb{R}^{N}$ is the $p$-th centered column of $X$. The Local Biplot-based correlation is computed for a given matched basis matrix by replacing $\xi_p$ with the $p$-th row $\hat{b}_p \in \mathbb{R}^{2}$ of $\hat{B}_r$ in Equation (10). The feature relevance is also computed as in Equation (9). The latter aims to compare the linear relationships among input features against our Local Biplot-based enhancement, which encodes non-linear dependencies. The number of groups R is fixed as three, four, and five for the Multivariate Gaussians, Forage Grasses, and RiceClimaRemote datasets, respectively.
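The two correlation views compared here can be illustrated with a short sketch. It assumes that the feature-level score is the ordinary Pearson correlation and that the basis-level score is the normalized inner product between two rows of a 2D basis matrix, i.e., the cosine of the angle between the corresponding biplot arrows; the function names and toy vectors are hypothetical.

```python
import math

def normalized_inner_product(a, b):
    """Cosine of the angle between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def pearson(a, b):
    """Pearson correlation: the normalized inner product of the centered vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return normalized_inner_product([x - ma for x in a], [y - mb for y in b])

def abs_correlation_matrix(vectors, score):
    """Absolute pairwise scores, as displayed in the correlation panels."""
    return [[abs(score(u, v)) for v in vectors] for u in vectors]

# Feature-level view: each vector is one feature column over the samples.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
feature_corr = abs_correlation_matrix(columns, pearson)

# Basis-level view: each vector is one arrow (row of a P x 2 basis matrix).
arrows = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
arrow_corr = abs_correlation_matrix(arrows, normalized_inner_product)
```

For centered feature columns the two functions coincide, which is why the same expression can be reused on basis rows without re-centering them.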
Further, the LR and RF algorithms are used to predict the breeding score (Forage Grasses) and CWC (RiceClimaRemote). The goal is to train two regression models using the complete dataset and our Local Biplot framework to study non-linear and non-stationary behaviors in PA tasks. Next, to quantitatively assess the predictive performance on unseen samples, the coefficient of determination (Ȓ2) is reported on the testing set within a five-fold cross-validation scheme. The Ȓ2 is defined as [52]:

$$Ȓ^{2} = 1 - \frac{\sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2}{\sum_{n=1}^{N} \left( y_n - \bar{y} \right)^2} \tag{11}$$

where $y, \hat{y} \in \mathbb{R}^{N}$ gather the ground-truth and predicted outputs, respectively, and $\bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n$. A grid-search approach optimizes the hyperparameters of the LR and RF algorithms to sidestep overfitting. For the RF model, we tested different values for the number of trees {5, 10, 50, 200} and the maximum number of levels in each decision tree {5, 10, 50, 200}. In the case of LR, only the intercept parameter was tuned. All experiments were conducted in Python 3.10.12, with the scikit-learn 1.4.2 API, in a Google Colaboratory environment. Our Python codes are publicly available at [53] (accessed on 21 March 2024). Regarding the Forage Grasses database, we use the publicly available data from [30] (accessed on 19 December 2023). The RiceClimaRemote dataset is not available to the public due to privacy considerations.
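The evaluation protocol can be sketched without external libraries as follows. This is a hedged, minimal illustration: it implements the coefficient of determination and a shuffled five-fold split, and uses a one-feature least-squares line as a stand-in regressor. The shuffling, the seeds, the helper names, and the synthetic data are assumptions; the reported experiments rely on scikit-learn models and grid search, which are not reproduced here.

```python
import random
import statistics

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def fit_line(x, y):
    """Closed-form least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Synthetic one-feature regression problem (illustrative only).
rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [2.0 * a + 1.0 + rng.gauss(0.0, 0.5) for a in x]

scores = []
for train_idx, test_idx in kfold_indices(len(x), k=5):
    slope, intercept = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
    predictions = [slope * x[i] + intercept for i in test_idx]
    scores.append(r2_score([y[i] for i in test_idx], predictions))

summary = (statistics.fmean(scores), statistics.stdev(scores))  # average, standard deviation
```

Reporting the average and standard deviation over the five test folds matches the "average ± standard deviation" format used in the result tables.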

Multivariate Gaussians Results
We initially conducted a controlled experiment to evaluate the feasibility of our Local Biplot on synthetic data. Figure 3 displays a traditional SVD-based biplot next to our proposal. We represent features as arrows and depict observations as data points. Both projections are normalized between 0 and 1 for ease of interpretation and visual comparison. Furthermore, in Figure 4 (first row), a panel of input features with absolute Pearson correlation matrices showcases values for both the complete database and each cluster. The second row depicts the same analysis employing Local Biplot.

Forage Grasses Dataset Results
Figure 5 shows the visual inspection results on the Forage Grasses dataset. The basis is depicted as arrows over each projection. The 2D projections are also colored by breeding score to provide further insights. The principal components in the projections have been scaled to a range between 0 and 1 for easier interpretation and visual comparison. Then, we computed the absolute Pearson correlation between each of the 37 indices and the breeding score. This is shown in Figures 6 and 7 for the SVD-based biplot baseline and our Local Biplot. Each panel displays correlations for all species (ALL V) and for each of the three species (FA, Lp2n, and Lp4n) across all dates, taking into account both the original input features and clustered samples. We also establish correlations for each cluster separately and throughout the dataset. Table 3 presents the Ȓ2 value for breeding score estimation using RF and LR models. These models were trained using the color space and RGB-based VIs from the grass dataset for all data, as well as for each cluster obtained with our Local Biplot. Figure 8 displays the input feature relevance analysis for the SVD-based biplot, our Local Biplot approach, and the regressor weights (LR and RF). Cluster-based relevance is also provided. For clarity, feature relevance is scaled between 0 and 1 using a min-max scaler [52].

Table 3. Forage Grasses breeding score prediction results. Regression performance (average ± standard deviation) regarding the Ȓ2 is computed for all data and for each cluster as provided by our Local Biplot framework (see Figure 5). Cluster size is also depicted. Each cluster header is color-coded and ordered from highest to lowest Ȓ2.

RiceClimaRemote Dataset Results
Figure 9 shows the visual inspection results on the RiceClimaRemote dataset. The basis (arrows) is drawn over each projection. The 2D projections colored by CWC are also shown. We present the absolute Pearson correlation between each of the 21 variables and the CWC (see Figures 10 and 11). Each panel displays correlations for all irrigation treatments (ALL T) and for each of the three irrigation treatments (MIRI, AWD, CONTROL) across all dates, taking into account both the original input features and clustered samples. Next, Table 4 presents the Ȓ2 values for CWC estimation. These models were trained using physiological parameters and multispectral and RGB-based VIs for all data and for each cluster. Figure 12 displays the normalized input feature relevance analysis.
Table 4. RiceClimaRemote CWC prediction results. Regression performance (average ± standard deviation) regarding the Ȓ2 is computed for all data and for each cluster (see Figure 9). Cluster size is also depicted. Each cluster header is color-coded and ordered from highest to lowest Ȓ2.


Discussion
We introduced Local Biplot, a methodological framework designed to visually identify meaningful data patterns within localized contexts over multi-temporal crop data, particularly focusing on water status analysis. Our approach effectively captures data complexity and non-stationarity, enabling the identification and transformation of significant clusters within a common biplot framework for feature-sample contributions.
The results demonstrate that Local Biplot outperforms the traditional SVD-based biplot in identifying and preserving local structures. For instance, in the synthetic dataset, it is clear that the SVD-based embedding depicted in Figure 3 effectively separates the synthetic observations along both principal components. Variables contributing to PC2 have a significant influence on distinguishing the clusters. However, although the classical biplot effectively distinguishes the generated structures, it shows shortcomings in providing explicit insights into the influence of each feature on the respective point clouds. Our local-based biplot method, on the other hand, focuses on capturing local structures and non-linear relationships in the data. It effectively shows the difference between the structures in the artificially created group samples. The data present a large variation on both axes, and the representation of each variable suggests that the discriminant information may vary between local-based analyses. For instance, while f4 and f5 remain correlated, this correlation breaks in cluster 2.
Pattern variations are evident, as is the correlation change in the Local Biplot embedding. These discrepancies in correlation patterns correspond to different sample subsets produced by multivariate Gaussian distributions. We attribute this success to the combined use of UMAP, clustering, and local SVD decomposition, which preserves both local and global structures, thereby enhancing the ability to capture non-stationary patterns and non-linearities in the input space. In turn, the Pearson correlation values for the Multivariate Gaussians dataset reveal pattern changes over the entire dataset and within clusters, as shown in Figure 4. The initial clustered data had modest dependencies, but Local Biplot-based correlations enhance them. In cluster 3, the correlation between variables f2 and f4 declined dramatically, while in cluster 1, it increased. Thus, our technique correctly recognized linear and non-linear sample relationships.
For the analysis of the Forage Grasses dataset, Local Biplot's finer resolution helped show that differences between clusters were consistent (see Figure 5). This highlighted the role of visual-based indices in revealing these patterns and suggested potential sources of multicollinearity among indices from various color spaces. For example, the visual-based indices exhibit strong correlation and align with both the left cluster and the score. Thus, the visual appearance seems to play a crucial role in defining and separating the left cluster and the PC2 axis. The PC2 axis, instead, seems to be highly correlated with CIVE, CWSI, dT, a*, and G-R. The right cluster displayed higher values on PC1 but exhibited considerable variation across varieties on PC2, suggesting a diverse range of characteristics within this group. Similarly, the left cluster showed variation in PC1. Notably, a relationship exists between both PCs: higher values on PC1 (associated with greener plants) correspond to higher values on PC2 for both clusters.
Further insights were obtained by calculating the absolute Pearson correlation between 37 indices and breeder score (see Figure 6). The high correlation value between the score and the visual-based indices in every cluster emphasizes the RGB color space's crucial significance in revealing data variability. The consistent patterns of variation depicted by the aligned arrows hint at potential sources of multicollinearity. The lengths of arrows in the biplot analysis indicate that cluster 1 (dark blue) prioritizes variables like VARI, MGRVI, and ExR, while cluster 2 (cyan) focuses on G-R, u*, and uv. Clusters 3 (yellow) and 4 (brown), on the other hand, place higher importance on G/R, GRVI, MGRVI, VARI, and ExR (demonstrating similar values). Furthermore, CWSI holds less significance in cluster 1, and BRVI is less important in cluster 2. Similarly, WI shows lesser significance for clusters 3 and 4.
Additionally, Local Biplot-based correlations in Figure 7 report significant insights into the relationship between VIs, cluster groups, and breeder scores. Each correlation panel, spanning all species (ALL V) and specific species (FA, Lp2n, Lp4n), provided a comprehensive view of feature dependencies considering both the original input data and clustered data. Interestingly, the analysis revealed lower correlations between breeder scores and certain VIs such as R, G, B, RCC, ExR, CIVE, a*, ab, u*, and uv across all clusters compared to the complete dataset. Nonetheless, some visual-based indices like GCC, ExG2, ExGR, GRVI, and G/R showed no clear cluster effect on the linear regressions with breeder scores, resulting in similar correlations across all species except for Lp4n in cluster 2. Moreover, H, NDLAB, and NDLuv exhibited consistent patterns with high correlations in clusters 2, 3, and 4. Notably, Lp2n in cluster 1 and Lp4n in cluster 3 demonstrated greater variation in correlations, mirroring the trends seen across all varieties in cluster 4. These findings are consistent with the relevance bars shown in Figure 8. Furthermore, the reported behaviors highlight the complex interplay between VIs, breeder scores, and genetic or environmental factors, underscoring the importance of detailed and contextual analysis for a comprehensive understanding of drought tolerance in the studied grass species.
Regarding the breeding score prediction, both the LR and RF models generally show similar Ȓ2 values across clusters and the entire dataset (see Table 3). Thus, both models perform comparably in terms of explaining the variability in the target variable based on the input features. Moreover, we observe that high Ȓ2 values are reported for the entire dataset compared to individual clusters. However, cluster 3 has the highest Ȓ2, indicating better predictive performance. Similarly, clusters 2 and 4 exhibit similar predictions, indicating comparable model performance in capturing variability in the target variable within these clusters. In contrast, cluster 1 consistently presents the lowest performance, suggesting potential challenges in model performance. It is worth noting that despite cluster 1 having a larger cluster size, the model struggles with the imbalance in the target values, as seen in Figure 5. In addition, our regression models on all the data outperform the approach presented in ref. [30], where Ȓ2 values were reported between the breeder score and individual VIs. Figure 8 shows the normalized feature relevance results. As shown, for the complete dataset, the SVD-based biplot provides higher relevance values for all variables than the LR and RF relevance values. Note that the RF requires fewer features to achieve similar performance as the LR.
In turn, the examination of the RiceClimaRemote dataset revealed significant complexity and variability (see Figure 9). The SVD-based biplot analysis reveals that although PC1 and PC2 capture much of the data's variability, the sample points are more dispersed compared to the local-based biplot. GGA, R, B, G, and the highly correlated multispectral indices SR, NDVI, GNDVI, NDRE, and GVI primarily compose PC1. Similarly, PC2 is mainly associated with the NIR band, OSAVI, SAVI, Red Edge, and PRI. This structure suggests a complex data arrangement, with variables exhibiting a significant degree of variability, possibly indicating multiple subgroups or non-stationary patterns within the main group. In contrast, in the local-based biplot, the preservation of local structures leads to the formation of tighter clusters, highlighting subgroups. The resulting 2D non-linear projection (scaled to a range between 0 and 1) captures much of the large-scale global structure. Still, it also preserves the important local structure of the dataset, resulting in five tightly clustered groups. Furthermore, it is notable that multispectral indices such as SR, NDVI, GNDVI, NDRE, and GVI remain highly correlated across all clusters. Additionally, the embedding space underscores the temporal influence on the relationships between features, with distinct contours aligning with different rice growth stages. This temporal dimension is crucial for understanding seasonal variations and other time-dependent factors impacting the rice fields.
Correlation analyses (Figures 10 and 11) show significant positive correlations between CWC and physiological measurements like Pn and Gs, while Ci's correlation with CWC varies across clusters. Cluster-specific variations indicate that local data structures significantly influence these relationships, which are not uniformly captured in global analyses. The consistency of correlations among multispectral indices across all clusters suggests robust relationships that persist despite local variations. To understand the interactions among the features within the clusters obtained using our Local Biplot, Figure 11 displays their relationships with CWC. This analysis revealed variations in the correlations, with some increasing and others decreasing. In fact, the Ȓ2 measures for CWC estimation presented in Table 4 reveal that cluster 1 (brown) exhibits the highest predictive accuracy for both LR (Ȓ2 = 0.65) and RF (Ȓ2 = 0.67) regressors, emphasizing the importance of local structures in improving model performance. Interestingly, cluster size does not directly correlate with Ȓ2 values, highlighting the complexity of the data and the influence of sample diversity on model performance. For example, cluster 5's (dark blue) small size hinders the identification of robust data patterns, leading to poor performance and contributing to problems of reproducibility.
Finally, the relevance analysis (Figure 12) indicates that fewer variables are needed for accurate CWC predictions in LR and RF models compared to SVD. The selection of significant spectral bands varies across models, with RF identifying a broader range of important features, contributing to higher prediction accuracy. The variability in feature importance across clusters further demonstrates the heterogeneous nature of the data and the necessity of tailored analysis approaches for different subgroups. These changes can be attributed to the clustering of the UMAP projection within the Local Biplot, which effectively captures the local structure and non-linear relationships. Our framework isolates subsets of points that share similar characteristics.
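The per-cluster step can be sketched as a local SVD inside each cluster followed by a least-squares affine map that places the local loadings in the shared 2D embedding. This is a simplified illustration under stated assumptions rather than the reference implementation: the embedding `Y` and the `labels` are synthetic stand-ins for a UMAP projection and its clustering, and the function name `local_biplot_arrows` is hypothetical.

```python
import numpy as np

def local_biplot_arrows(X, Y, labels):
    """For each cluster, fit a local SVD and an affine map into the 2D embedding.

    Returns {cluster: (origin, arrows)}, where `origin` is the arrow anchor in
    the embedding and `arrows` holds one 2D direction per input feature.
    """
    out = {}
    for k in np.unique(labels):
        Xc, Yc = X[labels == k], Y[labels == k]
        Zc = Xc - Xc.mean(axis=0)                  # center within the cluster
        U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
        local_scores = U[:, :2] * s[:2]            # local PC scores, shape (n_c, 2)
        # Least-squares affine map: Yc ~ local_scores @ A + b
        design = np.column_stack([local_scores, np.ones(len(local_scores))])
        params, *_ = np.linalg.lstsq(design, Yc, rcond=None)
        A, b = params[:2], params[2]
        arrows = Vt[:2].T @ A                      # feature loadings in the embedding
        out[int(k)] = (b, arrows)
    return out

rng = np.random.default_rng(2)
# Synthetic stand-ins (assumptions): two clusters with different covariance structures.
X0 = rng.normal(size=(100, 4)) * np.array([3.0, 1.0, 0.5, 0.2])
X1 = rng.normal(size=(100, 4)) * np.array([0.2, 0.5, 1.0, 3.0]) + 5.0
X = np.vstack([X0, X1])
labels = np.repeat([0, 1], 100)
# Hypothetical 2D embedding; a real run would use the UMAP output instead.
Y = np.vstack([0.25 + 0.05 * X0[:, :2] / 3.0,
               0.75 + 0.05 * (X1[:, 2:] - 5.0) / 3.0])

result = local_biplot_arrows(X, Y, labels)
for k, (origin, arrows) in result.items():
    print("cluster", k, "anchor", origin.round(3), "arrows shape", arrows.shape)
```

Because the local scores are centered, the fitted offset coincides with the cluster centroid in the embedding, so each cluster's feature arrows are drawn from its own center within the shared axis pair.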

Conclusions
We introduced a methodological framework termed Local Biplot to discern meaningful data patterns within localized contexts, specifically focusing on water status analysis in crops. Local Biplot captures non-linear and non-stationary data relationships, allowing us to identify significant clusters for transformation and projection within a shared biplot. We applied a local affine transformation to map the input feature variability-based rankings to the 2D low-dimensional space, providing a complete picture of the local relationships between variables and samples. As a result, the framework shows the contributions of features and observations within each cluster in the same figure. This makes it easier to understand how data change over time and helps with evaluating variables related to crop water status. We tested our approach using both synthetic and real-world databases, including structured data from grass and rice crops. Our results show that Local Biplot outperforms the traditional SVD-based biplot in finding and preserving local structures. For example, in the synthetic dataset, our method accurately identified the distinct covariance structures of the three artificially generated point clouds. We attribute this success to the combined use of UMAP, clustering, and local SVD decomposition, which preserves both local and global structures, enhancing the ability to capture non-stationary patterns and non-linearities in the input space. Furthermore, the method's application to the Forage Grasses and RiceClimaRemote datasets has highlighted the utility of visual-based indices and the significant impact of temporal and treatment variations on the data. Our findings emphasize the importance of considering local structures and non-linear relationships in data-driven precision agriculture.
As future work, extending Local Biplot into a deep learning approach, as demonstrated by the model introduced in ref. [54], is a promising research direction to further improve predictive modeling accuracy and robustness. Our next step is to broaden the research to other crops and geographical regions to evaluate the generalizability of the findings [55]. Different crops may exhibit unique data patterns and responses to environmental factors, necessitating tailored approaches for precision agriculture [14]. Additionally, collaborating with agricultural practitioners and stakeholders will help validate the effectiveness of the proposed approaches in other practical settings.

Figure 3. Multivariate Gaussians dataset visual inspection results. Left: SVD-based biplot. Right: Local Biplot (ours). Gray arrows depict each feature in the dataset (f1-f5) and shed light on their correlations. Examining the scatter points and their colors allows us to visually understand sample distributions. PC stands for principal component (basis).

Figure 5. Forage Grasses biplots. Left: SVD-based biplot. Middle: Local Biplot. Right: Local Biplot and cluster-based probability boundaries. The colors in the left and middle plots represent the clustering label. The right plot's color emphasizes the target variable (breeding score), while the flight dates (TF1, TF2, and TF3) determine the color of the curves. PC: principal component (basis).

Figure 6. Forage Grasses Pearson correlation results: SVD-based biplot. We show the absolute correlation between the VIs and the breeding score (target) for each species individually and collectively. We also establish correlations for each cluster separately and throughout the dataset.

Figure 7. Forage Grasses Pearson correlation results: Local Biplot (ours). We show the absolute correlation between the VIs and the breeding score (target) for each species individually and collectively. We also establish correlations for each cluster separately and throughout the dataset.

Figure 8. Forage Grasses feature relevance analysis. SVD-based biplot and Local Biplot (ours) normalized feature relevance are presented. We also show the LR and RF regressor weights. The bar color in the second column stands for the Local Biplot cluster labels (see Figure 5).

Figure 9. RiceClimaRemote biplot results. Left: SVD-based biplot. Middle: Local Biplot. Right: Local Biplot and cluster-based probability boundaries. The colors in the left and middle plots represent the clustering label. The right plot's color emphasizes the target variable (CWC), while the growth stages of rice (vegetative, reproductive, and ripening) determine the color of the curves. PC: principal component (basis).

Figure 12. RiceClimaRemote feature relevance analysis. SVD-based biplot and Local Biplot (ours) normalized feature relevance are presented. We also present the LR and RF regressor weights. The bar color in the second column stands for the Local Biplot cluster labels (see Figure 9).

Table 1. Color space and vegetation indices employed for the Forage Grasses dataset.
R R+G+B