Spatial multivariate selection of climate indices for precipitation over India

Large-scale interdependent teleconnections influence precipitation at various spatio-temporal scales, so selecting the relevant climate indices for a given geographical location is important. This study therefore focuses on the spatial multivariate selection of climate indices influencing precipitation variability over India, using the partial least square regression and variable importance of projection technique. Seventeen climate indices and a gridded precipitation dataset (0.25° × 0.25°) from the India Meteorological Department for 1951–2020 at a monthly scale are considered. Results show that among all the indices, Nino 4, Nino 1 + 2, Trans Nino Index, Atlantic Multidecadal Oscillation (AMO), quasi-biennial oscillation (QBO), Arctic oscillation (AO), and North Atlantic Oscillation (NAO) have a significant influence on precipitation over India. Further, within the homogenous regions, the Southern Oscillation Index and Nino 3.4 are selected more often in the South Peninsular region than in other regions. The NAO and AO show a similar pattern and are found to be relevant in the Northeast region (>89%). The AMO is selected mainly in the Northwest and West Central regions (>80%), and the AMO and QBO are selected at about 70% of grid locations over Central Northeast India. The number of climate indices identified varies spatially across the study region. Overall, the study highlights that identifying the relevant climate indices would aid in developing improved, parsimonious predictive models for agricultural planning and water resources management.


Introduction
Spatio-temporal variations in monsoon precipitation are mainly driven by geographic features, topography, climate change and variability. In India in particular, the Western Ghats and Northeast India receive a large amount of precipitation (>200 cm), central and Southern Peninsular India receive between 60 and 100 cm, and Northwest India receives <50 cm (Rajeevan et al 2000, Kishore et al 2016). Studies show a non-uniform distribution and an increase/decrease in the frequency and intensity of precipitation over India (Rajeevan et al 2012, Yadav et al 2013, Zhang and Zhou 2019). It is reported that precipitation is strongly influenced by variations in coupled ocean-atmosphere interactions, measured as climate indices (Shi and Wang 2019, Das et al 2020). There are multiple climate indices measured from the Pacific, Indian, and Atlantic oceans. These large-scale indices contain predictive information that helps to understand the precipitation variability over different monsoon regions. Several studies have attempted to investigate the effect of climate indices on precipitation in India and around the world. The relationship between precipitation and climate indices such as the El Nino Southern Oscillation (ENSO), Indian Ocean Dipole (IOD), Southern Annular Mode (SAM), and Pacific Decadal Oscillation (PDO) was examined on a monthly scale over South Australia (He and Guan 2013). The Indian summer monsoon precipitation was predicted based on ENSO and the equatorial Indian Ocean oscillation, and it was found that the monsoon prediction skill depends strongly on the skill with which the climate indices are predicted (Surendran et al 2015). The effect of climate indices (ENSO, PDO, IOD, North Atlantic Oscillation (NAO)) on precipitation was examined over India, and it was suggested that accounting for them would lead to better prediction of the monsoon (Wang et al 2000, Krishnamurthy and Kinter 2003, Krishnamurthy and Krishnamurthy 2014).
Over China, it is reported that ENSO has a major influence on Northeastern and Central China, and PDO on the eastern and southern parts of China (Chang et al 2019). Similarly, the climate indices IOD, Sea Surface Temperature (SST), multivariate ENSO index (MEI), Southern Oscillation Index (SOI), PDO, NAO, and Arctic oscillation (AO) were found to have an influence on precipitation over different regions of India (Das et al 2020). These studies found that teleconnections strongly affect precipitation variability at various spatio-temporal scales.
In these studies, the climate indices considered are usually knowledge-driven, i.e. based on the subjective decision of the modeler. Although all indices are relevant and provide information about precipitation, including too many teleconnections can reduce a model's performance. Moreover, the indices are often interdependent, i.e. one teleconnection's influence on precipitation is likely modulated by other indices. The influence of climate indices on precipitation also varies in space and time, i.e. a climate index that is predominant in some regions may not be in others. Therefore, it is vital to identify the appropriate climate indices based on the geographical location and type of study, accounting for the spatial variability of precipitation and reducing the dimensionality. We attempt to examine the variations of precipitation influenced by multiple climate indices to answer the following questions: (a) What are the dominant climate indices affecting precipitation over the Indian region? (b) Are there any regional patterns in the climate indices influencing precipitation?
The main objective of the study is the multivariate selection of climate indices over the Indian region based on the partial least square regression and variable importance of projection method (PLSR-VIP). This would help reduce dimensionality and select appropriate indices influencing the precipitation regionally. The remainder of the paper is organized as follows: section 2 presents the data and methods used, section 3 presents the results and discussion of the proposed methodology, and section 4 highlights the study's conclusions and the scope for further research.

Study area
The study area covers the whole Indian region, between 6°44′ and 35°30′ N latitude and 68°7′ and 97°25′ E longitude. It is of large geographical size and has been classified into six distinct climate zones based on the Köppen climate classification system (Peel et al 2002). It comprises tropical wet (Western Ghats, eastern and western coastal areas, West Bengal), tropical wet and dry, semi-arid (covering the largest part of India: the entire peninsula and parts of Northern India such as Punjab and Haryana), arid (North-Western India, comprising some parts of Rajasthan and Gujarat), and humid subtropical (covering the states of Punjab, Bihar, Madhya Pradesh, and Orissa) zones. The six homogeneous regions as defined by the India Meteorological Department (IMD) are shown in figure 1. They are the (a) Central Northeast, (b) hilly, (c) Northeast, (d) Northwest, (e) South Peninsular, and (f) West Central regions.

Climate data
Precipitation data
In our study, the observed gridded daily precipitation dataset from IMD, with a spatial resolution of 0.25° × 0.25° for 1951-2020, has been used. It is arranged in 135 × 129 grid points and constructed from observational records at gauged stations using inverse distance weighted spatial interpolation (Shepard 1968). The high spatial density of gauge stations captures the precipitation variability across India more accurately than other gridded datasets (Pai et al 2014). Studies have used this dataset for climatic downscaling and bias correction (Sehgal et al 2018, Smitha et al 2018), analyzing extreme precipitation (Vinnarasi and Dhanya 2016), and regionalization (Guntu et al 2020). Additionally, this dataset can accurately depict orographic influences, making it highly credible for regional studies (Vinnarasi and Dhanya 2016). There is a total of 17415 grid locations, out of which 4964 are land grids for India. The monthly series at each grid location are extracted, and the analysis is carried out for land grids only.
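The extraction of monthly series at land grids can be sketched as follows. This is only an illustrative NumPy sketch on a toy grid: the function name `monthly_totals`, the toy land mask, and the choice of monthly totals (rather than monthly means) are assumptions, since the text does not state how the daily values were aggregated.

```python
import numpy as np

def monthly_totals(daily, month_index, n_months):
    """Sum a daily field of shape (days, lat, lon) into (n_months, lat, lon)."""
    monthly = np.zeros((n_months,) + daily.shape[1:])
    # Accumulate every daily field into the slot of the month it belongs to.
    np.add.at(monthly, month_index, daily)
    return monthly

# Toy data: 59 days (January and February of a non-leap year) on a 4 x 3 grid.
rng = np.random.default_rng(0)
n_lat, n_lon = 4, 3
daily = rng.gamma(shape=0.5, scale=8.0, size=(59, n_lat, n_lon))
month_index = np.repeat([0, 1], [31, 28])  # month number of each day

# Hypothetical land mask; the real one would flag the 4964 land grids.
land_mask = np.zeros((n_lat, n_lon), dtype=bool)
land_mask[1:3, :2] = True

monthly = monthly_totals(daily, month_index, n_months=2)
# Keep land grids only: one monthly series per land grid, shape (months, grids).
land_series = monthly[:, land_mask]
```

Each column of `land_series` is then the response series for one grid location in the subsequent analysis.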

Multivariate selection of climate indices
Several mathematical strategies are adopted for selecting the input variables/predictors when there is interdependency and a need to reduce the dimensionality. In particular, multivariate techniques such as principal component analysis (PCA) and factor analysis are widely used to select only a subset and eliminate highly redundant variables (Raziei 2018, Fouad and Loáiciga 2020). However, the limitations of these techniques are: (a) the response variable is not considered while selecting the input variables, (b) spatial variation is not represented, and (c) need for dimensionality reduction. This leads to difficulty in interpreting the models spatially. This problem could be alleviated using the PLSR technique, which investigates the relationship between input variables (such as climate indices) and response variables (i.e. precipitation) (Wold et al 2001, Shawul et al 2019). In particular, when there are multiple climate indices and one response variable (precipitation), the interpretation of one multivariate model would be different from that of many univariate models. Our study adopts the PLSR technique for spatial multivariate selection of climate indices influencing precipitation over India.
The added value of the PLSR technique over PCA and other regression techniques is twofold. (a) PLSR identifies the relevant climate indices influencing precipitation while accounting for the spatial variability. PCA is an alternative method when a large number of correlated independent predictor variables exist, but it constructs components that explain the variability of the climate indices independently of precipitation and does not capture the spatial variation. PLSR, in contrast, considers the response variable (precipitation), which leads to fewer components and provides parsimonious models with low complexity. It is to be noted that PLSR focuses on covariance, and PCA on variance. For example, with 17 climate indices, PCA selects the same indices at all grid locations irrespective of their influence on precipitation, as it does not account for the precipitation/response variable, whereas PLSR accounts for the precipitation variability and selects only the relevant climate indices influencing precipitation at each grid location. (b) Although multivariate regression can be used to model the relationship between multiple predictors and response variables, it cannot handle multicollinearity. The climate indices are usually interdependent and correlated, i.e. the influence of one teleconnection on precipitation can be modulated by other teleconnections. This can be addressed using the PLSR technique, which combines the features of PCA and multiple linear regression to reduce the dimensionality of correlated predictor variables while accounting for the interdependency.
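The contrast between PCA (variance of the predictors only) and PLSR (covariance with the response) can be illustrated with a small NumPy sketch on synthetic data. Everything here is hypothetical: the five synthetic "indices", the two synthetic responses standing in for two grid locations, and the helper names `first_pca_direction` and `first_pls_weight`. The first PLS weight is taken as the normalized covariance vector between predictors and response, which is the standard first step for a single response.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 240, 5

# Synthetic predictors; columns 0 and 1 are made strongly correlated.
C = rng.standard_normal((n, p))
C[:, 1] = C[:, 0] + 0.1 * rng.standard_normal(n)
C = (C - C.mean(axis=0)) / C.std(axis=0)

def first_pca_direction(C):
    """Leading principal direction: depends on the predictors only."""
    _, _, vt = np.linalg.svd(C, full_matrices=False)
    return vt[0]

def first_pls_weight(C, P):
    """First PLS weight: direction of maximum covariance with the response."""
    w = C.T @ (P - P.mean())
    return w / np.linalg.norm(w)

# Two hypothetical grid locations driven by different predictors.
P_a = 2.0 * C[:, 2] + 0.3 * rng.standard_normal(n)
P_b = 2.0 * C[:, 4] + 0.3 * rng.standard_normal(n)

pca_dir = first_pca_direction(C)   # identical for both locations
w_a = first_pls_weight(C, P_a)     # dominated by predictor 2
w_b = first_pls_weight(C, P_b)     # dominated by predictor 4
```

The PCA direction is a function of the predictor matrix alone, so it cannot differ between grid locations, whereas the PLS weights change with the response at each location.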

PLSR-VIP
PLSR is a suitable multivariate regression technique to account for interdependencies between teleconnections and precipitation and so improve model performance and prediction capabilities (Wold et al 2001, Shawul et al 2019). The relationship between teleconnections and precipitation is inferred based on the regression coefficients and weights. VIP is a filtering method for PLSR used to select the climate indices that strongly influence precipitation. Let 'C' (n × p) be the matrix of predictor/input variables (climate indices) and 'P' (n × r) be the matrix of the response variable (precipitation), where 'n' is the number of observations, 'p' is the number of climate indices, and 'r' is the number of response variables (one here, for the univariate precipitation response).
The outer relationship between 'C' and 'P' with 'K' latent variables after decomposition is expressed by equations (1) and (2):
$$C = T L^{T} + E \tag{1}$$
$$P = U M^{T} + F \tag{2}$$
where T (n × K) and U (n × K) are the matrices of latent variable scores of 'C' and 'P', respectively, 'L' (p × K) is the matrix of C-loadings, 'M' (r × K) is the matrix of P-loadings, and 'E' and 'F' represent the error terms associated with 'C' and 'P', respectively. The inner relationship between 'C' and 'P' is expressed by regression of the P-score against the C-score and is shown in equation (3):
$$u_k = b_k t_k + h \tag{3}$$
where $u_k$ and $t_k$ are the column vectors of scores for 'P' and 'C', respectively (equations (1) and (2)), $b_k$ is the regression coefficient of the inner relationship, and 'h' represents the error in the inner relation. The PLSR-VIP score for the jth climate index is calculated from the regression coefficient $b_k$, weight vector $w_k$, and score vector $t_k$ for 'p' climate indices using equation (4):
$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{k=1}^{K} b_k^{2}\, t_k^{T} t_k \left( w_{jk} / \lVert w_k \rVert \right)^{2}}{\sum_{k=1}^{K} b_k^{2}\, t_k^{T} t_k}} \tag{4}$$
The PLSR-VIP measures the relevance of individual climate indices by considering the covariance between C and P in the kth dimension, as expressed by the squared normalized weight $w_{jk}^{2}$, weighted by the proportion of P that is explained by the kth dimension ($b_k^{2} t_k^{T} t_k$). The average of the squared PLSR-VIP scores is equal to one; hence the 'VIP scores greater than one' rule is typically employed as a criterion for predictor selection. This criterion has been suggested in several studies (Afanador et al 2013, Mukherjee 2017), as a higher VIP score indicates a more significant contribution, while a lower VIP score indicates a lesser contribution. If all the VIP scores are equal to one, then all the variables have the same effect on the response variable. In this study, PLSR-VIP was implemented using MATLAB software.
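For readers who want to see the VIP calculation end to end, the following is a minimal NumPy sketch for a single response, using the NIPALS form of PLS regression. It is not the authors' MATLAB implementation: the function name `pls1_vip`, the choice of three latent variables, and the synthetic predictors and response are all assumptions made for illustration.

```python
import numpy as np

def pls1_vip(C, P, n_components):
    """VIP scores from a single-response PLS fit (NIPALS algorithm).

    C: (n, p) predictor matrix, P: (n,) response vector.
    """
    C = (C - C.mean(axis=0)) / C.std(axis=0)   # standardize predictors
    y = P - P.mean()                           # center response
    p = C.shape[1]
    W = np.zeros((p, n_components))            # unit-norm weight vectors w_k
    ss = np.zeros(n_components)                # b_k^2 * t_k' t_k per component
    for k in range(n_components):
        w = C.T @ y
        w /= np.linalg.norm(w)                 # normalize the weight vector
        t = C @ w                              # score vector t_k
        tt = t @ t
        b = (t @ y) / tt                       # inner regression coefficient b_k
        loading = C.T @ t / tt                 # predictor loading
        C = C - np.outer(t, loading)           # deflate predictors
        y = y - b * t                          # deflate response
        W[:, k] = w
        ss[k] = b ** 2 * tt
    # Weighted average of squared weights, scaled by the number of predictors.
    return np.sqrt(p * (W ** 2 @ ss) / ss.sum())

# Synthetic example: 840 months, 17 predictors, response driven by two of them.
rng = np.random.default_rng(42)
C = rng.standard_normal((840, 17))
P = 1.5 * C[:, 0] - 1.0 * C[:, 3] + 0.5 * rng.standard_normal(840)

vip = pls1_vip(C, P, n_components=3)
selected = np.flatnonzero(vip > 1.0)           # 'VIP greater than one' rule
```

Because each weight vector is normalized to unit length, the mean of the squared VIP scores equals one by construction, which is what motivates the threshold of one.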

Selection of climate indices
Using the PLSR-VIP approach, the study aims to examine the impact of climate indices on precipitation over the Indian subcontinent from 1951 to 2020. The climate indices and precipitation at a monthly scale are considered as predictors and response variables, respectively. VIP scores based on PLSR weights are used to infer the quantitative relationship between individual climate indices and precipitation and to determine the relatively significant covariates among all components. The climate indices (predictors) with VIP scores greater than one are considered to significantly influence precipitation. The relevant climate indices are selected among the 17 indices at each grid location over the study region. The VIP scores of climate indices at all 4964 grid locations are not presented for brevity. Instead, the VIP scores of the climate indices at a few grid locations (grid numbers 10, 500, 1200, 2000, 3600, and 4800) are shown in figure 2. It can be observed that at grid location 1200 (figure 2(c)) NAO, AO, QBO, Nino 4, Nino 1 + 2, and TNI are selected, whereas at grid location 4800 (figure 2(f)) NAO, DMI, Nino 4, Nino 3, and TNI are selected. Similarly, a different set of climate indices that significantly influence the precipitation is identified at each of the grid locations.
The percentage of grid locations at which each climate index is selected (VIP > 1) is shown in figure 3. It is observed that the climate indices related to ENSO, namely Nino 4 (92%), Nino 1 + 2 (87.7%), and TNI (90%), are selected at a greater number of grid locations. In addition, the AMO (72.5%), QBO (60%), AO (48%), NAO (29%), and GLBT (21.7%) are also frequently selected. In comparison, the other indices, such as SOI (4%), PDO (13.5%), SAM (3%), DMI (7.2%), PNA (1.3%), Nino 3 (6.3%), MEI (4.5%), Nino 3.4 (3%), and ONI (2.6%), are selected at a smaller number of grid locations. The number of relevant climate indices selected at each grid location among the 17 indices is shown in figure 4. It is observed that the number of identified climate indices at a grid location varies from a minimum of one to a maximum of ten. Five climate indices are selected at about 40% of grid locations, six indices at 27% of grid locations, four and seven indices at about 14% of the grid locations each, and eight indices at about 2.5% of grid locations. This clearly indicates that not all 17 indices are selected; only the relevant climate indices influencing precipitation are identified at each grid location. The selected climate indices influencing the precipitation are non-uniform/heterogeneous throughout the study region. Among all indices, Nino 4, Nino 1 + 2, TNI, AMO, QBO, AO, NAO, and GLBT are identified at a greater percentage of grid locations over India. Further, accounting for these predominant climate indices would lead to simple and interpretable models.
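The two summaries reported here, the percentage of grid locations at which each index is selected and the number of indices selected per grid location, follow directly from a matrix of VIP scores. The sketch below uses a small made-up VIP matrix (four grid locations, three unnamed indices) and a hypothetical helper `selection_summary`; the values are illustrative and are not results from the study.

```python
import numpy as np

def selection_summary(vip, threshold=1.0):
    """Summarize selection from a VIP matrix of shape (n_grids, n_indices)."""
    selected = vip > threshold                    # boolean selection mask
    pct_per_index = 100.0 * selected.mean(axis=0) # % of grids selecting each index
    n_per_grid = selected.sum(axis=1)             # indices selected at each grid
    return selected, pct_per_index, n_per_grid

# Made-up VIP scores: rows are grid locations, columns are climate indices.
vip = np.array([
    [1.8, 0.4, 1.2],
    [1.1, 0.9, 0.7],
    [2.0, 1.3, 0.2],
    [0.6, 0.5, 1.9],
])

selected, pct_per_index, n_per_grid = selection_summary(vip)
# pct_per_index -> 75%, 25%, 50% for the three columns
# n_per_grid    -> 2, 1, 2, 1 indices at the four grid locations
```

A histogram of `n_per_grid` gives the kind of distribution described for figure 4, and `pct_per_index` gives the kind of bar heights described for figure 3.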

Influence of climate indices in homogenous monsoon regions
The influence of climate indices on precipitation over India's six homogenous monsoon regions is investigated to understand the regional patterns. The descriptive statistics of monthly precipitation in each region are presented in table 1. The places receiving the highest precipitation, such as the Western Ghats and Meghalaya, fall under the West Central and Northeast regions. The climate indices selected over the study region and within each homogenous monsoon region are shown in figures 5 and 6, respectively. It is observed that among all the climate indices, Nino 4, Nino 1 + 2, AMO, and …
The ENSO comprises the indices SOI, Nino 4, Nino 3, Nino 1 + 2, Nino 3.4, ONI, MEI, and TNI. These indices represent the ocean-atmosphere fluctuations measured based on SST and pressure in the tropical Pacific (Zelle et al 2004, Shi and Wang 2019). It is well known that the ENSO indices are strongly associated with Indian monsoon precipitation. The difference in temperature and pressure results in an increase/decrease in moisture transport to the landmass, leading to floods/droughts. The ENSO cycle is thus the most prominent ocean-atmosphere phenomenon on interannual time scales associated with precipitation variability across India (Ashok et al 2004). Our results show that Nino 4, Nino 1 + 2, and TNI are selected at a greater number of grid locations over the study region (>80%). The SOI and Nino 3.4 are selected in the South Peninsular region (16%-18%) and less in other regions (0%-5%). The Southern regions …
The NAO and AO indices are reported to influence seasonal climate variables across the Northern Hemisphere (Thompson and Wallace 1998). A few studies analyzed the influence of NAO/AO and reported an influence on precipitation over north India (Yadav et al 2013). In our study, the NAO and AO show a similar pattern and are found to be relevant mainly in the Northeast region (>85%), the hilly region, and some parts of the South Peninsular region.
The PDO represents SST anomalies in the North Pacific, is an important oscillation in the Northern Hemisphere, and is reported to modulate the El Nino oscillations influencing the Indian precipitation variability (Krishnamurthy and Krishnamurthy 2014). Our findings suggest that it is predominant on the eastern coast of Southern India, in the Northeastern regions, and in a few parts of the hilly regions. The relationship between QBO and Indian precipitation is not explored much in the literature. The influence of PDO and QBO on the Indian summer monsoon has been examined, and it was found that the phases of PDO and QBO may cause flood/drought events over India (Bhatla et al 2020). In our study, QBO is predominant in the Central Northeast, Northwest, and West Central regions (>60%).
Studies showed that the AMO strongly correlates with monsoon precipitation (Kucharski et al 2007, Krishnamurthy and Krishnamurthy 2016). In our study, the AMO is predominant in all homogeneous monsoon regions except a few parts of the eastern coast and Northeast India. It has a greater influence on precipitation in the Northwest and West Central regions (>80% of grid locations). The GLBT climate index represents global warming and was reported to influence precipitation changes (Trenberth 2012). The warm moist air is transported to land and drawn into the regional weather systems, leading to precipitation variability (Groisman et al 2005, IPCC 2007, Trenberth 2011). Moreover, the IOD/DMI significantly influences the precipitation over India and weakens the impact of ENSO on the summer monsoon whenever both occur in the same phase (Ashok et al 2001, 2004). Our results show that SOI, PDO, DMI, Nino 3, and Nino 3.4 mainly contribute to the precipitation variability on the eastern coast of South Peninsular India (figure 5).
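Regional percentages of the kind quoted in this section (for example, an index being selected at more than 80% of grid locations in a region) amount to a group-by over the selection mask. The sketch below assumes a boolean selection matrix and one region label per grid location; the helper `regional_selection_pct`, the toy mask, and the label assignment are hypothetical and do not reproduce the study's values.

```python
import numpy as np

def regional_selection_pct(selected, region_labels):
    """Percentage of grids in each region at which each index is selected."""
    result = {}
    for region in np.unique(region_labels):
        in_region = region_labels == region       # grids belonging to the region
        result[str(region)] = 100.0 * selected[in_region].mean(axis=0)
    return result

# Toy selection mask: five grid locations, two climate indices.
selected = np.array([
    [True, False],
    [True, True],
    [False, True],
    [False, False],
    [True, False],
])
# Hypothetical assignment of each grid location to a homogeneous region.
region_labels = np.array([
    "Northeast", "Northeast",
    "South Peninsular", "South Peninsular", "South Peninsular",
])

pct = regional_selection_pct(selected, region_labels)
```

Here `pct["Northeast"]` holds one percentage per index for the two Northeast grids, and likewise for the South Peninsular grids.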

Summary and conclusions
The PLSR-VIP technique is used for the multivariate selection of appropriate climate indices influencing precipitation on a monthly scale. The study's significant findings are: (a) among all the indices, Nino 4, Nino 1 + 2, TNI, AMO, QBO, AO, and NAO influence precipitation at a greater number of grid locations over the Indian region. A different set of climate indices that significantly impacts the precipitation is identified for each grid location, and the selection is non-uniform throughout the study region; and (b) on investigating the influence of climate indices in the six homogenous regions, it is found that Nino 4, Nino 1 + 2, and TNI are frequently selected in all regions. The AMO dominantly influences precipitation in the Northwest and West Central regions (>80%). The SOI and Nino 3.4 of the tropical Pacific are selected more often in the South Peninsular region and less in other regions. The NAO and AO show a similar pattern and are found to be relevant in the Northeast, hilly, and South Peninsular regions.
The outcomes of the study will help to understand the effect of large-scale teleconnections on precipitation variability over India at each grid location, over the country as a whole, and regionally. Accounting for the predominant climate indices in modeling would lead to simple and interpretable predictive models. The selected climate indices can be used as covariates for the non-stationary modeling of precipitation. This study would help reduce the subjectivity and drudgery involved in identifying climate indices associated with regional precipitation variability when modeling various water resource applications. However, this study mainly focuses on identifying the statistically relevant climate indices, and future studies may investigate the physical mechanisms linking climate indices and precipitation. Further studies can explore the compound effect of low-frequency climate indices and local attributes on precipitation variability.

Data availability statement
The data that support the findings of this study are openly available. The teleconnection data are available online at https://psl.noaa.gov/data/climateindices/list/. The gridded rainfall data are available from www.imdpune.gov.in/Clim_Pred_LRF_New/Grided_Data_Download.html.

No. | Climate index | Description
9 | Nino 4 | SST anomalies averaged over an area from 5 S to 5 N and 160 E to 150 W
10 | Nino 3 | SST anomalies averaged over an area from 5 S to 5 N and 150 W to 90 W
11 | Nino 1 + 2 | SST anomalies averaged over an area from 0 to 10 S and 90 W to 80 W
12 | Atlantic Multidecadal Oscillation (AMO) | Area-averaged SST in the North Atlantic
13 | Global Averaged Temperature (GLBT) | Averaged land-ocean temperature index over the globe
14 | Multivariate ENSO Index (MEI) | Characterizes ENSO's intensity and is the first principal component of six different atmospheric parameters
15 | Nino 3.4 | SST anomalies averaged over an area from 5 S to 5 N and 170 W to 120 W
16 | Oceanic Nino Index (ONI) | Running mean (3 months) of SST changes over an area of 5 S-5 N and 170-120 W
17 | Trans Nino Index (TNI) | Running mean (5 months) of the difference between Nino 1 + 2 and Nino 4

References
Afanador N L, Tran T N and Buydens L M C 2013 Use of the bootstrap and permutation methods for a more robust variable importance in the projection metric for partial least squares regression Anal.