A statistical and numerical modeling approach for spatiotemporal reconstruction of glaciations in the Central Asian mountains

Reconstructing Quaternary regional glaciations throughout the Himalaya, Tibet, and the adjoining mountains in Central Asia is challenging due to geological biases towards limited preservation of glacial deposits and chronological uncertainties. Here, we offer several statistical and mathematical model codes in R, in excel, and in MATLAB useful to develop regional glacial chronostratigraphies, especially in areas with distinct orographically-modulated climate. A complete R code is provided to generate a regional climate map using Cluster Analysis (CA) and Principal Component Analysis (PCA). Additional R codes include reduced chi-squared, Chauvenet's criterion, radial plotter/abanico plot, finite mixture model, and Student's t-test. These methods are useful in reconstructing the timing of local and regional glacial chronologies. An excel code to calculate equilibrium-line altitudes (ELAs) and steps to reconstruct glacier hypsometry are also made available to further aid to our understanding of the extent of paleoglaciations. A MATLAB code of the linear glacier flow model is included to reconstruct paleotemperatures using the length and slope of a glacier during past advances.• R statistical codes can be used/modified without restrictions for other researchers.• Easy steps to calculate ELAs and glacier hypsometry from the same data.• Paleo-temperature reconstruction utilizes already developed glacial chronologies and maps.


Method details
With the increased chronological reconstructions over the past two decades throughout the glaciated mountains of Central Asia ( Fig. 1 ), it is now possible to use different statistical and numerical models effectively to reconstruct regional glacial stages [2 , 3 , 4] . Here, we offer R, Excel, and MATLAB codes of several visual, statistical, and numerical models that include Cluster Analysis (CA), Principal Component Analysis (PCA), reduced chi-squared test ( χ 2 ), Chauvenet's criterion, radial plotter, abanico plot, finite mixture model, Student's t -test, equilibrium-line altitudes (ELAs) and glacier hypsometry, and linear (inverse) glacier flow model. These methods may be broadly categorized into climate statistics (e.g., CA, PCA), age statistics (e.g., χ 2 , Chauvenet's criterion, radial plotter, abanico plot, finite mixture model, t -test), and morphometric models (e.g., ELAs, hypsometry, flow model). Our objectives are (i) to identify climatically distinct groups of glaciers that are modulated by topography and orography; and (ii) to develop robust regional glacial stages in each climatic region for spatiotemporal comparison. The codes presented here are straightforward and can be easily modified or adapted to a more complex computational environment and a wide variety of geological settings. These codes are also powerful to help extract both temporal and spatial aspects of paleoglaciations, offering critical insight into the regional pattern of paleoglaciations.
We aim that these methods and codes would help quantify chronological data, especially local and regional moraine ages, in a much easier and robust way to the new user. All the necessary data and templates are provided in the supplementary materials. The software programs used to generate the codes are either publicly available (e.g., R and R packages) or readily purchasable (e.g., Microsoft Excel, MATLAB). The only exception is the Read ArcGrid program, which could be obtained by request since it is not proprietary to this study.

Data extraction
Both CRU CL 2.0 (10 latitude/longitude) reanalysis temperature ( T in °C ) data ( https://crudata. uea.ac.uk/cru/data/hrg/tmc/ ) and TRMM derived precipitation ( PCP in mm ) data (4-km-horizontal x 250-m-vertical; http://www.geog.ucsb.edu/~bodo/TRMM/ ) are first rescaled to 18 × 18 km grid cell for  10 Be and 26 Al dated glaciated valleys across the Himalayan-Tibetan orogen. 10 Be and 26 Al dates of each studied valleys are color-coded based on the climatic zone map shown in Fig. 3 B. See Saha et al. [3] for more details. Age data were compiled from http://expage.github.io/data.html . each month in ArcGIS 10.5; averaged over 30 and 12 years, respectively. CRU CL 2.0 precipitation reanalysis data were not used due to high uncertainties for high-altitude places [5] . Approximately 42,511 glacier polygons (vector samples), derived from the Randolph Glacier Inventory (RGI 6.0), are then converted into points using the Feature to Point tool in ArcGIS 10.5 ( Fig. 2 ). Subsequently, climate data are extracted for each point (glacier/sample) from the gridded datasets using the Extract Multi Values to Points tool ( Fig. 2 ). Nine climate parameters are derived from the primary data for each glacier (Supplementary Table S1). These are: (iv) 5 Total summer (monsoon) pecipitation ( PCP s ( x ) = 4 x =1 P CP (x )) for JJAS (v) 6 Total winter (westerlies) precipitation ( PCP w ( x ) = 4 x =1 P CP (x )) for DJFM where PCP is mean annual precipitation [5] 8 Relative Entropy (RE = 12 m =1 P CP m (x ) log 2 (12 PCP m ( x ))) where m is a month and P CP m (x ) = P CP (x ) PCP [6] and 9 Dimensionless Seasonality Index ( DSI = RE PCP R ) (ix) Where R is a constant scaling factor introduced to make precipitation dimensionless. We choose R as the maximum of ( 12 x =1 P CP (x )) among all sample glaciers (see supplementary table S1).
SCI indicates relative precipitation variability in each month compared to the mean annual [6] . RE is likely associated with the number of 'wet' months and reaches its maximum value of log 2 12 when annual precipitation is concentrated in one single month [7] . According to definition (ix), DSI is zero when either PCP (completely dry location) or RE (PCP distributed uniformly throughout the year) are zero and maximum ( log 2 12 ) when R is concentrated in a single month [7] . Due to the nature, units, and dissimilar ranges of the climate parameters, all data, except DSI , are scaled to lie between 0 and 1, by diving by the maximum value in the sample population for respective parameters (e.g., T ( x ) /T ( x ) max . ); Supplementary Table S1).

Statistical model
We performed CA [8] using all nine climate parameters extracted using glaciers as points (Supplementary Table S1). CA is often favored to classify samples (e.g., glaciers) based on the degree of similarity among them given a defined set of parameters [9 , 8] . Euclidean (linear) distances between any pair of glaciers in a multidimensional (in our case no. of dimensions are nine) Cartesian space is calculated in R [10] . Using the unweighted pair-group method with arithmetic averaging (UPGMA; [11] ), finally, a cluster dendrogram is produced by grouping together the most similar (i.e., least linear distances) sets of glaciers ( Fig. 3 A). In addition, the significance of our clusters is tested using Pearson's product-moment correlation between the sample Euclidean and cophenetic distances [12] . A higher positive correlation is expected for our CA groupings to be validated [9] . We also performed analysis of similarities (ANOSIM; [13] ) between sample Euclidean distances and CA defined groups (with 600 permutations) to calculate R . Where the value of R (constrained between -1 to 1) is more positive, the samples within groups are more similar than would be expected by random chance [9] ( Fig. 3 B). A similar statistical model was successfully applied to Andean glaciers by Sagredo and Lowell [9] . In Saha et al. [3] , we conclude that this type of model equally works best for Himalayan-Tibetan glaciers and is therefore useful in comparison to the existing manual delineation based on arbitrary topographic ranges [2 , 14] .
Following is the R code for CA. The Excel sheet in the supplementary Table S1 titled "For_R" can be used as a template; the current R code is designed to read a comma-delimited ( * csv) file format. Note that the analyzed climate data provided in S1 is dense, and hence, the computation is performed at the Ohio Supercomputer Center ( [15] ; https://www.osc.edu/ ). Readers who intend to use/modify the code must reduce the data volume in Table S1 (sheet For_R) to try in the general computer system. # Cluster (CA) hierarchical model #-------------------------------------------------------------------# Setting up the directory and file input ## "Directory need to change if using in other computer" setwd("C:/users/folder_name/") # file input clima < -read.csv(file.choose()) # read the csv file head(clima) # for quick look of the file to see the variable names # Notes: the variables must be transformed into dimensionless values # using a maximum of each climatic variables; values range from 0 to 1 # Note: Log transformation can be used if the data is skewed; here no log transformation is attempted since that negatively affect thegrouping # Check the installation of these necessary packages #install package "vegan" install.packages("vegan") library(vegan) #load vegan package #install package "rgl" install.packages("rgl") The cluster dendrogram shows groups of most identical glaciers with similar climatic characteristics (see Section 1.1). Each terminal of the hierarchical tree represents a single glacier, and they are joined with each other based on a certain level of similarity; most similar glaciers are clustered together at each hierarchy. Five (climate) groups of glaciers are first identified using CA. (B) A climate zone map (superimposed over a hillshade map) is generated using the interpolation method and using each glacier's CA grouping. The dotted line represents the mid-latitudinal westerlies limit (from the west). Further subdivisions (defined as 'Climatic zones') of glaciers based on the meridional limit of westerlies are proposed in Saha et al. [3] . For example, group 1 is subdivided into climatic zone 1a (i.e., Arid and semiarid Transhimalaya, northwest Tibet, Pamir, and the Tian Shan) and zone 1b (i.e., Arid and semiarid southern and northeastern Tibet). Group 4 and 5, as identified in the CA dendrogram, are discarded from further analysis since they only represent end members in PCA analysis (C). Due to no 10 Be and 26 Al dates in group 3, no subdivisions are made. See Saha et al. [3] for a detail explanation of the scheme. The color-coding of circles in Fig. 1        CA results obtained using the R code were further evaluated using PCA ( Fig. 3 C). Such a comparison using two different statistical methods is not uncommon and proved successful in other fields [16] . We performed PCA to correlate climate parameters and represent them as a smaller set of uncorrelated (orthogonal) variables called principal components (PCs; Fig. 3 C) [9] . PCs geometrically capture the directions of maximum variation in the data [3] and therefore, are used to compare and refine our CA generated groups ( The summary method describes the importance of PCs.

Local glacial stages
True moraine ages or local glacial stages (e.g., Saha et al. [3] ) may be assessed using a variety of statistical methods. For example, reduced chi-squared ( χ 2 ) statistics (or mean square weighted deviation [MSWD]) is often applied to assess the distribution of apparent moraine ages [ 17 , 18 ]. Any age population with χ 2 > 1 indicates poor goodness of fit indicating large scatter/outliers; χ 2 > 1 but < 95% confidence interval indicates acceptable outliers or acceptable goodness of fit between observations and estimates. Outliers may also be detected and removed from the age population using other statistical techniques, including Chauvenet's criterion [19] . Since numerical ages are assumed to have normal distribution probabilities, Chauvenet's criterion is a powerful tool. Other quantitative methods commonly applied to detect outliers, but not shown here, include skewness [ 17 , 18 ], probability density estimates [20] , two standard deviations from the mean [21] , two mean absolute deviations from the median [22] , and generalized extreme Students deviates [23] . We encourage the readers to explore these methods as well to cross-validate their results. However, note that these are entirely statistical treatments that do not consider the background geological conditions. We therefore argue that readers must evaluate their statistically detected outliers with detail field observations before removing them from further analysis [3] .
Outlier free apparent moraine ages may be used to estimate mean moraine ages or true moraine ages [ 17 , 18 ] or local glacial stages [3] . Several popular measures of central values are the arithmetic mean ± standard deviation (sd), weighted mean ± weighted sd, and median ± sd ( Fig. 4 ) [ 17 , 18 ]. The apparent ages can be visualized through the kernel density (relative probability) plots and histogram ( Fig. 4 ). Additionally, skewness and kurtosis may also be useful to infer whether ages have longer The sample age with χ 2 > 2 (i.e., ~19 ka) is identified as an outlier and must be removed from further analysis. Here, the summary statistics include the outlier. Readers must note that the summary statistics must be used on outlier free distribution for publication purposes. tail distribution or positively skewed (possibly inheritance) or younger tail distribution or negatively skewed (possibly incomplete exposure) (see [17,18] for further explanation).
Note that most of the statistical methods stated here work for ≥3 apparent ages on a moraine. Since chronological studies often use limited discrete ages/samples, cautioned must be performed in applying these statistical techniques and interpreting the ages. Below is a compilation of the statistical methods stated above in R modified from different sources. For example, the χ 2 test is modified after Applegate et al. [18] ; Chauvenet's criterion after Taylor [19] ; R luminescence package from https: //CRAN.R-project.org/ package = Luminescence. Supplementary Table S2 can be used as a template.

Regional glacial stages
Once local glacial stages are defined and organized according to the climatic groups/zones discussed in Section 1, discrete regional glacial stages may be identified using a variety of visual and statistical techniques. For example, Radial plots ( Fig. 5 A) [25] and Abanico plots ( Fig 5 B) [26] are visually useful to identify distinct subpopulations. These subpopulations can also be detected quantitatively using the finite mixture model ( Fig. 5 C) [ 25 , 27 ] and Student's t -test [28] . The following R codes incorporate these visual and statistical tools [3] . Use Table S2 as a template where the first column is the mean moraine age in ka, and the 2nd column is ± 1 σ (internal error) in ka.

Equilibrium-line altitudes (ELAs) and glacier hypsometry
Osmaston [29] detailed the steps to estimate ELAs using area-altitude (AA), area accumulation ratio (AAR), and toe-headwall accumulation ratio (THAR) methods. Choosing the correct ratios (or a combination) for AAR and THAR are dependent primarily, among other things, on the topography, size of the glacier, and microclimate [30] . Saha et al. [3] , e.g., obtained AAR and THAR ratios from literature for each climatic group (see Section 3.3 in Saha et al. [3] for details).
We mapped present and past glaciated areas from Google Earth and Landsat ETM + images in ArcGIS 10.5 ( Fig. 6 A; see also supplementary file S3) [31] . Those maps (vector layers) were used to extract Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global DEMs using the 'Extract by Mask' tool in ArcGIS 10.5 ( Fig. 6 B). Extracted (raster) DEMs were converted into ASCII files to aid the subsequent steps. The raster DEMs were also converted into triangular irregular networks ('TIN' in ArcGIS) layers to calculate the 3D surfaces of glaciers ( Fig. 6 C). Glaciated surface areas were estimated using the TIN layer and 'Interpolate Polygon to Multipatch' tool in ArcGIS. We also generated glacier's hypsometry [32] using the previously generated ASCII file (see supplementary Fig. S5 in Saha et al. [3] ). The Read ArcGrid program developed by Professor David Nash of the University of Cincinnati was used to generate the glacier's hypsometry. The Read ArcGrid program is a simple java program that calculates the Elevation Relief Ratio (hypsometric integral) for a matrix of elevations using Pike and Wilson's method [32] . The hypsometric integral data and glaciated surface area can be directly copied and pasted into the ELAs excel file (highlighted in yellow in the supplementary Table S4). The excel file (Table S4 ELA calculation table) is prepared using the steps and recommendations of Osmaston [29] .

The linear inverse glacier flow model
We reconstructed paleotemperature changes relative to present from the linear change in glacier length using the first-order glacier dynamics model of Oerlemans [ 33 , 34 ] for each glacier. Since this simple linear inverse glacier flow model requires a limited number of variables, it is easy to reconstruct for a wide range of Central Asian glaciers. For example, the essential model variables are: present glacier (medial) length (L 0 in m), past (medial) lengths (L in m) based on the maximum extent of preserved moraines in the valley ( Fig. 6 A), mean slope of the glacier surface (s in%), present mean annual precipitation (P in m/year), time of moraine formation (t in a or year), and age uncertainties (e in a or year). Since the linear flow model considers climate sensitivity ( с ) and response time ( τ ) of the glacier to estimate the lag time, we argue that the model is a decent approximation for widely different climatic regions throughout the orogen.