Calibrated multi‐model ensemble seasonal prediction of Bangladesh summer monsoon rainfall

Bangladesh summer monsoon rainfall (BSMR), typically from June through September (JJAS), represents the main source of water for multiple sectors. However, its high spatial and interannual variability makes the seasonal prediction of BSMR crucial for building resilience to natural disasters and for food security in a climate‐risk‐prone country. This study describes the development and implementation of an objective system for the seasonal forecasting of BSMR, recently adopted by the Bangladesh Meteorological Department (BMD). The approach is based on the use of a calibrated multi‐model ensemble (CMME) of seven state‐of‐the‐art general circulation models (GCMs) from the North American Multi‐Model Ensemble project. The lead‐1 (initial conditions of May for forecasting JJAS total rainfall) hindcasts (spanning 1982–2010) and forecasts (spanning 2011–2018) of seasonal total rainfall for the JJAS season from these seven GCMs were used. A canonical correlation analysis (CCA) regression is used to calibrate the raw GCM outputs against observations, and the calibrated outputs are then combined with equal weight to generate the final CMME predictions. Results show that the CCA‐based calibration generates significant improvements over the individual raw GCMs, and that the CMME outperforms both the individual calibrated GCMs and the uncalibrated MME, in terms of the magnitude of systematic errors, Spearman's correlation coefficients, and generalised discrimination scores over most of Bangladesh, especially in the northern part of the country. Since October 2019, the BMD has been issuing real‐time seasonal rainfall forecasts using this new forecast system.


| INTRODUCTION
Located in sub-tropical South Asia, Bangladesh is one of the world's most densely populated countries. Bangladesh is characterised by a tropical monsoon-type climate, with a warm and rainy summer and a pronounced dry season in winter, features that make it highly vulnerable to the effects of interannual climate variability (Rahman & Lateh, 2015) and change (Huq, 2001). The country experiences a unimodal rainfall distribution, with most of the rainfall typically concentrated from June through September (JJAS). During this period, Bangladesh receives about 70% of its total annual rainfall, with a coefficient of variability of around 12% (Ahasan et al., 2010). Bangladesh's summer monsoon rainfall (BSMR) is highly variable spatially, exhibiting a general west-east climatological gradient in annual rainfall ranging from 1500 to 4400 mm (Montes, Acharya, Hassan, & Krupnik, 2021; Nashwan et al., 2019). This pattern of variability strongly shapes human livelihoods, especially in agriculture, which is a mainstay of the country's economy. For instance, crop management decisions are often disrupted and production losses often occur in Bangladesh as a consequence of the early or late arrival of rains, along with excess or deficient monsoon rainfall amounts (Nahar et al., 2018). Consequently, reliable BSMR forecasting at actionable time scales could play a significant role in the planning and management of agriculture and other activities such as flood management, urban planning, water-resource management and optimal operation of irrigation systems (Hansen et al., 2006; Montes, Acharya, Stiller-Reeve, Kelley, & Hassan, 2021).
Seasonal climate-prediction efforts in Bangladesh have been based mostly on statistical and empirical forecasting methods using Auto-Regressive Integrated Moving Average (ARIMA) models for rainfall and temperature prediction (Bari et al., 2015; Mahmud et al., 2017; Mohsin et al., 2012; Rahman & Lateh, 2015) or regression models of the teleconnections between rainfall and various predictors such as sea-surface temperature (SST; Hossain et al., 2019; Mannan et al., 2015; Rahman et al., 2013a). ARIMA models have been used to predict rainfall with lead times of up to 12 months (Mahmud et al., 2017), but the lack of statistically significant year-to-year autocorrelation can limit forecasting skill (Dahale & Singh, 1993). A more widely used approach has been the use of empirical relationships between observed BSMR and predictors such as sea-surface temperature, surface air temperature and pressure gradients (Hossain et al., 2019; Rahman et al., 2013b). For instance, the prediction of the monthly and seasonal frequency of rainy days and heavy rainfall events has been attempted using SST as a predictor (Mannan et al., 2015), and skill is higher than for the monsoon seasonal total amount, consistent with results from other parts of the world (Robertson et al., 2009). Nevertheless, the relatively weak teleconnection between sources of seasonal predictability such as El Niño-Southern Oscillation (ENSO) and seasonal climate in Bangladesh strongly limits the skill of these rainfall forecasts compared to other parts of the globe (Ahmed et al., 2017; Cash et al., 2017; Hossain et al., 2019). Due to the complexity of the diverse climate interactions in the vicinity of Bangladesh, non-linear and data-driven forecasting methods, such as artificial neural networks, adaptive neuro-fuzzy inference systems (ANFIS) and genetic algorithms, may have some advantages over linear methods (Banik et al., 2009) if sufficiently long time series are available.
State-of-the-art general circulation models (GCMs) that represent atmospheric processes provide an alternative non-linear, physically based approach to statistical modelling (Kang et al., 2004; Kang & Shukla, 2005). This approach may produce more accurate and reliable climate predictions compared to statistical models based on empirical relationships (mostly linear) from observational data (Barnston & Tippett, 2017). However, predictions from GCMs often require correction due to their inherent systematic biases (Acharya et al., 2013; Tippett et al., 2007; Wilks, 2002). Calibration methods can be used to modify the amplitudes of large-scale patterns, and also to refine the details of anomaly patterns for local downscaling (Acharya et al., 2021; Barnston & Tippett, 2017; Doblas-Reyes et al., 2005; Tippett et al., 2008; Wilks, 2017). In this sense, multiple efforts have been carried out to quantify the improvements in skill from GCMs after calibration over different regions worldwide. However, in Bangladesh, these efforts have focused on single-location approaches but not at the country level (e.g., Montes et al., 2022).
Officially, the Bangladesh Meteorological Department (BMD) is responsible for providing operational seasonal and monthly monsoon climate predictions to climate information users. BMD has used a subjective consensus approach based on meteorologists' experience to generate products using all available Global Producing Centres' forecasts and other available information. This subjectively based forecasting approach, however, has been found to be a poor fit for many decision makers interested in more reliable and objective forecasts. There is an increasing demand for high-resolution seasonal forecasts over Bangladesh at sufficient lead times to allow response planning from users in agriculture, hydrology, disaster management, energy, health, and other sectors. This demand has prompted research into the development of an objective seasonal forecast system following the World Meteorological Organization's (WMO) recently published seasonal-forecast guidance (WMO, 2020). The guidance advocates the use of an objective seasonal forecast procedure, defined as a traceable, reproducible, and well-documented set of steps that allows the quantification of forecast quality. The WMO has started to promote the adoption of such objective forecasting methods at Regional Climate Outlook Forums (WMO, 2017, 2020) and by National Meteorological and Hydrological Services. In response, an objective forecasting system was developed for seasonal forecasting for Bangladesh, similar to others recently developed around the world (Acharya, Dinku, et al., 2020; Acharya et al., 2021; IRI, 2020). This advanced forecast system enables calibration, combination, and verification of objective climate forecasts from the state-of-the-art GCMs of the North American Multi-Model Ensemble (NMME) project, and positions BMD to generate and deliver targeted climate information products that could be made relevant to the needs of decision-makers. Although multi-model-based methods have been explored for the Indian subcontinent (Acharya, Kar, et al., 2011; Kar et al., 2012; Rajeevan et al., 2012), this is the first time, to our knowledge, that they have been used for Bangladesh at the country level, aligned with BMD needs. As of October 2019, this new forecast system is used in real time by the BMD (http://live.bmd.gov.bd/p/ThreeMonth283/). Therefore, from an operational perspective, the potential benefits of this new forecasting system need to be assessed in terms of hindcast skill.
In this article, we describe the development and performance of an objective forecasting system based on a calibrated multi-model ensemble (CMME) for the seasonal prediction of BSMR, and compare its performance with that of uncalibrated GCMs. The paper is organised as follows: in Section 2, we briefly describe the data used in this study, including NMME GCMs and the observational reference; in Section 3, we explain the procedures of the proposed canonical correlation analysis (CCA)-based calibration methods and illustrate how the methods are employed in practice to make CMME-based forecasts; in Section 4, we examine the performance of calibrated individual model outputs compared to that of uncalibrated outputs, along with the validation of the CMME system; in Section 5, we provide a brief discussion and draw conclusions.

| Observational reference
Developed by Columbia University's International Research Institute for Climate and Society (IRI) and BMD, the latest version of the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset (Acharya, Faniriantsoa, et al., 2020) is used in this study. The ENACTS-BMD dataset is a high-resolution (0.05° × 0.05°) daily gridded rainfall and temperature dataset constructed by blending data from BMD weather stations, satellite products (for rainfall) and reanalysis data (for temperature). Since February 2020, BMD has archived and maintained this dataset. Its record begins in January 1981 and is ongoing (updated every month in real time) at daily, decadal and monthly temporal resolutions. To construct the gridded rainfall, BMD data from almost all of the country's weather stations (54) are merged with rainfall estimates from the Climate Hazards Group InfraRed Precipitation (CHIRP; Funk et al., 2015). Compared with other available gridded precipitation products, ENACTS-BMD performs better in terms of monsoon total rainfall (Montes, Acharya, & Hassan, 2021). In this study, seasonal total rainfall for the period June through September (JJAS) is accumulated from daily data for the years 1982 to 2018. Figure 1 presents the climatology, interannual standard deviation and first empirical orthogonal function (which explains 44% of the total variance) of total JJAS rainfall from the ENACTS-BMD product during the study period.

| GCM hindcasts and forecasts
Hindcasts and forecasts from seven GCMs belonging to the NMME project phase 2 (Kirtman et al., 2014) were used in this study (details of each model can be found in the corresponding reference in Table 1). The NMME project coordinates intra-seasonal to interannual climate predictions from climate-modelling centres in the United States and Environment Canada. The NMME products provide opportunities to characterise and quantify the uncertainty associated with model structure and initial conditions using a large number of contributing models, each consisting of several ensemble members. The lead-1 (initial conditions of May for forecasting JJAS total rainfall) hindcasts (spanning 1982-2010) and forecasts (spanning 2011-2018) of seasonal total rainfall for the JJAS season from these seven GCMs were used. As statistical post-processing, especially CCA, requires a longer training sample, we combined the hindcast (29 years) and forecast (8 years) runs from these models into a single 37-year record, under the assumption that the hindcasts and forecasts are consistent with each other. The models have different numbers of ensemble members, which were averaged to generate an ensemble mean for each model on a common 1° resolution spatial grid. These NMME monthly hindcast and forecast datasets were obtained from the data library of Columbia University's International Research Institute (http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/).

| METHODOLOGY
As described in Section 1, we used a calibrated multi-model ensemble (CMME) approach. This approach involves calibrating individual GCMs using canonical correlation analysis (CCA)-based regression and assessing their skill against raw GCM outputs. The calibrated GCMs are averaged (equal weighting) to make a final CMME time series. The CMME-based forecast is subsequently compared against observations to assess its performance in relation to the uncalibrated forecasts. The processing chain is summarised in the flow chart presented in Figure 2.

| CCA-based calibration
CCA is widely used for the calibration of forecasts from GCMs, for which the spatio-temporal patterns of GCM rainfall are projected onto the observed patterns (Barnston & Tippett, 2017; Tippett et al., 2007, 2008). CCA is essentially a multivariate linear regression method that identifies a sequence of pairs of patterns in two multivariate data sets and then constructs a set of transformed variables by projecting the original data onto these patterns. Correlations between the pairs of canonical variates, which are the transformed variables generated from truncated empirical orthogonal functions (EOF) or principal components (PC) of anomalies of predictor and predictand data, are called canonical correlations. Linear regression between predictand-predictor canonical variates is used for the forecast. Finally, the predicted values are recovered by EOF synthesis and reconstructed from the predictand means and standard deviations. More details of the CCA method can be found in Wilks (2020).
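The chain of operations described above can be written compactly. The block below is a sketch of one common formulation (CCA applied to pre-filtered, unit-variance PCs), not necessarily the exact variant implemented in the operational system. All symbols are introduced here for illustration only: n is the number of training years, k the number of retained PCs, A_x and A_y the n × k matrices of unit-variance PC time series of the GCM and observed anomalies, E_y the matrix of predictand EOFs, and Λ_y the diagonal matrix of predictand PC variances.

```latex
% Assumed notation (not from the paper): A_x, A_y are n x k matrices of
% unit-variance, mutually uncorrelated PC time series, so that their
% within-field covariance matrices are identity matrices.
\begin{align}
  C_{xy} &= \tfrac{1}{n-1}\, A_x^{\top} A_y = P\, R\, Q^{\top},
    \qquad R = \operatorname{diag}(\rho_1, \dots, \rho_k)
    && \text{(singular value decomposition)} \\
  U &= A_x P, \qquad V = A_y Q
    && \text{(canonical variates, } \operatorname{corr}(u_i, v_i) = \rho_i \text{)} \\
  \hat{V} &= U R
    && \text{(regression of each } v_i \text{ on } u_i \text{)} \\
  \hat{A}_y &= \hat{V} Q^{\top}, \qquad
  \hat{Y} = \hat{A}_y\, \Lambda_y^{1/2} E_y^{\top}
    && \text{(EOF synthesis of predicted anomalies)}
\end{align}
```

Because the canonical variates have unit variance in this formulation, the regression coefficient of each predictand variate on its paired predictor variate equals the canonical correlation ρ_i. The predicted anomalies are finally rescaled by the predictand standard deviations and added to the predictand means.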
The CCA-based calibration was carried out separately for the ensemble mean of each GCM prior to producing the multi-model ensemble. The full procedure consists of the following sequential steps:
• At the outset, observed rainfall was transformed to Gaussian by fitting a Gamma distribution. From estimates of the shape and scale parameters, the mean and variance of the corresponding Gaussian distribution are given in closed form.
• As pre-orthogonalisation, CCA requires truncation of the EOF or PC expansions of the GCM (the predictor) and of the corresponding observations (the predictand). To avoid overfitting due to the small sample size available to train the CCA, we pre-selected five PCs for both the GCM and the observations. The total variance explained by five PCs is 92% for the observations and, on average across the different GCMs, 85% for the GCMs.
• In CCA, the predictor domain is usually designed to be larger than the predictand domain, so that relevant features outside of the predictand domain can be used for better model calibration (Barnston & Tippett, 2017). Therefore, the spatial domain for the GCM predictor fields was taken to be 15°N-35°N, 80°E-100°E, and all the ENACTS-BMD grid points within Bangladesh were considered as our predictand (Figure 3).
• The CCA model was trained using leave-5-out cross-validation over the 37 years of data (1982-2018), in which 5 consecutive years are withheld from both the pre-EOF and the CCA training samples of the GCM and the observations, and the middle year of the 5 is predicted. The years withheld progress from the earliest 5 to the latest 5, in which the first and the last years are also predicted so that each year has a cross-validated prediction.
• Finally, the cross-validated series for the predictand variable is generated for 37 years and then validated against the observed rainfall data using the skill scores described in Section 3.3.
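The leave-5-out scheme in the steps above can be sketched as follows. This is a minimal illustration under one reading of the description: the withheld window is centred on the target year and is clamped at the two ends of the record, so that the earliest and latest years are predicted from the earliest and latest 5-year windows. The function name `leave_k_out_windows` is hypothetical, and operational software may treat the edges differently.

```python
def leave_k_out_windows(years, k=5):
    """Return one (target, withheld, training) triple per year.

    Assumption: the k-year withheld window is centred on the target year
    and clamped so that it never runs past either end of the record.
    """
    n = len(years)
    half = k // 2
    splits = []
    for i, target in enumerate(years):
        # Clamp the window start so the window stays inside the record.
        start = min(max(i - half, 0), n - k)
        withheld = years[start:start + k]
        training = years[:start] + years[start + k:]
        splits.append((target, withheld, training))
    return splits


years = list(range(1982, 2019))  # the 37-year record, 1982-2018
splits = leave_k_out_windows(years)

target, withheld, training = splits[18]
print(target, withheld, len(training))
```

Under this reading every year receives exactly one cross-validated prediction, and each CCA model (including its EOF pre-filtering) is fitted on the 32 years that remain after the window is withheld.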

| Calibrated multi-model ensemble
Previous works have shown that the use of multi-model ensemble (MME) approaches improves forecast skill relative to individual GCMs (Acharya, Kar, et al., 2011; Acharya et al., 2014; Kar et al., 2012; Krishnamurti et al., 2009). In general, an MME can be generated by combining ensemble members with equal weights or with weights based on their prior performance (Acharya, Kar, et al., 2011; Kar et al., 2012; Wang et al., 2019; Weigel et al., 2008; Weigel et al., 2010). Studies have shown that, for an MME based on calibrated GCMs, performance-based weighting does not bring significant differences compared to equal weighting (Wang et al., 2019; Weigel et al., 2008). In this work, equally weighted calibrated GCMs were used to generate the MME following Acharya et al. (2021).
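Equal weighting amounts to a plain average of the calibrated forecasts across models, year by year (and, in practice, grid point by grid point). The sketch below illustrates this for a single location; the function name, the model labels and the rainfall values are all made up for illustration.

```python
def equal_weight_mme(model_forecasts):
    """Average per-model forecast series with equal weights.

    model_forecasts maps a model label to a list of seasonal totals (mm),
    one value per year; all series must cover the same years.
    """
    names = sorted(model_forecasts)
    n_years = len(model_forecasts[names[0]])
    if any(len(model_forecasts[m]) != n_years for m in names):
        raise ValueError("all models must cover the same years")
    return [
        sum(model_forecasts[m][t] for m in names) / len(names)
        for t in range(n_years)
    ]


# Hypothetical calibrated JJAS totals (mm) from three models for two years.
calibrated = {
    "model_a": [1500.0, 1700.0],
    "model_b": [1600.0, 1500.0],
    "model_c": [1700.0, 1600.0],
}
cmme = equal_weight_mme(calibrated)
print(cmme)
```

A performance-weighted variant would replace the division by the number of models with a weighted sum whose weights add up to one.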

| Verification metrics
To examine the skill of the uncalibrated GCM, calibrated GCM and MME forecasts, two commonly used forecast verification metrics were computed: the root mean square error (RMSE), which is the square root of the average squared difference between the forecast and observation pairs, and the Spearman rank correlation coefficient, which is Pearson's product-moment correlation computed on the ranked values of each variable. Spearman's rank correlation assumes only a monotonic relationship between the two variables, which is less restrictive than the linear relationship assumed by Pearson's correlation. We also employed the 'generalised discrimination score', also known as the 'two-alternative forced-choice score' (2AFC score; Mason & Weigel, 2009). The 2AFC score measures the proportion of all available pairs of observations of differing category whose forecasts are discriminated in the correct direction. The score ranges between 0% and 100%, and any value higher than 50% implies that the forecast is able to discriminate better than random guessing. These verification measures are used in this study as they are recommended for skill assessment by the WMO standardised verification system for long-range forecasts (WMO, 2018).

| Skill of uncalibrated GCM predictions
Before assessing the skill of the CMME-based prediction, the performance of each individual GCM was analysed.
A Taylor diagram (Taylor, 2001) summarising the country-averaged performance of total JJAS rainfall predicted by each GCM is displayed in Figure 4a. None of the models performs well in terms of correlation with the observations, which varies between −0.3 and close to zero. The GCMs largely underestimate the observed standard deviation, with simulated values ranging from around 50 to 100 mm and root mean square differences between 230 and 300 mm. In general, these models performed poorly in reproducing the observed variability in JJAS rainfall over Bangladesh. This performance is in agreement with a recent study by Kelley et al. (2020), where the skill of NMME models was examined in the context of sub-seasonal metrics prediction, and which described low-to-modest skill in predicting seasonal rainfall in Bangladesh. These differences may be related to the models' coarser spatial resolution (1° × 1° grid) compared to the higher-resolution observed data (0.05° × 0.05° grid). Although large-scale anomalies can be predicted at such coarse resolution, details of rainfall heterogeneity over Bangladesh could not be resolved, which suggests that the downscaling of GCM outputs can be highly important.
One hypothesis for this poor performance, previously described as a driver of bias in GCM forecasts, is the oversensitivity of GCMs to El Niño-Southern Oscillation (ENSO)-rainfall teleconnections (Acharya, Kar, et al., 2011; Pillai et al., 2018; Singh et al., 2019). To investigate this possibility, Pearson's correlation coefficients between area-averaged seasonal total rainfall over Bangladesh and global sea surface temperature (SST) have been computed for observed and predicted rainfall and SSTs in each model (Figure 5). In observations, the ENSO-rainfall teleconnection is found to be positive, although it is not statistically significant. Rahman et al. (2013b) found a similar positive ENSO-rainfall teleconnection using observations from 1985 to 2008. In contrast, the ENSO-rainfall teleconnections in most of the GCMs indicate a strongly negative relationship, showing that the GCMs are unable to reproduce the observed teleconnections satisfactorily and even produce the opposite sign. Although the CCSM4 model shows a teleconnection pattern of the same (positive) sign as observed, its correlation is strongly positive and statistically significant, unlike the observed one. Previous studies evaluating NMME models for the Indian monsoon also found that the ENSO-rainfall teleconnections in the GCMs are stronger than in observations, which is a potential reason for the GCMs' poor performance in simulating monsoon rainfall (Pillai et al., 2018; Singh et al., 2019). Additionally, studies have also shown that the seasonal prediction of the northeastern Indian region, including Bangladesh, is very challenging due to its positive (out-of-phase) ENSO teleconnection, whereas the major part of the Indian subcontinent has a negative relationship with ENSO (Choudhury et al., 2019; Saha et al., 2019). However, most of the GCMs cannot distinguish this out-of-phase relationship and instead show a negative teleconnection with ENSO for the monsoon over the entire Indian subcontinent.
Other hypotheses for the poor predictability by GCMs can be drawn from potential predictability (PP) analysis. Although there is a myriad of possible ways to estimate PP, we consider the signal-to-noise ratio (SNR) to evaluate the predictive power of the models, where the individual ensemble members from each of the models are taken into consideration (Figure 6). The SNR has been used in several studies to quantify the predictive power of GCMs for the Indian summer monsoon season (Attada et al., 2022; Kang et al., 2004; Nair et al., 2013; Singh et al., 2012). The SNR is defined as the ratio of external to internal variability, where the external component is obtained as the variance of the ensemble mean and the internal component is evaluated as the variance of the noise (deviation of members from the ensemble mean). This implies that the larger the SNR, the better the predictive power. It can be noticed from Figure 6 that, except for NASA-GEOSS2S, most of the GCMs (CanSIPSv2, GFDL-CM2p5-FLOR-A06, GFDL-CM2p5-FLOR-B01, COLA-RSMAS-CCSM4, and NCEP-CFSv2) have an SNR within the 0-0.2 range, which represents very weak predictability (external variance is 0%-4%). These low SNR values explain the predictability limit for each GCM. This inability underscores the importance of calibration methods to partially or wholly remove systematic biases before computing a multi-model ensemble-based forecast. As described in Section 3.1, CCA-based calibration is useful in this regard as it projects the GCM rainfall onto the observed spatio-temporal patterns.
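The SNR definition given above can be made concrete with a short sketch for a single grid point. It follows a variance-ratio reading of that definition; the pooled degrees of freedom used for the noise variance and the function name `signal_to_noise` are assumptions made for illustration, and the input values are made up.

```python
def signal_to_noise(ensemble):
    """Ratio of external (signal) to internal (noise) variance.

    ensemble[t][m] is the seasonal rainfall of member m in year t.
    Signal: variance across years of the ensemble mean.
    Noise: pooled variance of member deviations from each year's mean
    (assumed normalisation: one degree of freedom lost per year).
    """
    n_years = len(ensemble)
    yearly_means = [sum(members) / len(members) for members in ensemble]
    grand_mean = sum(yearly_means) / n_years
    signal = sum((m - grand_mean) ** 2 for m in yearly_means) / (n_years - 1)
    deviations = [
        x - mean for members, mean in zip(ensemble, yearly_means) for x in members
    ]
    noise = sum(d * d for d in deviations) / (len(deviations) - n_years)
    return signal / noise


# Hypothetical three-year, two-member ensemble (arbitrary units).
toy = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]]
snr = signal_to_noise(toy)
print(snr)
```

In this toy case the ensemble mean varies across years with a variance of 1, while the members scatter around each yearly mean with a pooled variance of 2, so the ratio is 0.5.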

| Skill of calibrated GCMs predictions
The Taylor diagram of the calibrated GCMs in Figure 4b shows that, after calibration, the root mean square differences range from 200 to 230 mm, representing an improvement over the uncalibrated GCMs. The correlation also improved after calibration. For instance, the strongly negative correlations between observations and models such as NASA-GEOSS2S and GFDL-CM2p1-aer04 become positive after calibration. To examine the performance of the CCA-based calibration method at grid-point scale, RMSE, Spearman's correlation coefficients and 2AFC scores are computed before and after calibration for each NMME model. For the uncalibrated models, we interpolated the GCMs to the ENACTS-BMD resolution for a fair comparison, as CCA produces outputs at the same resolution as ENACTS-BMD. As similar results are found for all NMME models, we selected the GFDL-CM2p5-aer04, NASA-GEOSS2S and NCEP-CFSv2 models for illustrative purposes. The north-eastern and south-eastern portions of Bangladesh, which correspond to the rainiest areas of the country, exhibit the highest RMSE (Figure 7). Notably, the calibration reduces the RMSE, with values below 200 mm over most of the country, except for the rainier regions, where RMSE is around 300 mm for most models. Calibrated models show higher skill in terms of correlation over most of the country (Figure 8). The correlation coefficients of GFDL-CM2p5-aer04 and NASA-GEOSS2S before calibration are mostly negative but, in general, improved after calibration, except in south-eastern Bangladesh. Over southern and eastern parts of the country, NCEP-CFSv2 correlations turn from negative to positive. Also, positive correlations in the north are similar before and after calibration. For all models, the 2AFC score also improved: areas where 2AFC was less than 50% for uncalibrated model outputs became higher than 50% after calibration (Figure 9). Moreover, the spatial pattern of improvement is similar for 2AFC scores and Spearman's correlation coefficients.
In general, the CCA-based calibration improves the forecast skill of the uncalibrated models. When RMSE is used as the verification metric, CCA calibration appears to improve the forecasting skill strongly. However, the correlation and 2AFC score do not consistently improve in every case, especially where the models show poor skill in the uncalibrated version, as in the case of NASA-GEOSS2S, which can be explained by the limited sample available to train the CCA.

| Skill of calibrated multi-model ensemble
To assess the performance of the CMME, its skill is compared with that of the uncalibrated MME (UMME), obtained by averaging the uncalibrated individual models, and presented in Figure 10. The skill of the UMME can be used as a benchmark. In general, CMME outperformed UMME in all skill scores. The RMSE is much lower in CMME, especially in north and south-eastern Bangladesh. Considering Spearman's correlation coefficient, UMME shows positive values only over a small area in the northern and drier areas of Bangladesh, whereas CMME shows widespread positive values except over a small area in the more mountainous south-eastern part of the country, where the correlations are close to zero or slightly negative. In addition, CMME Spearman's correlation coefficients are higher compared to most calibrated individual models. In terms of the 2AFC score, Figure 10c shows that values higher than 50% are dominant in CMME, except for the same region over the southeast. These results suggest an overall improvement of skill in BSMR prediction when CMME is used; however, high within-country differences are also observed, which can be associated with the complex local-scale precipitation mechanisms and the high spatial variability in climatological rainfall in Bangladesh.

| CONCLUDING REMARKS
This study aimed to develop an improved seasonal forecast system based on a calibrated multi-model ensemble for the prediction of BSMR. For this purpose, we developed a hybrid dynamical-statistical technique using state-of-the-art GCMs from the NMME project. The individual GCMs' seasonal predictions were calibrated using a CCA approach to correct large systematic biases. These calibrated individual model predictions were then combined with equal weighting to obtain the final CMME forecast. Although similar multi-model prediction approaches have been used extensively, to the best of our knowledge, this is the first time that one has been used to produce seasonal forecasts of the BSMR. Since October 2019, this CMME-based forecast has been prepared each month in real time by the BMD for the next season. Therefore, from an operational perspective, the potential benefits of such a forecasting system need to be illustrated and documented in terms of the gain in quality of forecasts in real time. Although this study only focuses on the skill of this forecast system for the summer monsoon season as the primary period of precipitation in Bangladesh, additional research should also document the predictability of pre- and post-monsoon precipitation, as well as the applicability of our predictions for practical climate services in Bangladesh.
In conclusion, we found that although GCMs provide a solid non-linear alternative to statistical modelling for predicting the BSMR, the calibration of models is necessary to generate operational forecasts given the strong model biases over Bangladesh. The biased performance of GCMs may be partly related to the models' coarse spatial resolution, their over-sensitivity to SST-rainfall teleconnections and their low signal-to-noise ratio, which explains the predictability limit. Our results strongly indicate that CCA-based calibration can generate significant improvements that reduce the magnitude of systematic errors (RMSE) compared to individual uncalibrated models. Calibration also appears to improve Spearman's correlation coefficients and 2AFC scores over most of Bangladesh, except for a few locations in the north-east and south-east of the country. Our analysis also demonstrates that the skill of the CMME is much better than that of the UMME and of the individual calibrated models, especially in the northern part of the country. However, due to the limited sample available to train the CCA (32 years, using leave-5-out cross-validation in 37 years of hindcast data), there is further room for skill improvement, which will be the subject of future research and will require a larger sample to achieve increased robustness.
AUTHOR CONTRIBUTIONS
Nachiketa Acharya: Conceptualization; data curation; formal analysis; investigation; methodology; validation; writing - original draft; writing - review and editing. Carlo Montes: Visualization; writing - review and editing. S. M. Q. Hassan: Writing - review and editing. Razia Sultana: Writing - review and editing. Md. Bazlur Rashid: Writing - review and editing. Md. Abdul Mannan: Writing - review and editing. Timothy J. Krupnik: Funding acquisition; writing - review and editing.

Foundation (BMGF) under the third phase of the Cereal Systems Initiative for South Asia (https://csisa.org), and the One CGIAR Regional Integrative initiative Transforming Agrifood Systems in South Asia (TAFSSA; https://www.cgiar.org/initiative/transforming-agrifood-systems-insouth-asia-tafssa/). The results of this research do not necessarily reflect the views of BMGF, CCAFS, USAID or the United States Government. We acknowledge the help of CPC, IRI and NCAR personnel in creating, updating, and maintaining the NMME archive. We are grateful to the anonymous reviewers for their insightful comments and suggestions that helped to improve the original version of the manuscript.

ORCID
Nachiketa Acharya https://orcid.org/0000-0003-3010-2158
Md. Bazlur Rashid https://orcid.org/0000-0003-1789-6379

TABLE 1 List of the seven NMME models used, the responsible institutions, number of ensemble members and reference.
Model | Institution | Ensemble members | Reference
COLA-RSMAS-CCSM4 | The Center for Ocean-Land-Atmosphere Studies | 10 | Kirtman et al. (2014)
NASA-GEOS-S2S-2 | National Aeronautics and Space Administration (NASA), Goddard Space Flight Center | 4 (10 a) | Borovikov et al. (2017)
GFDL-CM2p1-aer04 | Geophysical Fluid Dynamics Laboratory | 10 | Delworth et al. (2006), Zhang et al. (2007)
GFDL-CM2p5-FLOR-A06 | Geophysical Fluid Dynamics Laboratory | 12 | Vecchi et al. (2014)
GFDL-CM2p5-FLOR-B01 | Geophysical Fluid Dynamics Laboratory | 12 | Vecchi et al. (2014)
NCEP-CFSv2 | NOAA's Centers for Environmental Prediction | 24 (28 a) | Saha et al. (2019)
Note: The lead-1 (initial conditions of May for forecasting JJAS total rainfall) hindcasts (spanning 1982-2010) and forecasts (spanning 2011-2018) of seasonal total rainfall for the JJAS season from these seven GCMs were used.
a The value in parentheses shows the ensemble size for the forecast period.

FIGURE 1 Maps of (a) climatology (mean), (b) interannual variability (standard deviation) and (c) the first empirical orthogonal function (EOF1) of total June-September rainfall from the ENACTS-BMD product.

FIGURE 2 Flow chart illustrating the steps of generation of seasonal forecasts using the calibrated multi-model ensemble approach.

FIGURE 3 Topographic map showing the spatial domain used as predictor for general circulation models (black square), and Bangladesh (predictand area). Elevation data obtained from the Shuttle Radar Topography Mission SRTM90 digital elevation model.

FIGURE 4 Taylor diagram for prediction skill of country-averaged time series of total June-September rainfall of (a) uncalibrated and (b) calibrated GCMs: CanSIPSv2 (M1), COLA-RSMAS-CCSM4 (M2), GFDL-CM2p5-FLOR-A06 (M3), GFDL-CM2p5-FLOR-B01 (M4), GFDL-CM2p1-aer04 (M5), NASA-GEOSS2S (M6), and NCEP-CFSv2 (M7). Blue dashed line represents the standard deviation, and black dashed line the root mean square difference (RMSD). The red line represents the standard deviation of the country-averaged time series of observed total precipitation.

FIGURE 7 Maps of root mean square error for uncalibrated (left column) and calibrated (right column) total June-September rainfall (1982-2018) for (a) GFDL-CM2p5-aer04, (b) NASA-GEOSS2S, and (c) NCEP-CFSv2. Colour bar in mm/season.

FIGURE 8 Maps of Spearman correlation for uncalibrated (left column) and calibrated (right column) total June-September rainfall (1982-2018) for (a) GFDL-CM2p5-aer04, (b) NASA-GEOSS2S, and (c) NCEP-CFSv2. Dashed areas denote statistically significant correlations.

FIGURE 9 Maps of generalised discrimination score (2AFC) for uncalibrated (left column) and calibrated (right column) total June-September rainfall (1982-2018) for (a) GFDL-CM2p5-aer04, (b) NASA-GEOSS2S, and (c) NCEP-CFSv2.

FIGURE 10 Maps of (a) root mean square error, (b) Spearman's correlation coefficient, and (c) generalised discrimination score (2AFC) of total June-September rainfall (1982-2018) for uncalibrated (left column) and calibrated (right column) multi-model ensemble predictions. Dashed areas in (b) and (c) denote statistically significant correlations.