PREDICTION OF pH AND TOTAL SOLUBLE SOLIDS CONTENT OF MANGO USING BIRESPONSE MULTIPREDICTOR LOCAL POLYNOMIAL NONPARAMETRIC REGRESSION

Mango's internal quality can be determined based on its acidity and sweetness in the form of pH and total soluble solids (TSS) content. Research on fruit internal quality prediction based on near-infrared spectroscopy generally uses parametric regression modeling such as linear and partial least squares regression. The study proposed biresponse multipredictor local polynomial nonparametric regression to determine mango's internal quality. The study aims to apply the theory of biresponse multipredictor local polynomial nonparametric regression for predicting the mango's internal quality in the form of pH and TSS value. We created R code for estimating nonparametric


INTRODUCTION
Generally, the internal quality of fruit can be evaluated using two approaches: destructive and non-destructive. Near-infrared spectroscopy (NIRS) has become one of the most promising and widely utilized non-destructive methods of investigation in many fields, including agriculture [1]. It has many advantages, such as rapid measurement, simple sample preparation, and environmental friendliness, because no chemicals are employed and minimal waste is produced [2]. Significantly, it can evaluate many quality attributes simultaneously [3]. One of the uses of NIRS in agriculture is to predict the internal quality of fruit and vegetables.
Mango is a climacteric fruit that is usually consumed when it is mature. Because of its appearance, taste, flavour, and overall nutritional benefits, mango (Mangifera indica L.) is one of the most significant and popular tropical fruits commercialized globally [4]. At maturity, most mango varieties are green and have low firmness, but external color and firmness alone are insufficient to indicate maturity [5]. Mature mangoes have low acidity and a high degree of sweetness. The pH value represents the acidity level of the fruit, while the total soluble solids (TSS) content represents the level of fruit sweetness. These two variables are indicators of the internal quality of mango. According to recent studies, pH and TSS content are valid parameters for mango maturity prediction. The pH and TSS values of fruit can be validly evaluated using destructive analysis in the laboratory, but this requires complex sample preparation, is time-consuming and laborious, and produces waste. Therefore, it is better to predict pH and TSS values non-destructively based on NIRS.

Supply chain stakeholders need to manage fruit ripening to give customers fruit of the best quality [6]. Over the last few decades, NIRS has been commonly applied as a reliable and rapid non-destructive method for predicting fruit quality [7]. Regression analysis is a statistical tool for examining the functional relationship between response and predictor variables, and it is used in the NIRS-based prediction of fruit internal quality. There are two approaches to regression analysis: parametric regression and nonparametric regression. The majority of previous mango quality prediction research focused on parametric regression techniques such as simple linear regression [8][9], multiple linear regression [9][10][11][12], partial least squares regression (PLSR) [10][11][12][13], principal component regression [10][11] and Kernel PLSR [14][15]. Ulya et al. [16][17] performed a nonparametric regression prediction of mango fruit quality based on a uniresponse multipredictor local polynomial estimator.
Among the estimators used in the nonparametric regression approach, the local polynomial estimator has superior smoothing capabilities for analyzing biresponse data. Nonparametric regression can also be applied to biresponse and multipredictor problems. Biresponse local polynomial nonparametric regression has been applied in various areas, including estimating the growth curve of children up to two years of age using a biresponse local linear estimator [38] and modeling children's weight in East Java [39]. Biresponse multipredictor cases include predicting the internal quality of mango based on NIRS absorbance spectral data [40].
The theory of estimating a nonparametric regression model based on a biresponse multipredictor local polynomial estimator has been developed by Ulya et al. [40]. Their results show that, theoretically, a nonparametric regression approach based on a biresponse multipredictor local polynomial estimator can predict mangoes' internal quality. However, the method has only been applied in simulation studies using generated data; it has not been implemented on actual cases. Therefore, further research is needed to implement the estimation theory of a nonparametric regression model based on a biresponse multipredictor local polynomial estimator to predict mangoes' pH and TSS values.
The study aims to apply the theory of biresponse multipredictor nonparametric regression estimation based on local polynomial estimators to predict the pH and TSS values of mango. It then develops an algorithm for the estimation process and runs R code to estimate the predicted values of pH and TSS content. The study reveals that nonparametric regression based on the biresponse multipredictor local polynomial estimator is highly accurate in predicting the pH and TSS values of mangoes. It is expected that the algorithm and R code developed in this study will serve as the basis for designing an instrument to detect mango maturity.

MATERIALS AND METHODS
Estimating local polynomial nonparametric regression model for biresponse multipredictor cases has been studied by [40]. The model can be used to predict the mango's internal quality, such as pH and total soluble solids that represent the acidity and sweetness of mango. The local polynomial nonparametric regression model for biresponse multipredictor is presented in the following Theorem and Lemma [40].
Theorem. Given the paired data $(x_{1i}, x_{2i}, \ldots, x_{pi}, y_i^{(r)})$, $i = 1, 2, \ldots, n$, where $n$ is the number of observations, $p$ is the number of predictor variables, and $r$ indexes the response variables. These data follow the biresponse multipredictor nonparametric regression model given in [40], whose regression function is of unknown shape.
The study used experimental data from 186 mango samples measured using NIRS to obtain spectral data and validated by destructive pH and TSS analysis in the laboratory. The mango variety in this study was Gadung Klonal 21, harvested from Sukorejo District, Pasuruan Regency, Indonesia. The predictor variables were the NIRS absorbance spectral data in the wavelength range of 900-1650 nm. The response variables were the acidity and sweetness of mango in the form of pH and TSS content.
MILLATUL ULYA, NUR CHAMIDAH, TOHA SAIFUDIN
The study consists of several stages, as follows: (1) collecting the data experimentally using an NIR spectrometer, then validating the pH and TSS values using destructive analysis in the laboratory; (2) pre-processing the spectral data to minimize unwanted effects; (3) calculating the predictive performance criteria, namely the mean absolute percentage error (MAPE) and the mean squared error (MSE):
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$
where $n$ is the sample size, $\hat{y}_i$ is the predicted value of the response variable for the $i$-th observation, and $y_i$ is the measured value of the response variable. The criteria are computed for the training data (tr) and the testing data (ts).
The best prediction model has the lowest MAPE. Moreno et al. [41] classified the interpretation of MAPE values into four classes: a MAPE below 10% is a highly accurate prediction, 10-20% is an accurate prediction, 20-50% is a reasonable prediction, and a MAPE above 50% is an inaccurate prediction.
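As an illustration of these criteria, the following is a minimal sketch of the MAPE and MSE calculations and of the accuracy classes attributed to [41]. It is written in Python rather than the R used in the study. The function names, the handling of values that fall exactly on a class boundary, and the sample values are assumptions made for illustration and are not taken from the study.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((y - y_hat) / y) for y, y_hat in zip(measured, predicted))
    return 100.0 * total / len(measured)


def mse(measured, predicted):
    """Mean squared error."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(measured, predicted)) / len(measured)


def classify_mape(value):
    """Accuracy class for a MAPE value given in percent.

    Assumption: a value exactly on a boundary goes to the upper class.
    """
    if value < 10:
        return "highly accurate"
    if value < 20:
        return "accurate"
    if value < 50:
        return "reasonable"
    return "inaccurate"


# Made-up illustrative values, not measurements from the study.
ph_measured = [4.0, 5.0, 4.5, 5.0]
ph_predicted = [4.2, 4.9, 4.5, 5.5]

score = mape(ph_measured, ph_predicted)
print(round(score, 3), classify_mape(score))
```

Applied to the values reported later in the text, `classify_mape(3.729)` and `classify_mape(7.466)` both fall in the first class.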

MAIN RESULTS
This section describes the implementation of nonparametric regression models based on the biresponse and multipredictor local polynomial estimator for predicting the mango's pH and TSS value.

Spectral Data Pre-processing
Before generating the prediction models, pre-treatment can minimize unwanted effects such as random noise, high-frequency noise, light scattering, and other external effects induced by environmental or instrumental variables. Smoothing is effective at reducing high-frequency noise, and Savitzky-Golay (SG) smoothing is one of the most widely used smoothing algorithms in this area [42]. The Unscrambler X 10.4 software was used to perform spectral pre-treatment based on SG smoothing. Figure 1 shows the raw and pre-treated (SG smoothing) spectral data.

Figure 1. Absorbance spectral data: (a) raw spectral data; (b) pre-treated spectral data (SG smoothing)
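The idea behind SG smoothing can be illustrated with a minimal sketch. The study performed this step in The Unscrambler X 10.4, and the window width and polynomial order are not stated in the text. The 5-point quadratic window, the choice to leave the two points at each edge unchanged, and the function name below are therefore assumptions made only for illustration.

```python
# Classical Savitzky-Golay coefficients for a 5-point window and a
# quadratic (or cubic) local polynomial: (-3, 12, 17, 12, -3) / 35.
SG5_QUADRATIC = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth5(spectrum):
    """Smooth one absorbance spectrum with a 5-point quadratic SG filter.

    Assumption: the two points at each end are left unchanged, because a
    full window is not available there.
    """
    smoothed = list(spectrum)
    for i in range(2, len(spectrum) - 2):
        window = spectrum[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(SG5_QUADRATIC, window))
    return smoothed


# Made-up spectrum with a single noisy spike, not data from the study.
raw = [0.0, 0.0, 0.0, 35.0, 0.0, 0.0, 0.0]
print(savgol_smooth5(raw))
```

Because the filter fits a local quadratic, it leaves any exactly quadratic signal unchanged while spreading an isolated spike over its neighbours.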
After pre-treatment of the absorbance spectral data, PCA was used to reduce the dimension of the spectral data using the singular value decomposition algorithm. Two latent variables, which together explain 99.75% of the variation, were selected. This study used Hotelling's $T^2$ ellipse analysis to identify spectral outliers. This analysis detected 21 outliers, which were removed because they had the potential to affect the model. Outliers in a sample can give useful information, but they can also be non-representative samples that contribute to model errors [43].
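The outlier screening on the two latent variables can be sketched as follows. The sketch computes Hotelling's $T^2$ for each sample from its two PCA scores and flags samples above a cutoff. The function names, the sample scores, and the cutoff value are hypothetical; the text does not report the confidence level or the critical value used in the study, so the cutoff is left as a parameter.

```python
def hotelling_t2(scores):
    """Hotelling's T^2 for each sample, given a list of (t1, t2) score pairs."""
    n = len(scores)
    if n < 3:
        raise ValueError("at least three samples are needed")
    m1 = sum(s[0] for s in scores) / n
    m2 = sum(s[1] for s in scores) / n
    # Sample covariance matrix of the two score columns (divisor n - 1).
    c11 = sum((s[0] - m1) ** 2 for s in scores) / (n - 1)
    c22 = sum((s[1] - m2) ** 2 for s in scores) / (n - 1)
    c12 = sum((s[0] - m1) * (s[1] - m2) for s in scores) / (n - 1)
    det = c11 * c22 - c12 * c12
    if det <= 0:
        raise ValueError("score covariance matrix is singular")
    result = []
    for t1, t2 in scores:
        d1, d2 = t1 - m1, t2 - m2
        # Quadratic form d' S^{-1} d using the closed-form 2 x 2 inverse.
        result.append((c22 * d1 * d1 - 2 * c12 * d1 * d2 + c11 * d2 * d2) / det)
    return result


def flag_outliers(scores, cutoff):
    """Indices of samples whose T^2 exceeds the (hypothetical) cutoff."""
    return [i for i, t2 in enumerate(hotelling_t2(scores)) if t2 > cutoff]


# Made-up scores, not values from the study.
example_scores = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (5.0, 5.0)]
print(flag_outliers(example_scores, cutoff=3.0))
```

In practice the cutoff is usually derived from an F distribution at a chosen confidence level, which is what defines the $T^2$ ellipse on the score plot.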
The final sample included 165 observations separated into two parts: 80% training and 20% testing datasets (132 training data samples and 33 testing data samples).
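The partition sizes can be checked with a short sketch. The text does not state how samples were assigned to the two sets, so the random shuffle, the seed, and the function name below are assumptions.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Return (train, test) index lists; the random shuffle is an assumption."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(165)
print(len(train_idx), len(test_idx))
```

With 165 samples and an 80% training fraction, this gives the 132 training and 33 testing samples reported above.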

The Dimension Reduction and Outlier Analysis
Before dimension reduction and outlier analysis, a matrix of size 186 × 114 was created from the pH, TSS, and spectral data. The matrix rows represent the 186 samples, while the 114 columns represent the predictors (absorbance spectral data) and the response variables ($y_1$ and $y_2$). The absorbances at 112 NIR wavelengths from each mango sample were used as predictor

Regression Model for Predicting pH and TSS Values of Mango.
The algorithm in this study follows the previous study [40], with two response variables, the pH and TSS values, and two predictor variables derived from the dimension reduction of the spectral data. The optimal bandwidth is determined by finding the minimum Generalized Cross Validation (GCV) value for each response and predictor.
The algorithm for determining the optimal bandwidth values without a weighting matrix is as follows:
Step 1: Define the response variables $y^{(r)}$, $r = 1, 2$, where the first response variable is the pH value and the second is the TSS value, and the predictor variables $x_j$, $j = 1, 2$.
Step 2: Determine the kernel function. This study uses the Gaussian kernel, $K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$.
Step 3: Define the matrix $X(x_0)$ according to Equation (2).
Step 4: Determine the candidate bandwidth values $h$ partially, for polynomial orders 1 to $d$, on each predictor for each response, from the lower limit $ll$ to the upper limit $ul$, increasing the bandwidth by a fixed increment at each iteration.
Step 5: Determine the diagonal matrix of kernel weights.
Step 6: Calculate the GCV value for each set of bandwidth values according to Equation (7).
Step 7: The set of bandwidth values that produces the minimum GCV value is the optimal bandwidth for the given polynomial order.
In addition to the algorithm above, this study also uses an algorithm for establishing the weighting matrix and an algorithm for model estimation using the weighting matrix. Both algorithms follow [40].
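The bandwidth search can be illustrated with a simplified sketch for a single response and a single predictor with a first-order (local linear) polynomial. The study's estimator is biresponse and multipredictor and was implemented in R, so this Python sketch is not the authors' code. The GCV criterion below is the standard form, mean squared residual divided by $(1 - \mathrm{tr}(H)/n)^2$, where $H$ is the smoother (hat) matrix; it is assumed here and may differ from the paper's Equation (7). All function names and the data are made up for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear_row(x0, xs, h):
    """Weights l_i(x0) such that the local linear fit at x0 is sum_i l_i * y_i."""
    w = [gaussian_kernel((x - x0) / h) for x in xs]
    d = [x - x0 for x in xs]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    denom = s0 * s2 - s1 * s1
    if denom <= 0:
        raise ValueError("bandwidth too small for a local linear fit")
    return [wi * (s2 - di * s1) / denom for wi, di in zip(w, d)]


def gcv(xs, ys, h):
    """Standard GCV score of the local linear smoother with bandwidth h."""
    n = len(xs)
    rss = 0.0
    trace = 0.0
    try:
        for i, x0 in enumerate(xs):
            row = local_linear_row(x0, xs, h)
            fitted = sum(l * y for l, y in zip(row, ys))
            rss += (ys[i] - fitted) ** 2
            trace += row[i]  # diagonal element of the hat matrix
    except ValueError:
        return math.inf
    shrink = 1.0 - trace / n
    if shrink <= 0:
        return math.inf
    return (rss / n) / (shrink * shrink)


def best_bandwidth(xs, ys, lower, upper, step):
    """Grid search from lower to upper in increments of step; minimum GCV wins."""
    best_h, best_score = None, math.inf
    k = 0
    while lower + k * step <= upper + 1e-12:
        h = lower + k * step
        score = gcv(xs, ys, h)
        if score < best_score:
            best_h, best_score = h, score
        k += 1
    return best_h, best_score


# Made-up data: a smooth curve plus a small alternating disturbance.
xs = [i / 10 for i in range(21)]
ys = [math.sin(3 * x) + 0.05 * (-1) ** i for i, x in enumerate(xs)]
print(best_bandwidth(xs, ys, lower=0.1, upper=1.0, step=0.1))
```

In the biresponse multipredictor setting, the same search is repeated for each predictor within each response, which is why the text describes the bandwidths as being determined partially.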

Descriptive Statistics of the pH and TSS Values
Before performing the nonparametric regression analysis, the data were analyzed descriptively. Table 1 presents descriptive statistics of the pH and TSS values of the 186 mango samples, while Table 2 shows descriptive statistics of the pH and TSS values of the 165 retained samples and of the two predictor variables. The two predictor variables (x1 and x2) are the new variables that resulted from the dimension reduction of the absorbance spectral data at 112 wavelengths ($P_1, P_2, \ldots, P_{112}$).
The next stage was to make scatter plots of the pH value against x1 and x2 and of the TSS value against x1 and x2. A scatter plot can indicate whether there is a specific trend in the relationship between the response and predictor variables, as well as indications of heteroscedasticity.
Figure 2 shows the scatter plots of the pH value against x1 and x2. The plot against x1 shows a linear trend with a steep slope (Fig. 2a), whereas the plot against x2 shows a gentle slope with more widely spread data (Fig. 2b). This is reinforced by the results of the linearity test, in which the significance values for deviation from linearity were 0.591 (pH vs. x1) and 0.068 (pH vs. x2). Both values exceed the conventional 0.05 level, which means that there is a linear relationship between pH and both x1 and x2.

Determination of the Optimal Bandwidth
The next step is to determine the optimal bandwidth values without the weighting matrix on the training data. The bandwidth is determined partially for each response and each predictor. Table 3 shows the optimal bandwidth determination based on the minimum GCV. Based on Table 3, the first-order polynomial gives the minimum GCV on each predictor for each response and is therefore the best polynomial order. The linearity test also indicates a linear relationship between the response variables and the predictors, especially x1.
The nonparametric regression modeling therefore uses a first-order polynomial, which is expected to produce a regression model that fits the observed data. The best combination of bandwidths is shown in Table 4.
The bandwidths in Table 4 are used to estimate the predicted values of pH and TSS, which yields the error values of the first and second responses; these are used to construct the weighting matrix. The next stage estimates the biresponse multipredictor nonparametric regression model based on a local polynomial estimator using the weighting matrix. A summary of the estimation results for the best combination is given in Table 5, which shows the MAPE and MSE values for the training, testing, and overall data. The observed values and the estimates of the two responses on the training data are plotted in Figures 8 and 9, and those for the testing data in Figures 10 and 11.
Figures 8 and 9 show that the predicted values are close to the observed values. In the training data, 89% of the samples have an absolute percentage error of less than 10%, which indicates that the fitted model describes the sample data very well. Estimating the pH and TSS values using biresponse multipredictor nonparametric regression based on the local polynomial estimator on the training data yields a MAPE of 3.729%. This value is below 10% and therefore falls in the category of highly accurate prediction [41].
Figures 10 and 11 show that the predicted values are quite close to the observed values on the testing data.
The results of estimating the pH and TSS values using biresponse multipredictor nonparametric regression based on the local polynomial estimator on testing data yield a MAPE value of 7.466%. The nonparametric regression model is highly accurate in predicting mango's pH and TSS value on testing data with less than 10% MAPE value [41].
The MAPE value for the overall data in this study is 4.473%. The study shows that the biresponse multipredictor local polynomial nonparametric regression model is highly accurate for predicting the mango's pH and TSS values. This is in line with [16][17], which found that local polynomial nonparametric regression is highly accurate for predicting the sweetness and acidity of mangoes. In fact, the MAPE values of local polynomial nonparametric regression are lower than those of parametric regression in predicting mango's pH and TSS values [16][17].

CONCLUSIONS
The estimation of the biresponse multipredictor local polynomial nonparametric regression model can be applied to predict mango's internal quality. The study shows that biresponse multipredictor local polynomial nonparametric regression is highly accurate in predicting the pH and TSS of mango, with a MAPE value of 4.473%. The algorithm and R code developed in this study are expected to be useful for designing an instrument to detect mango maturity.