APPLICATION OF LASSO FOR IDENTIFICATION OF FUNCTIONAL GROUPS WITH SIGNIFICANT CONTRIBUTIONS TO ANTIOXIDANT ACTIVITIES OF CENTELLA ASIATICA

High-dimensional data have more variables than observations (p>>n). In this case, modeling with regression analysis becomes ineffective because multicollinearity arises among the explanatory variables, violating the assumption of no multicollinearity. The least absolute shrinkage and selection operator (LASSO) can handle high-dimensional data and multicollinearity because it shrinks the regression coefficients, setting those of variables with minor effects to zero and thereby selecting the variables with significant effects. In applications, several variables often share the same characteristics. Shrinking and selecting variables in groups addresses this, so the group LASSO can be used as a solution. This study used data on the antioxidant activity of C. asiatica, a plant that contains antioxidants. Important information about antioxidants can be obtained with a spectroscopic technique, namely Fourier transform infrared (FTIR) spectrophotometry. FTIR is a spectroscopic technique based on the vibrations of molecules exposed to infrared radiation, so it can characterize molecules by their functional groups. FTIR data are high-dimensional and exhibit multicollinearity. This study has 1866 explanatory


INTRODUCTION
High-dimensional data have more explanatory variables than observations (p>>n). For such data, regression analysis becomes ineffective because the assumption of no multicollinearity is violated [1]. Multicollinearity makes the estimates of the model parameters unstable and, with many variables under analysis, difficult to interpret. To overcome this, [2] developed LASSO by adding an L1-norm penalty to the regression model. LASSO works by shrinking the regression coefficients, setting those of variables with small effects to zero and retaining the variables with significant effects.
LASSO can solve high-dimensional data problems because it shrinks the regression coefficients of variables that are highly correlated with the error toward zero or exactly to zero, so that it acts as a variable selection method as well as overcoming multicollinearity [3]. LASSO has no closed-form solution for its coefficient estimates, so a computational algorithm is needed to obtain them [4].
One algorithm that solves the LASSO regression problem efficiently is Least Angle Regression (LARS).
This study used data on antioxidant activity in C. asiatica, one of the plants that can be used as a natural antioxidant. [5] concluded that the methanol extract of the C. asiatica herb has [...] C. asiatica can be influenced by differences in the area of origin, climate (temperature, humidity, light, and wind), and geographical conditions [6]. Crucial information about the antioxidant activity of C. asiatica can be obtained using spectroscopic techniques.
Fourier transform infrared (FTIR) spectrophotometry is a spectroscopic technique based on the vibrations of molecules exposed to infrared radiation; it characterizes molecules by their functional groups.
In general, spectroscopic data are high-dimensional and suffer from multicollinearity because they have more independent variables than observations (p>>n) [7]. In this study, there were 1866 wavenumbers as explanatory variables and 15 observations, and multicollinearity was present, so statistical techniques were needed to deal with it. Several previous studies have addressed this problem. One applied the support vector machine method to classify six Zingiberaceae plants using variables selected by a genetic algorithm [8]. [9] applied partial least squares (PLS) to find the functional groups contributing most to the antioxidant activity of Syzygium polyanthum. Previous research on the correlation between the metabolite profiles and bioactivities of Curcuma aeruginosa applied principal component analysis and PLS [10]. A study comparing PLS and LASSO regression on microarray data concluded that LASSO performed better than PLS regression [11].
In this study, we used the LASSO method to predict the functional groups of the antioxidant compounds from FTIR spectra of C. asiatica because the method suits high-dimensional data. In addition, LASSO shrinks the regression coefficients of variables that are highly correlated with the error toward zero or exactly to zero, so that it acts as a variable selection method as well as overcoming multicollinearity [3]. LASSO works by adding a constraint to the least-squares method. [12] applied the LASSO method to high-dimensional data and found that it was better able to select explanatory variables than multiple regression. The LASSO method has no closed-form solution for its coefficient estimates, so a computational algorithm is needed to obtain them [4]. One algorithm that solves the LASSO regression problem efficiently is LARS.
In addition, an extension of LASSO, namely the group LASSO, is also used in this study because it is designed for grouped data. FTIR data tend to be clustered; this follows from the theory that if two compounds give an infrared absorption peak at the same location, the two compounds can be said to be identical [13]. [14] applied the group LASSO to high-dimensional FTIR data with 1798 wavenumbers as explanatory variables and 280 observations, concluding that the group LASSO provides high model accuracy for high-dimensional data with multicollinearity. Therefore, in this study, LASSO and group LASSO are used to identify the functional groups that contribute most to the antioxidant activity of C. asiatica.

Centella Asiatica
C. asiatica grows easily in tropical and subtropical areas [15]. It is easily found in Indonesia and is used as an herbal plant. Previous research shows that C. asiatica has many beneficial properties, including antimicrobial, antioxidant, wound-healing, anti-inflammatory, and anticancer activities [16]. C. asiatica contains polyphenols, flavonoids, carotene, tannin, vitamin C, and triterpenoids (asiaticoside), which have antioxidant activity [17]. Methanol and water extracts of C. asiatica can protect against DNA damage [5].

Antioxidant Activity
Antioxidants play an essential role in protecting cells from reactions caused by free radicals because they capture free radicals [18]. Free radicals are compounds that contain unpaired electrons in their orbitals, which makes them very reactive with surrounding molecules; they damage lipids, proteins, and nucleic acids. If free radicals outnumber antioxidants in the body, oxidative stress can result. Oxidative stress plays an essential role in several diseases, such as atherosclerosis, chronic kidney failure, diabetes mellitus, cancer, premature aging, cardiovascular disease, and neurological diseases [19].
Antioxidants are compounds that can slow the oxidation process of biological molecules or suppress free radicals [20]. Compounds that have antioxidant effects are phenols and polyphenols, and the most common are flavonoids (flavonols, isoflavones, flavones, catechins, and flavanones), cinnamic acid derivatives, coumarins, tocopherols, and polyfunctional organic acids [21].

Fourier transform infrared (FTIR) spectrophotometer
The FTIR spectrophotometer measures infrared absorption at many wavelengths simultaneously [14]. Infrared absorption in a specific wavenumber region can be used to determine the functional groups present. An organic compound can be identified from the functional groups it contains. Different combinations give different functional groups and infrared absorption patterns [14]. If two compounds give infrared absorption peaks at the same location, they can be said to be identical [13]. The spectral data produced by FTIR are quantitative, generally high-dimensional, and multicollinear because the number of independent variables (p) is much greater than the number of observations (p>>n) [7].

LASSO
Tibshirani introduced the least absolute shrinkage and selection operator (LASSO) method in 1996. The coefficient estimator in the LASSO method is obtained by minimizing the residual sum of squares subject to a constraint, as in Eq. (1) [2]:

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t \tag{1}$$

The value of t controls the shrinkage of the coefficient estimator, with t ≥ 0; p is the number of explanatory variables, and N is the number of observations. The LASSO estimator is obtained by specifying a standardized limit, namely $s = t / \sum_{j=1}^{p} |\hat{\beta}_j^{0}|$, where $\hat{\beta}_j^{0}$ is the full least-squares estimate of the j-th coefficient. [...]

1. Standardize the independent variables to have a mean of zero and a variance of one.
2. Find the independent variable that is most correlated with the residual r.
[...]

The LAR algorithm is modified to obtain the LASSO solution by changing the 4th step as follows:
4a. If a non-zero coefficient reaches zero, remove its variable from the set of active variables and recalculate the joint OLS direction.
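
The standardization step and the search for the variable most correlated with r can be sketched in Python. This is a minimal illustration on made-up toy data, not the study's implementation: the function names `standardize` and `most_correlated` are hypothetical, the initial residual is assumed to be r = y - mean(y), and the population variance (divisor n) is assumed for standardization.

```python
import math


def standardize(column):
    """Center a column to mean zero and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if sd == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mean) / sd for v in column]


def most_correlated(columns, y):
    """Return (index, correlation) of the column most correlated with the
    assumed initial residual r = y - mean(y)."""
    n = len(y)
    y_mean = sum(y) / n
    r = [v - y_mean for v in y]
    r_norm = math.sqrt(sum(v * v for v in r))
    best_j, best_corr = -1, 0.0
    for j, col in enumerate(columns):
        z = standardize(col)
        # z has mean 0 and sum of squares n, so its Euclidean norm is sqrt(n).
        corr = sum(a * b for a, b in zip(z, r)) / (math.sqrt(n) * r_norm)
        if abs(corr) > abs(best_corr):
            best_j, best_corr = j, corr
    return best_j, best_corr


# Toy data (illustrative only): three explanatory variables, five observations.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
j, corr = most_correlated(columns, y)
print("first variable to enter:", j, "correlation:", round(corr, 4))
```

The variable returned here is the one that would enter the active set first; the later steps, which move coefficients along the joint least-squares direction, are not covered by this sketch.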

Group LASSO
The Group Least Absolute Shrinkage and Selection Operator (Group LASSO) method is a regression technique that applies LASSO selection to the explanatory variables in groups. This grouping aims to facilitate the selection of explanatory variables with similar characteristics. Group LASSO allows the variables to be divided into a large number of groups of unequal size [25]. The coefficient estimator in the group LASSO is obtained by minimizing the penalized residual sum of squares in Eq. (2) [26]:

$$\hat{\beta}_{\lambda} = \arg\min_{\beta} \left( \Big\| y - \sum_{j=1}^{k} X_j \beta_j \Big\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{p_j}\, \|\beta_j\|_2 \right) \tag{2}$$

where k is the number of groups, $X_j$ is the j-th group of explanatory variables, $\beta_j$ is the j-th vector of regression coefficients, and $p_j$ is the number of variables in the j-th group, while λ ≥ 0 controls the amount of shrinkage. The model reduces to the standard least-squares form when λ = 0. As λ increases, the coefficient estimates shrink toward zero, reaching zero as λ goes to infinity.
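
To make the group penalty concrete, the sketch below evaluates a group LASSO objective for a given coefficient vector. It assumes the commonly used form of the criterion (residual sum of squares plus λ times the sum over groups of sqrt(p_j) times the Euclidean norm of that group's coefficients); the function name `group_lasso_objective` and the toy numbers are illustrative assumptions, not taken from the study.

```python
import math


def group_lasso_objective(X, y, beta, groups, lam):
    """Return RSS + lam * sum_j sqrt(p_j) * ||beta_j||_2.

    X      : list of rows (one row per observation)
    y      : list of responses
    beta   : one coefficient per column of X
    groups : list of lists of column indices, one list per group
    lam    : non-negative shrinkage parameter
    """
    rss = 0.0
    for row, target in zip(X, y):
        fitted = sum(x * b for x, b in zip(row, beta))
        rss += (target - fitted) ** 2
    penalty = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        penalty += math.sqrt(len(idx)) * group_norm
    return rss + lam * penalty


# Toy example (illustrative only): two groups, of sizes 2 and 1.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0]
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [2]]
print(group_lasso_objective(X, y, beta, groups, lam=0.5))
```

Because the penalty acts on the norm of a whole group, a group's coefficients are either all zero or, in general, all non-zero, which is what makes the selection operate on groups rather than on single variables.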

Data Sources
The data used are the results of FTIR measurements made to assess the antioxidant content of C. asiatica. The data come from an experiment by the Biopharmaceutical Study Center Team in collaboration with the Statistics Department of IPB, in research conducted by Putri [27]. The observations were obtained by extracting C. asiatica with five extract materials, namely water, 30% ethanol, 50% ethanol, 70% ethanol, and p.a. ethanol. The experiment was carried out with three replications for each extract material, producing 15 observations. Measurements were then made with an FTIR spectrophotometer, resulting in 1866 absorption points. The explanatory variables (X) are the wavenumbers, while the response variable (Y) is the antioxidant activity of the C. asiatica extract.

Procedure Of Analysis
Data processing begins with data exploration: the explanatory variables and the response variable are standardized to have a mean of zero and a variance of one, and the correlations between the explanatory variables are calculated. The second step is to apply the modified LARS algorithm for LASSO to the data. This step consists of determining the LASSO regression coefficient estimates at each iteration, selecting the best model using leave-one-out cross-validation (LOOCV), and identifying the functional groups that affect antioxidant activity with LASSO. The third step is to apply the group LASSO method. At this step, the variables must first be grouped based on their chemical properties. The best model is then determined by finding the lambda that gives the minimum cross-validation error (CVE), and the functional groups that affect antioxidant activity are identified with the group LASSO. The fourth step is to identify the functional groups that contribute to the antioxidant activity of C. asiatica. The last step is to determine the goodness-of-fit measure of the LASSO and group LASSO models.
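
The model-selection idea in this procedure, leaving out one observation at a time and choosing the tuning value with the smallest cross-validation error, can be sketched as follows. To stay self-contained, the sketch swaps in a one-predictor LASSO in penalized form without an intercept, solved by soft-thresholding, rather than the LARS or group LASSO fits used in the study; all function names and the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(x, y, lam):
    """Minimize 0.5 * sum((y - b * x)^2) + lam * |b| for a single predictor."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return soft_threshold(xy, lam) / xx


def loocv_error(x, y, lam):
    """Mean squared prediction error over all leave-one-out folds."""
    n = len(y)
    total = 0.0
    for i in range(n):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        b = fit_lasso_1d(x_train, y_train, lam)
        total += (y[i] - b * x[i]) ** 2
    return total / n


def select_lambda(x, y, candidates):
    """Return the candidate with the smallest LOOCV error and all errors."""
    errors = {lam: loocv_error(x, y, lam) for lam in candidates}
    best = min(errors, key=errors.get)
    return best, errors


# Toy data (illustrative only): y is exactly twice x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]
best, errors = select_lambda(x, y, [0.0, 1.0, 100.0])
print("selected lambda:", best)
```

With the 15 observations described in this study, the same loop would run 15 folds, each fitting on 14 observations and predicting the one left out.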

Data exploration
FTIR spectroscopy is important in fingerprint analysis because it can display a different spectral pattern for each active compound [28]. This study used spectral data from FTIR measurements on C. asiatica plants. Each extract material was measured in three replications, so each color in Figure 1 represents the FTIR spectrum of one extract material in one replication. The resulting FTIR spectra cover wavenumbers from 399.2373 cm⁻¹ to 3996.232 cm⁻¹. Figures 1 and 2 show that the infrared absorption intensity differs little between the spectra of the extract materials and replications. The infrared absorption produces a relatively similar pattern of peaks and valleys in each spectrum over a given interval. The next step is to calculate the correlation between the explanatory variables. In Figure 3, the higher the correlation between independent variables, the stronger the color. Figure 3 is dominated by dark blue, meaning that there is a multicollinearity problem in the data.
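
The correlation check described above can be sketched in a few lines. The example computes a Pearson correlation matrix for a handful of toy columns and reports the share of variable pairs whose absolute correlation exceeds a cutoff; the cutoff of 0.9, the function names, and the data are assumptions made for illustration, and the study itself reports the correlations graphically in Figure 3.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def correlation_matrix(columns):
    """Correlation between every pair of explanatory variables (columns)."""
    p = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(p)] for i in range(p)]


def share_highly_correlated(corr, threshold=0.9):
    """Fraction of distinct variable pairs with |r| >= threshold."""
    p = len(corr)
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    high = sum(1 for i, j in pairs if abs(corr[i][j]) >= threshold)
    return high / len(pairs)


# Toy data (illustrative only): the second column is twice the first.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 1.0, 3.0, 2.0],
]
corr = correlation_matrix(columns)
print(share_highly_correlated(corr))
```

A large share of highly correlated pairs is the numerical counterpart of a correlation heat map dominated by strong colors.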

LASSO Analysis
The LASSO coefficient estimates are obtained computationally with a modification of the LAR algorithm, commonly called LARS, which is more efficient than quadratic programming. The estimation is iterative and starts with all coefficients set to zero. Then the independent variable most correlated with the error is entered into the model. The order in which the independent variables enter the model can be seen in Figure 4. In Figure 4, the variable X1490 enters the model in the first iteration, at s around 0.005, meaning that X1490 has the highest correlation with the residual among all variables. In the second iteration, X1145 enters, so the model has two variables, X1490 and X1145, at s around 0.0183. The process continues until the variable X1168 enters the model at stage 65.
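
The quantity s reported for each stage can be illustrated with a short sketch. It assumes that s is the L1 norm of the coefficient vector at a given stage divided by the L1 norm at the final stage of the path, which is how the fraction is commonly defined in LARS software; the source does not spell out the definition, and the function name and the toy path below are hypothetical.

```python
def l1_fractions(path):
    """Return the assumed fraction s for every stage of a coefficient path.

    path : list of coefficient vectors, one per stage, ending at the full fit.
    """
    norms = [sum(abs(b) for b in beta) for beta in path]
    full = norms[-1]
    if full == 0:
        raise ValueError("the final stage has no non-zero coefficients")
    return [norm / full for norm in norms]


# Toy path (illustrative only): two variables entering over three stages.
path = [
    [0.0, 0.0],
    [0.5, 0.0],
    [1.0, -0.5],
    [1.5, -1.5],
]
print(l1_fractions(path))
```

Under this assumption, s starts at 0 with all coefficients at zero and reaches 1 at the end of the path, so a small value such as 0.005 corresponds to a heavily shrunken model.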
The best model is selected by cross-validation. In the package, the LARS algorithm uses fraction mode to show at which iteration the best LASSO regression model is produced. This is indicated by the optimal value of s, which is the value with the minimum cross-validation error. The cross-validation values can be seen in Figure 5. Once the best model is known, the next step is to find the functional groups that affect the antioxidant activity by referring to Table 1 [29]. Based on Table 1, the functional groups that affect the antioxidant activity of C. asiatica are -NH, -OH, and C-O.

Group LASSO Analysis
Group LASSO analysis is applied to data whose explanatory variables are grouped. It is therefore necessary to group the explanatory variables based on the similarity of their characteristics. The grouping is done by looking at the peaks of the FTIR spectrum: the explanatory variables that form a single peak are placed in the same group. This is based on the theory that if two compounds give an infrared absorption peak at the same location, the two compounds can be said to be identical [13]. The grouping of explanatory variables can be seen in Figure 6. Based on Figure 6, the data have a relatively similar pattern of peaks and valleys in each spectrum over a given interval. Thus, the explanatory variables can be grouped in the same way for every spectrum, as shown in Table 2. [...] Based on this, the groups that affect the antioxidant activity of C. asiatica are shown in Figure 7. Table 1 shows that the functional groups that affect antioxidant activity are -NH, -OH, C-O, and -CH.
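
One way to turn the peak-based grouping into code is sketched below. The exact grouping rule behind Table 2 is not given here, so the sketch assumes a simple one: a new group starts at every interior local minimum (valley) of a representative spectrum, so that each group spans one peak. The function name `group_by_peaks` and the toy spectrum are hypothetical, and real spectra would likely need smoothing first so that noise does not create spurious valleys.

```python
def group_by_peaks(spectrum):
    """Assign a group id to every spectral point.

    A new group begins at each interior local minimum, so all points that
    belong to the same peak share one group id.
    """
    n = len(spectrum)
    groups = [0] * n
    current = 0
    for i in range(1, n):
        is_valley = (
            i < n - 1
            and spectrum[i] < spectrum[i - 1]
            and spectrum[i] <= spectrum[i + 1]
        )
        if is_valley:
            current += 1
        groups[i] = current
    return groups


# Toy spectrum (illustrative only): two peaks separated by one valley.
spectrum = [0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 0.8, 0.3]
print(group_by_peaks(spectrum))
```

The resulting group ids can then be passed to a group LASSO routine as the group membership of each explanatory variable.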

Identification of Functional Groups Contributing to the Antioxidant Activity
The functional groups that make a major contribution in this research are those identified using the LASSO and the group LASSO. Based on Table 3, the functional groups that have a major effect on the antioxidant activity of C. asiatica are -NH, -OH, and C-O. [...] all the variables in that group are included in the model, so the group LASSO identifies more functional groups than the LASSO.
In general, metabolites with the functional groups -NH, -OH, and C-O are thought to be the major contributors to the antioxidant activity of C. asiatica extract.
The functional groups identified are the cumulative result of several compounds that act as antioxidants. Based on these results, it can be assumed that the functional groups identified are derived from phenolic compounds. This result agrees with the finding of [30] that phenolic compounds are the most significant contributors to antioxidant activity.

The Measure Of Goodness
The goodness-of-fit measure used is the mean squared error (MSE), the average of the squared errors. MSE was used to compare the LASSO and the group LASSO on high-dimensional data whose explanatory variables form groups. MSE indicates the size of the error, so the smaller the MSE, the better the model. The goodness-of-fit measures of the two models are given in Table 4. Based on MSE, the group LASSO is better than the LASSO with the modified LAR algorithm in modeling the functional groups that contribute most to antioxidant activity.
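
For reference, the MSE described here is the sum of squared differences between observed and predicted values divided by the number of observations. A minimal sketch follows; the function name and the toy values are illustrative and are not the values reported in Table 4.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n


# Toy values (illustrative only): one prediction is off by 2.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Comparing two models then amounts to computing this quantity for each set of predictions and preferring the smaller value.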

CONCLUSION
This study concluded that the group LASSO was better than the LASSO with the modified LAR algorithm in identifying the functional groups that contribute most to antioxidant activity. Metabolites with the -NH, -OH, and C-O functional groups are suspected to be the major contributors to the antioxidant activity of C. asiatica extract. The functional groups identified are derived from phenolic compounds, the most significant