Evaluation of multiple linear regression function and generalized linear model types in estimating natural menopausal age: A cross-sectional study

Abstract
Background: Since women spend about one-third of their lifespan in menopause, accurate prediction of the age at natural menopause and identification of the factors affecting it are crucial to increase women's life expectancy.
Objective: This study aimed to compare the performance of generalized linear models (GLM) and the ordinary least squares (OLS) method in predicting the age at natural menopause in a large population of Iranian women.
Materials and Methods: This cross-sectional study was conducted using data from the recruitment phase of the Shahedieh Cohort Study, Yazd, Iran. In total, 1251 women who had experienced natural menopause were included. To model the age at natural menopause, multiple linear regression was fitted using the OLS method and GLMs. The performance of the regression models was measured using the Akaike information criterion, root-mean-square error (RMSE), and mean absolute error.
Results: The mean age at menopause was 49.1 ± 4.7 yr (95% CI: 48.8-49.3) with a median of 50 yr. The analysis showed similar Akaike criterion values for the multiple linear model with the OLS technique and the GLM with the Gaussian family. However, the RMSE and mean absolute error values were much lower in the GLM. In all the models, education, history of salpingectomy, diabetes, cardiac ischemia, and depression were significantly associated with menopausal age.
Conclusion: In this study, the GLM with the Gaussian family and the log link function, which reduced the RMSE and mean absolute error, can be a good alternative for modeling the age at natural menopause.


Introduction
Natural menopause is the permanent cessation of menstruation for 12 months, with a decrease in blood estrogen levels, an increase in follicle-stimulating hormone, and the cessation of ovarian follicular development (1-3). This natural transition occurs in almost all women without any pathological or physiological causes (2). Evidence shows that the age at natural menopause differs among races: the mean age at menopause is 51.4 yr in Western women and 49-50 yr in Asian women (3). According to previous studies, menopause is associated with factors such as socioeconomic status, physical activity, marital status, education level, and fertility history. However, the factors that affect the age at natural menopause vary across the world (2,4). With the onset of menopause, biochemical and hormonal changes occur, such as decreased estrogen levels, that cause behavioral (5) and physical disorders (6,7). The premature or late occurrence of these changes may bring about premature or late-onset menopause, increasing the risk of certain diseases in women. According to previous studies, early menopause correlates with increased mortality from cardiovascular diseases (8,9) and osteoporosis (10), while late menopause increases the risk of ovarian (11,12), uterine (13), and breast cancers (14,15).
Since women spend about one-third of their lifespan in menopause, accurate prediction of the age at natural menopause and identification of the factors affecting it are crucial to increase women's life expectancy (3). In medical data analysis, it is not always possible to satisfy the regression assumptions. For example, the response variable may not have a normal distribution, and a normal distribution may not be obtained even after the researcher applies different transformations. In these cases, generalized linear models (GLMs) can be used because their predictions are less sensitive to violations of the regression assumptions. To the best of our knowledge, very few studies have applied GLMs to determine menopausal age. A GLM is formed in 3 steps. In the first step, a suitable distribution for the response variable is determined, which can be any member of the exponential family. For example, for a continuous response variable, the Gaussian, Gamma, and inverse Gaussian distributions can be considered. The second step is the formation of the systematic component, which is a linear combination of the predictor variables. The third step is the application of a link function to establish a relationship between the random and systematic components (16-18).
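The 3 steps above can be written compactly as follows. This is only a notational sketch: the symbols Y_i, \mu_i, \eta_i, \beta_j, x_{ij}, and g are assumptions introduced here for illustration and are not taken from the cited sources.

```latex
\begin{align}
  % Step 1 (random component): the response follows an exponential-family
  % distribution (e.g., Gaussian, Gamma, inverse Gaussian) with mean \mu_i
  Y_i &\sim \text{exponential family}, \qquad \mathrm{E}[Y_i] = \mu_i \\
  % Step 2 (systematic component): linear combination of the p predictors
  \eta_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \\
  % Step 3 (link function): connects the mean to the linear predictor
  g(\mu_i) &= \eta_i
\end{align}
```

With the Gaussian distribution and the identity link g(\mu) = \mu, this reduces to the ordinary multiple linear regression model; with the log link g(\mu) = \log \mu, the fitted mean \mu_i = \exp(\eta_i) is always positive.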
Since the regression assumptions are important for the ordinary least squares (OLS) model, a common approach before the advent of GLMs, when these assumptions did not hold, was to transform the response variable to normalize its distribution and stabilize its variance. Nonetheless, for most types of data, including medical data, it is difficult to find a transformation that both stabilizes the variance and normalizes the data. In such cases, the best transformation for normalization usually differs from the best transformation for variance stabilization. Another advantage of the GLM is that the link function is selected separately from the random component. In the GLM, the link function does not need to stabilize the variance or normalize the data. In addition, the fitting process is not limited to the normal distribution, which widens the choice of probability distributions for Y.
These advantages of GLMs can be important for data from medical studies because, in some cases, the regression assumptions cannot be established by conventional methods, and the prediction error increases if OLS models are used when these assumptions do not hold (19).
This study aimed to compare the performance of various GLMs and the OLS method in predicting the age at natural menopause in a large population of Iranian women. We also compared the GLMs and OLS in estimating the factors associated with natural menopausal age.

Materials and Methods
In this cross-sectional study, the data of 1706 women who had experienced menopause and were registered in the Shahedieh Cohort Study, Yazd, Iran from April 2015 to September 2017 were extracted.
The Shahedieh Cohort Study is a part of the PERSIAN cohort. The cohort was carried out in the cities of Shahedieh, Zarch, and Ashkezar in Yazd because of the homogeneity of their populations (a limited number of immigrants and emigrants and homogeneous ethnicity) and local cooperation.
After the cohort center was established and equipped, staff were recruited, and the team of research officers was trained to collect the required information. To collect the research data, residents of Shahedieh City aged 35-70 yr were asked to attend the predetermined health centers. The participants' information was then collected using a standardized questionnaire. Further information about the standardized questionnaire and the PERSIAN Cohort can be found in related research (20). The inclusion criterion of this study was women aged 35-70 yr who had experienced menopause (21). The exclusion criteria were as follows: women with other menopausal etiologies, including breastfeeding, hormonal disorders, and surgeries such as hysterectomy or oophorectomy (22), and individuals who had answered "No" to the question "Was your menopause a natural one?". After excluding these participants, a total of 1251 women were included in the study.
In this study, the independent variables for menopausal age were selected from the variables available in the cohort questionnaires. They included demographic characteristics (e.g., age, education level, and marital status), fertility history, body mass index (BMI), waistline, individual habits, social and economic status, physical activity, employment status, and some chronic illnesses. The participants' age was calculated at their entry into the study. Based on the participants' last educational degree, the level of education was categorized into 5 groups: 1) illiterate, 2) elementary, 3) secondary, 4) high school, and 5) academic.
Only 1 of the 1251 participants was single, and only 8 widows were reported. Therefore, because of these very small numbers, the single, widowed, and divorced participants were merged into 1 group, resulting in 2 main categories: 1) married and 2) widow/divorcee/single. The participants' fertility history included the natural menopausal age, menarche age, age at the first pregnancy, use of oral contraceptives, number of pregnancies, use of infertility-related medications, and history of salpingectomy and infertility.
The natural menopausal age was defined as the person's age at her last menstrual period, followed by 12 months without menstruation (not attributable to pregnancy, lactation, etc.), and the menarche age was the age at first menstruation. Other variables included the use of oral contraceptives, salpingectomy history, infertility history, and use of infertility-related medications, each of which was recorded as Yes/No. Unfortunately, the cohort profile contained no data on the duration of oral contraceptive use or the dosage of medications used for infertility treatment. The questionnaire included only a single yes/no question asking whether the participants had used these medications. Thus, such data could not be examined in the present study.
The participants' BMI was obtained by dividing weight in kilograms by the square of height in meters (kg/m²). This variable was categorized into 3 levels: 1) ≤ 24.99 kg/m², 2) 25-29.99 kg/m², and 3) ≥ 30 kg/m². The waistline was measured in cm. The socio-economic status questionnaire contained questions on the highest educational qualification, housing status, number of households, and home facilities. After data collection, the participants' socio-economic status was classified as weak, moderate, or strong using a clustering method. The International Physical Activity Questionnaire was administered to assess each individual's physical activity. The validity and reliability of this questionnaire have been confirmed in Iran. For each individual, physical activity was expressed in multiples of the resting metabolic rate (MET) according to this questionnaire.
According to this questionnaire, a total energy expenditure of 0-599 MET-min/wk indicates poor physical activity, 600-3000 MET-min/wk indicates moderate physical activity, and more than 3000 MET-min/wk indicates vigorous physical activity. Individual habits comprised 2 variables, drug use and hookah smoking, each of which was recorded as Yes/No. Cigarette smoking is considered improper in Iranian and Islamic culture; accordingly, fewer than 5 of the 1251 women were reported as smokers. Since these included hookah smokers, we merged them all with the hookah smokers into 1 group. The duration and amount of use were not asked in the questionnaire.
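The BMI and physical activity cut-offs described above can be sketched in a few lines of Python. This is a minimal illustration, not the cohort's actual processing code: the function names and the labels "low", "moderate", and "high" are hypothetical, and values falling between the stated boundaries (e.g., a BMI of 24.995) are assigned to the lower category as an assumption.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2


def bmi_level(bmi_value):
    """Return the BMI level (1, 2, or 3) used in the text.

    Assumption: values below 25 fall in level 1 and values below 30 in level 2,
    so the gaps between the printed cut-offs are closed.
    """
    if bmi_value < 25:
        return 1
    if bmi_value < 30:
        return 2
    return 3


def activity_level(met_min_per_week):
    """Classify weekly physical activity from total MET-min/wk.

    The labels are illustrative names for the three levels in the text.
    """
    if met_min_per_week < 600:
        return "low"
    if met_min_per_week <= 3000:
        return "moderate"
    return "high"
```

For example, a woman weighing 70 kg with a height of 1.75 m has a BMI of about 22.9 kg/m² and falls in level 1.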
Chronic diseases considered in this study included a history of diabetes, cardiac ischemia, thyroid disease, kidney stones, gallstones, rheumatism, chronic headache, and depression. Since the menopausal transition begins 3-5 yr before menopause occurs, we studied individuals whose diseases were diagnosed at least 4 yr before the menopausal age.

Ethical considerations
The research proposal was approved by the Ethics Committee of School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran (Code: IR.SSU.SPH.REC.1397.066).

Statistical analysis
Statistical analyses were performed using R software, version 3.6.2 (R Core Team, R Foundation for Statistical Computing, Vienna, Austria) and SPSS software version 24 (IBM Corporation, Armonk, NY, USA).
The mean age at menopause was compared among the levels of each categorical variable. When the age at menopause was normally distributed at each level of the variable, the parametric t test (2 groups) or ANOVA (more than 2 groups) was used; otherwise, the nonparametric Mann-Whitney U test or Kruskal-Wallis test was used, respectively. The correlation between the quantitative variables and menopausal age was measured using the Pearson correlation coefficient. A significance level of 5% was used. To investigate the impact of potentially influential variables on menopausal age, multiple linear regression was fitted using OLS and GLMs.
Multiple linear regression using the OLS technique is common in predicting natural menopausal age. However, the GLM is more general than the linear model since, in a GLM, the response variable can have a non-normal distribution. Moreover, the mean of the response variable can have a non-linear relationship with the predictor variables. Linear regression is a special case of the GLM. To the best of our knowledge, very few studies have applied GLMs to determine menopausal age. The GLMs were formed in 3 steps. In the first step, we identified a suitable distribution for the response variable, which could be any member of the exponential family. Since menopausal age is a positive and continuous variable, GLMs with the Gaussian, Gamma, and inverse Gaussian distributions were applied in this study. The second step was the formation of the systematic component, which is a linear combination of the predictor variables. The third step was the application of a link function to establish a relationship between the random and systematic components. The systematic component can take any real value, whereas the response variable is restricted to a subset of the real numbers. An appropriate link function keeps the predictions within the range of the response variable. The log link function is an efficient link function for non-negative data (16-18).
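To make the role of the log link concrete, the following is a minimal sketch of fitting a Gaussian GLM with a log link by iteratively reweighted least squares, restricted to a single predictor for brevity. It is not the code used in this study (the analyses were run in R and SPSS); the function names and the single-predictor restriction are assumptions made for illustration.

```python
import math


def fit_gaussian_log_glm(x, y, max_iter=100, tol=1e-10):
    """Fit mu = exp(b0 + b1 * x) with a Gaussian family and log link.

    Uses iteratively reweighted least squares (IRLS). All y must be positive
    because the starting values come from OLS on log(y).
    """

    def weighted_fit(weights, z):
        # Closed-form weighted least squares of z on (1, x).
        sw = sum(weights)
        swx = sum(w * xi for w, xi in zip(weights, x))
        swz = sum(w * zi for w, zi in zip(weights, z))
        swxx = sum(w * xi * xi for w, xi in zip(weights, x))
        swxz = sum(w * xi * zi for w, xi, zi in zip(weights, x, z))
        b1 = (sw * swxz - swx * swz) / (sw * swxx - swx * swx)
        b0 = (swz - b1 * swx) / sw
        return b0, b1

    # Starting values: unweighted regression of log(y) on x.
    b0, b1 = weighted_fit([1.0] * len(x), [math.log(yi) for yi in y])
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]      # linear predictor
        mu = [math.exp(e) for e in eta]       # inverse log link
        # Working response and weights for the Gaussian family with log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = [m * m for m in mu]
        new_b0, new_b1 = weighted_fit(w, z)
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted mean response; always positive because of the log link."""
    return math.exp(b0 + b1 * x_new)
```

Because the prediction is exp(b0 + b1 * x), it cannot fall below zero whatever the value of the linear predictor, which is the property of the log link referred to above. In practice, a statistical package would be used to fit models with several predictors.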
The classical assumptions of regression were examined before conducting multiple linear regression with the OLS technique. The normality of menopausal age was examined using skewness and kurtosis measures. The variance inflation factor (VIF) was used to detect collinearity among the independent variables. A VIF greater than 10 was taken to indicate collinearity between the independent variables (23).
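These diagnostics can be illustrated with a short Python sketch. It rests on several assumptions: the moment-based (population) estimators of skewness and excess kurtosis are used, which may differ slightly from the adjusted estimators reported by statistical packages; it is assumed that the kurtosis referred to is excess kurtosis; and the variance inflation factor is shown only for the special case of 2 predictors, where it equals 1 / (1 - r²) with r the Pearson correlation between them. The function names are hypothetical.

```python
import math


def _central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness; 0 for a symmetric sample."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3; about 0 for normally distributed data."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF in the two-predictor case: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)
```

With more than 2 predictors, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others.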
The performance of the regression models was compared using the Akaike information criterion (AIC), and their prediction error was measured using the root-mean-square error (RMSE) and mean absolute error (MAE).
The AIC estimates the amount of information lost when the considered statistical model is applied. The MAE is the mean absolute difference between the predicted and observed menopausal age, and the RMSE is the square root of the mean squared difference between the predicted and observed menopausal age (24-26).
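The three criteria can be sketched as follows. This is a minimal illustration with hypothetical function names; the AIC is written in its standard form, 2k - 2 ln(L), where k is the number of estimated parameters and ln(L) is the maximized log-likelihood, which is assumed to be supplied by the fitting software.

```python
import math


def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood
```

For example, for observed ages of 50, 48, and 52 yr and predicted ages of 49, 49, and 50 yr, the absolute errors are 1, 1, and 2 yr, so the MAE is 4/3 ≈ 1.33 yr and the RMSE is √2 ≈ 1.41 yr. The RMSE penalizes large errors more heavily than the MAE.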

Results
This cross-sectional cohort study investigated 1251 women who experienced natural menopause.  (Table I).
The results showed that the age at menopause had a direct and significant relationship with the number of pregnancies and the waistline.
In other words, the age at menopause increased with an increase in the number of pregnancies or the waistline (Table II).
In all regression models, education, history of salpingectomy, diabetes, cardiac ischemia, and depression were significantly associated with menopausal age (Table III). The distribution of menopausal age is illustrated in Figure 1. The skewness and kurtosis of the natural menopausal age were between -2 and 2, which indicated an approximately normal distribution of this variable.

Discussion
The mean and standard deviation of the  (27).
However, such a result was not found in the present study (25); we found that higher education had a negative coefficient in the regression equation. This may be due to the small number of participants with higher education. Some studies confirmed the relationship between BMI and menopausal age (28,29), but in the present study this association was not significant, consistent with some previous studies in Iran. Furthermore, no relationship was found between the waistline and menopausal age (3,25). Multiple linear regression analysis showed no association between marital status and menopausal age. This was consistent with some studies (2,3) and contradicted others (4).
According to previous studies, cigarette and hookah smoking affect factors such as gonadotropins and steroid hormones, leading to reduced follicle reserve and ovarian aging, which is one of the causes of premature menopause (30). In the present study, however, no association was found between smoking or hookah use and menopausal age.
However, Yang and co-workers reported that smokers were at risk for early menopause (31).
Several studies in Iran indicated that smoking was not associated with menopausal age. According to a study in Isfahan, this finding can be attributed to the fact that smoking is taboo for women in the Islamic culture of Iran; therefore, women may not report it accurately (3,25). The present study also faced limitations such as small sample sizes in some categories of the qualitative variables and the absence of data on the mother's menopausal age.

Conclusion
In the present study, the mean and standard