Identification of predictive factors of the degree of adherence to the Mediterranean diet through machine-learning techniques

Food consumption patterns have undergone changes that in recent years have resulted in serious health problems. Studies based on the evaluation of nutritional status have determined that the adoption of a food pattern based primarily on the Mediterranean diet (MD) has a preventive role, as well as the ability to mitigate the negative effects of certain pathologies. A group of more than 500 adults aged over 40 years from our cohort in Northwestern Spain was surveyed. Under our experimental design, 10 experiments were run with four different machine-learning algorithms and the predictive factors most relevant to adherence to the MD were identified. A feature selection approach was explored and, under a null hypothesis test, it was concluded that only 16 measures were of relevance, suggesting the strength of this observational study. Our findings indicate that the following factors have the highest predictive value in terms of the degree of adherence to the MD: basal metabolic rate, mini nutritional assessment questionnaire total score, weight, height, bone density, waist-hip ratio, smoking habits, age, EDI-OD, circumference of the arm, activity metabolism, subscapular skinfold, subscapular circumference in cm, circumference of the waist, circumference of the calf and brachial area.


INTRODUCTION
In the context of nutrition and public health, the Mediterranean diet (MD) has been forged over the centuries. It is characterised by cereals, olive oil, a low intake of saturated fats and meat, moderate consumption of dairy and a regular and moderate intake of wine, and it constitutes a lifestyle in accordance with the geographic, climatological, orographic, cultural and environmental conditions of the countries and regions that surround the Mediterranean Sea (Pérez C, 2011).

There is an increasing interest in the study of the preventive role of MD and also in its use as a treatment for various pathologies associated with chronic inflammation, such as metabolic syndrome, diabetes mellitus, cardiovascular disease (CVD), neurodegenerative diseases, breast cancer and psycho-organic deterioration, leading to greater longevity and better quality of life (Dussaillant et al., 2016; Chrysohoou et al., 2004; Trichopoulou, 2004; Serra-Majem et al., 2006; Estruch et al., 2013; Sofi et al., 2014; Della Camera et al., 2017). Moreover, MD has also been identified as a potential element contributing to the prevention of breast cancer (Shapira, 2017), including in patients carrying the BRCA mutation (Bruno et al., 2017). In 2010, UNESCO declared this diet an Intangible Cultural Heritage of Humanity (UNESCO, 2010).

Numerous studies published over the past decades show the relationship between MD intake and CVD (Widmer et al., 2015), and meta-analyses relate it to general health status (Sofi et al., 2014). In the Greek cohort EPIC (European Prospective Investigation into Cancer and Nutrition Study), a 2-point increase in adherence to this diet was associated with a 33% reduction in CVD mortality (Sofi et al., 2014). Additionally, the analysis of a sub-cohort of 2,700 individuals over 60 years old with a history of myocardial infarction showed that greater adherence to MD was associated with an 18% drop in overall mortality (Lack et al., 2003). Other studies have confirmed these associations, including the follow-up of a Spanish cohort of 13,600 adults with coronary heart disease. After 5 years, it was observed that 2 points of increase in adherence to MD were associated with a 26%

Eating disorders are linked to a distorted perception of one's own body image, as well as to body dissatisfaction. The study of body dissatisfaction is important because recent investigations have confirmed that alterations in body image play a causal role in eating disorders, rather than being secondary to them (Míguez Bernárdez et al., 2011). Body image is considered a qualitative approximation to the nutritional status of the individual (Sámano et al., 2015) and can be a determining factor in their nutritional management (Martínez-González et al., 2011).

Since their origins, one of the main fields of application of Machine-Learning (ML) techniques has been Biomedicine, with previously published studies in related areas such as biomedical imaging (Fernandez-Lozano et al., 2016b), characterisation of different types of carcinomas (Kim et al., 2017), measurement of activity in genetic networks (Hu et al., 2016), deformable models for image comparison (Rodriguez et al., 2014), and gene selection and classification of microarray data (Díaz-Uriarte and Alvarez de Andrés, 2006), to name a few.

Moreover, due to their great versatility, ML techniques have been used in a wide variety of application areas to discover hidden patterns in datasets: identification and authentication of tequilas (Pérez-Caballero et al., 2017), wearable sensor data fusion (Kanjo et al., 2018), predicting the outcomes of organic reactions (Skoraczyński et al., 2017), animal behaviour detection (Pons et al., 2017) and measuring the visual complexity of images (Machado et al., 2015). In particular, ML techniques have proven able to uncover non-obvious relationships in very diverse fields of application, such as image or voice recognition, sentiment analysis or language translation (Li et al., 2015; Perez-de Viñaspre and Oronoz, 2015).

The main objective of this work is the development of ML models for the prediction of the degree of adherence to the Mediterranean diet. To this end, information on different anthropometric and socio-demographic variables, nutritional status and self-perception of body image is used in order to identify which of the variables have the greatest influence on adherence to a healthy diet such as MD, allowing our patients to improve their quality of life and to reduce the negative effects of well-known related diseases.

Taking into account all of the above, the experimental methodology proposed in this study is based on the collection and generation of data from our cohort in Galicia (Spain), as well as on the use of ML techniques. The purpose is to extract and explain the underlying information in the data and to determine which of these variables are the most important for classifying people as having either good or poor adherence to the MD. As mentioned before, several health benefits are related to this particular diet, especially regarding chronic inflammation, metabolic syndrome, diabetes mellitus, CVD, neurodegenerative diseases, cancer and psycho-organic deterioration, in addition to greater longevity and better quality of life. Thus, this study is relevant for understanding how to measure the degree of adherence, in order to ensure the aforementioned benefits.

Manuscript to be reviewed
Computer Science

The structure of the article is as follows: in the Materials and methods section, the subjects and the variables measured for each of them are presented. Next, the machine learning and feature selection techniques are described, along with the experimental design followed to ensure that the results are reproducible and representative of the studied problem. In the next section, the results are presented and discussed, and the final section of the article includes the conclusions of the work.

The present study was structured as follows. Initially, a population from our cohort was selected to carry out the study; the population was grouped into two categories: high and low degree of adherence to the MD. Once the population on which the study is carried out has been identified, the information is collected from each of the users of the health system. The type of study is described below, the sample size is justified and all measurements collected are explained in detail. Once the dataset is generated, it is analysed with four different ML techniques and a feature selection phase is applied for dimensionality reduction.

The sample size was calculated taking into account the total population of the municipality (n = 12,446). After stratification by age and gender, n = 503 persons were selected to participate in the study. Sample size was estimated using the single proportion formula, with a 95% confidence interval. A sample size of n = 503 subjects was estimated based on a rate of adherence to the Mediterranean diet of 50%. Precision was set at 4.3% and percentage of losses at 10%. Population data is shown in Table 1.
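The single-proportion calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming the textbook formula n0 = z²·p·(1−p)/d² followed by a finite-population correction; the function name and the use of the correction are assumptions, since the text does not state which variant was applied, and how the 10% allowance for losses enters is not specified, so it is left out here.

```python
import math
from statistics import NormalDist


def single_proportion_sample_size(population, proportion, precision, confidence=0.95):
    """Sample size for estimating a single proportion (hypothetical helper).

    Assumes the textbook formula with a finite-population correction;
    the source does not say which variant the authors used.
    """
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Required sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / precision ** 2
    # Finite-population correction for a municipality of known size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Values taken from the text: N = 12,446, expected adherence 50%, precision 4.3%.
n = single_proportion_sample_size(population=12446, proportion=0.5, precision=0.043)
print(n)
```

Under these assumptions the result lands close to, but not exactly on, the reported n = 503; the difference presumably comes from rounding or stratification details that the text does not give.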

The hip circumference was measured as the maximum circumference around the buttocks. Based on these two values, the waist-hip ratio was calculated using the cut-off points proposed by the WHO, where normal levels of 0.8 are found in women and 1 in men, higher values indicating abdominal visceral obesity, which is associated with increased cardiovascular risk (Jover E, 1997).
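The waist-hip ratio and its cut-off points can be expressed in a few lines. The sketch below is illustrative only: the function names and the `"female"`/`"male"` keys are hypothetical, and the thresholds are simply the 0.8 and 1 values quoted above.

```python
# Cut-off points as quoted in the text (0.8 for women, 1 for men).
WHR_CUTOFF = {"female": 0.8, "male": 1.0}


def waist_hip_ratio(waist_cm, hip_cm):
    """Ratio of waist to hip circumference (both in the same unit)."""
    if hip_cm <= 0:
        raise ValueError("hip circumference must be positive")
    return waist_cm / hip_cm


def exceeds_whr_cutoff(waist_cm, hip_cm, sex):
    """True when the ratio is above the cut-off for the given sex."""
    return waist_hip_ratio(waist_cm, hip_cm) > WHR_CUTOFF[sex]


# Made-up measurements, used only to exercise the functions.
print(exceeds_whr_cutoff(90, 100, "female"))
print(exceeds_whr_cutoff(90, 100, "male"))
```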

The calf circumference was measured in the widest section of the ankle-knee distance (cuff area)

17-23.5, and well-nourished subjects obtain scores of 24 points and higher.
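The score bands can be written as a small lookup. This is a sketch under assumptions: the surviving text only states the 17-23.5 band and the threshold of 24 for well-nourished subjects, so the labels "at risk of malnutrition" and "malnourished" (below 17) are taken from the standard scoring of the Mini Nutritional Assessment rather than from this text, and the function name is hypothetical.

```python
def mna_category(total_score):
    """Map a Mini Nutritional Assessment total score to a category.

    Assumption: bands follow the standard MNA scoring
    (>= 24, 17 to 23.5, < 17); only the first two appear in the text.
    """
    if total_score >= 24:
        return "well nourished"
    if total_score >= 17:
        return "at risk of malnutrition"
    return "malnourished"


for score in (26, 23.5, 16.5):
    print(score, mna_category(score))
```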

A measure of subjective weight is included by asking: "I consider that my weight is: A) higher than normal, B) normal, C) lower than normal", following the model proposed in (Espina et al., 2001). Based on the answer, the population is classified into three groups: "fairly subjective weight" for those who believe they are at an ideal weight, "more subjective kilograms" for those who believe that they are overweight and "less subjective kilograms" for those who think they weigh less than they should.

Team, 2016) and the package mlr (Bischl et al., 2016) were used, which also allowed us to perform the considered experimental design. In addition, another objective of this study was to find as few variables as possible that would yield a performance value as high as possible, preferably at least equal to that obtained using all available variables. This is essentially a feature selection approach, whose main aims are to avoid overfitting and improve model performance, to provide faster and more cost-effective models, and to gain a deeper insight into the underlying processes that generated the data, as mentioned in (Saeys et al., 2007). There are three approaches in ML to perform this process, and a filter approach was chosen for its speed and independence of the classifier (Saeys et al., 2007). In general, performing this feature selection process helps to reduce the noise inherently present in such datasets.

and avoid the overfitting that could occur. In particular, the following well-known state-of-the-art techniques were implemented: Random Forest (Breiman, 2001), Support Vector Machines (Cortes and Vapnik, 1995; Vapnik, 1995), Elastic Net (Tibshirani, 1994; Zou and Hastie, 2005) and weighted k-Nearest Neighbours (Hechenbichler and Schliep, 2004).
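The filter approach to feature selection described in this section scores each variable independently of any classifier and keeps the top-ranked ones. The sketch below is a minimal illustration, assuming a Welch t-statistic between the two adherence groups as the score (the results section mentions a T-test filter, but its exact form is not given). The function names and the toy data are hypothetical and do not reproduce the mlr implementation used in the study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_features(rows, labels):
    """Rank feature names by |t| between the label-1 and label-0 groups.

    rows   : list of dicts mapping feature name -> numeric value
    labels : list of 0/1 class labels, one per row
    """
    scores = {}
    for name in rows[0]:
        high = [r[name] for r, y in zip(rows, labels) if y == 1]
        low = [r[name] for r, y in zip(rows, labels) if y == 0]
        scores[name] = abs(welch_t(high, low))
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up data: "weight" separates the groups, "noise" does not.
rows = [
    {"weight": 80, "noise": 1}, {"weight": 82, "noise": 5}, {"weight": 85, "noise": 3},
    {"weight": 60, "noise": 2}, {"weight": 63, "noise": 4}, {"weight": 61, "noise": 3},
]
labels = [1, 1, 1, 0, 0, 0]

ranking = rank_features(rows, labels)
selected = ranking[:1]  # keep the k best-scoring features (k = 1 here)
print(ranking, selected)
```

Because the score is computed per feature and never consults a model, the same ranking can be reused for all four classifiers, which is the independence property the text attributes to filter methods.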

Random Forest (RF) (Breiman, 2001) is a state-of-the-art ML technique that has been used in multiple domains with good results. One of its main strengths is that the results obtained are very easy to

Support Vector Machines (SVM) (Cortes and Vapnik, 1995; Vapnik, 1995) is also one of the ML

possible way (Burges, 1998). To achieve this goal, SVM introduces a particular mathematical concept known as a kernel: a mathematical function that maps the input space into a higher-dimensional space, which is used to transform a linearly non-separable problem into one that is separable. There are different kernel functions, which in general can be interpreted as a measure of similarity between two objects (60), and one of the most used is the Gaussian Radial Basis Function (RBF), because essentially any surface can be obtained with this function (61). In this case, the best model is sought through a grid search over two parameters. The first one (parameter C) is directly related to the model and balances the classification errors against the simplicity of the decision surface, while the second (the gamma parameter) is the free parameter of the Gaussian function; SVM is particularly sensitive to changes in this parameter. For both parameters, and according to the usual practice, values were evaluated as powers of two with exponents between -12 and 12. To better understand this technique, the following reading materials are recommended (Burges, 1998; Vert et al., 2004; Cristianini and Shawe-Taylor, 2000).
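To make the kernel and the search grid concrete, the following sketch computes the Gaussian RBF similarity between two points and enumerates the (C, gamma) candidates as powers of two. It assumes integer exponents from -12 to 12 in steps of one, which the text does not state explicitly; the function name is hypothetical, and the model fitting itself (done with mlr in the study) is not shown.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Assumption: integer exponents -12, -11, ..., 12 for both parameters.
exponents = range(-12, 13)
grid = [(2.0 ** c, 2.0 ** g) for c, g in product(exponents, exponents)]

print(len(grid))                              # number of (C, gamma) pairs
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], 0.5))  # identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], 1.0))  # similarity decays with distance
```

A larger gamma makes the similarity fall off faster with distance, which is one way to see why the classifier is so sensitive to this parameter.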

Elastic Net (ENET) (Tibshirani, 1994; Zou and Hastie, 2005) is based on lasso (penalised least squares method) and was specifically developed to solve some of the limitations encountered for this technique (56). On the one hand, a grid search was performed on two different parameters, the alpha penalty

following the maximum accumulated kernel densities the weighted k-Nearest neighbour are identified (Hechenbichler and Schliep, 2006; Samworth, 2012). In particular, numbers of neighbours less than or equal to nine were used. Therefore, this particular and improved implementation of a k-Nearest Neighbour

The dataset has a total of 38 variables employed to characterise the differences underlying the data between high and low adherence to the MD. The data were standardised using the z-score formula to have a mean equal to zero and a standard deviation equal to 1. Four different ML techniques were used to verify the results obtained, in an attempt to identify the best-performing technique. Initially, the complete set of study variables was analysed. Figure 1a and b show that the techniques present fairly stable behaviour in the prediction. Even an a priori simple technique such as KNN obtains the best results of the entire experimental phase, indicating that almost all variables contain relevant information. In any case, in order to understand whether there is noise or contradictory or correlated information that may be hindering the learning process of the algorithms, a phase of dimensionality reduction is then carried out.
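The z-score standardisation mentioned above subtracts the mean of each variable and divides by its standard deviation. A minimal sketch follows; it assumes the population standard deviation (the text does not say whether the population or the sample version was used), and the function name and input values are made up for illustration.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardise a list of numbers to mean 0 and standard deviation 1.

    Assumption: population standard deviation (divides by n).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant variable")
    return [(v - mu) / sigma for v in values]


# Made-up values standing in for one column of the dataset.
scaled = z_score([60.0, 70.0, 80.0, 90.0])
print(scaled)
```

Standardising each column separately puts variables measured in different units (kilograms, centimetres, questionnaire points) on a common scale, which matters for distance-based methods such as KNN and for the RBF kernel.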

Additionally, a process of feature selection was carried out to reduce the number of variables as much as possible, so that the results would remain statistically equivalent to, if not better than, those obtained using all variables. Our approach is a filter feature selection using a T-test to quantify the

The first model based on ML that was proposed for the prediction of the degree of adherence to the Mediterranean diet depended on information related to different anthropometric variables, socio-demographic variables, nutritional status and self-perception of body image.

Initially, experiments with four different ML methods were performed and feature selection techniques were applied to reduce the dimensionality of the problem. SVM is the best-performing model according to the experimental design after a null hypothesis test, and our study found that, using a feature selection approach, the number of features could be drastically reduced to 16 (less than half of the initial number) while achieving an equivalent performance value in AUROC. The best model obtained was an SVM with an RBF kernel as a decision function. The importance of each one of the predictors cannot be studied because a nonlinear SVM is like a black box and the internal mapping function is unknown. Furthermore, the