Cluster analysis to the factors related to information about food fibers: A multinational study

The adequate intake of dietary fibers is essential to human health. Hence, this study intended to evaluate the level of knowledge about food fibers and investigate what factors might be associated with it. A descriptive cross-sectional study was conducted on a sample composed of 6,010 participants from ten different countries. The survey was based on a self-response questionnaire, approved and complying with all ethical issues. The data collected were subjected to factor analysis (FA) and cluster analysis (CA). Validation was done by splitting the data set into two equal parts for confirmation of the results. FA concluded that ten of the 12 variables used to measure the knowledge about dietary fiber (DF) should be grouped into two dimensions or factors: one linked to health effects of DF (α = 0.854) and the other to its sources (α = 0.644). CA showed that the participants could be divided into three groups: Cluster 1 – good knowledge both about sources and health effects of DF; Cluster 2 – good knowledge about the health effects of DF but poor knowledge about its sources; Cluster 3 – poor knowledge both about sources and health effects of DF. The data were appropriate for analysis by means of FA and CA, so that two factors and three clusters were clearly identified. Moreover, cluster membership was found to vary mostly according to country, living environment, and level of education but not according to age or gender.


Introduction
Dietary fiber (DF) is originally present in plant foods and comprises polysaccharides and lignin, which resist hydrolysis by the enzymes present along the human digestive system. It includes cellulose and hemicellulose or modified celluloses, gums, mucilages, lignin, pectins, and oligosaccharides (Nieto Calvache et al. 2015; Sumczynski et al. 2015). Previously, DF was classified as being soluble or insoluble according to the physiologic effects that were produced by different types of fibers. Nevertheless, the usage of such terminology was discouraged by many agencies and/or organizations, for example in the Institute of Medicine report and by the National Academy of Sciences Panel on the Definition of DF (Slavin 2005, 2008). It has been shown that DF originating from diverse sources can have different physiological and metabolic effects. This is owing to the fact that DF comprises many different macromolecules, each one with distinctive physical and chemical characteristics. For example, the ion exchange capacity and the viscosity are strongly associated with the metabolism of sugars and lipids, while other properties such as particle size and granulometry, fermentation pathways, and bulking effects are strongly associated with the functions in the colon (Guillon and Champ 2000).
Typically, foods like fruits and vegetables or whole grain cereals, seeds, and nuts are very rich in DF, with cereals undoubtedly being the most relevant usual sources of DF. However, the recognition that the DF from fruits and vegetables is of higher quality has contributed to an increased consumption of these foods (Hincapié et al. 2014). A joint communication from WHO/FAO recommends a minimum of 25 g/day of DF, which should, however, be obtained from sources as varied as possible (Carvalho et al. 2009; Martinho et al. 2013). Human diets have changed over time, and recently there has been an increased consumption of refined cereals, meats, added fats (especially saturated fats), and refined sugars, together with a lower consumption of proteins of vegetable origin and DF (Hall et al. 2010; Kendall et al. 2010; O'Neil et al. 2010). Furthermore, diets poor in fiber are normally also poor in micronutrients essential for the human body (like vitamins, dietary minerals, or phytochemicals) and high in sugars, salt, rapidly digested starches, and fats, all factors that contribute to an unhealthy diet (Mann and Cummings 2009). This change in diet, associated with other factors, is largely responsible for the increasing incidence of many diseases (Mann and Cummings 2009; Kendall et al. 2010).
DF has long been recognized as healthy for humans, so much so that the European Food Safety Authority allows health claims about its proven benefits (Mackie et al. 2016). Several beneficial health effects have been attributed to an adequate intake of DF, as demonstrated by numerous scientific studies, both in vivo and in vitro, making DF an essential component of a healthy diet (Macagnan et al. 2015). The benefits of DF extend to the treatment and the prevention of diseases such as diverticular disease, inflammatory bowel disease (Crohn's disease), constipation, cardiovascular disease, obesity, hyperlipidemia, hypercholesterolemia, hyperglycemia, and gastrointestinal-related types of cancer (Kendall et al. 2010; Kaczmarczyk et al. 2012; Stephen et al. 2017). Furthermore, DF has demonstrated the capacity to exchange many cations, especially some toxic ones, thus helping to eliminate them in the feces, and also to absorb some dangerous substances like heavy metals or pesticide residues associated with disease (Hong et al. 2012).
Notwithstanding the positive effects of DF mentioned earlier, some studies also alert to possible negative effects associated with the ingestion of DF, for example, some possible interference with the absorption of compounds like minerals or vitamins (Hernández et al. 1995). Still, it is not probable that adults with a good health status who consume DF according to the recommendations might experience problems related to the absorption of nutrients (Slavin 2008).
Because the health benefits related to DF come directly from an adequate consumption, in line with the recommended dosages, people's attitudes are fundamental to effectively consuming diets that provide the correct amounts of the necessary nutrients and bioactive compounds or functional ingredients. However, dietary patterns are not always the most adequate for many reasons, such as lack of time, stress, and social constraints or simply insufficient information. Therefore, it is believed that knowledge may alter people's behaviors toward healthier food choices. Measuring the level of knowledge about DF may constitute a way of inferring a better or worse global involvement in consuming higher amounts of healthier foods such as vegetables and fruits as well as whole cereals instead of refined ones. These attitudes may in the long term produce good results in terms of better public health and lower the costs associated with some chronic diseases due to poor eating habits (Martinho et al. 2013; Ferreira et al. 2016).
The objective of this study is to characterize clusters of participants according to the level of knowledge demonstrated about DF, assessed through correct or incorrect answers to questions related to DF, on a sample of people from 10 different countries from three continents (Europe, America, and Africa). The knowledge about DF was evaluated based on the following aspects: animal/vegetable origin of DF, the richness in DF of foods made from whole cereals or of fruits with peel, and the possible benefits of DF for improved health in general or for some particular diseases such as cardiovascular diseases, obesity, diabetes, constipation, or some types of cancer. Furthermore, the variables related to knowledge about DF were grouped into factors, which were subsequently used to aggregate the participants into different clusters according to their knowledge. The identification of the clusters and their characterization may give important guidelines for planning educational programs and promoting healthier food choices.

Instrument
The questionnaire used for the survey was structured into different parts, beginning with a first part about the sociodemographic characteristics (age, gender, level of education, country, and living environment), followed by sections with questions about DF and its influences on human health. The respondents were asked to answer on a five-point Likert scale ranging from one (corresponding to totally disagree) to five (corresponding to totally agree). The sentences included in the questionnaire and considered for this study (some of them after inversion) are presented in Table 1.
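The inversion mentioned above can be illustrated with a minimal sketch. It assumes that inversion means standard reverse scoring on the five-point scale (1 becomes 5, 2 becomes 4, and so on); the item names and the choice of which item is inverted are hypothetical, since the text does not list them.

```python
LIKERT_MIN, LIKERT_MAX = 1, 5  # five-point scale described in the text


def invert_likert(score, low=LIKERT_MIN, high=LIKERT_MAX):
    """Reverse-score a Likert answer so that 1 <-> 5, 2 <-> 4 and 3 stays 3."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}..{high}")
    return low + high - score


# Hypothetical respondent; "S2" stands for a negatively worded statement.
responses = {"S1": 4, "S2": 1, "S3": 5}
negatively_worded = {"S2"}  # assumption: the source does not say which items were inverted

scored = {
    item: invert_likert(value) if item in negatively_worded else value
    for item, value in responses.items()
}
print(scored)
```

After inversion, a higher score always means an answer closer to the correct one, which is what allows the items to be combined in the later analyses.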

Data collection
The methodological study was conducted through a survey by means of a questionnaire applied to a sample of over 6,000 participants from ten different countries from Europe, America, and Africa. The countries participating in the study were as follows: Argentina, Croatia, Egypt, Hungary, Italy, Latvia, Macedonia, Portugal, Romania, and Turkey, which had been working together on a multinational framework about food fibers.
In each country, the data collection included people from both genders, with different levels of education and different living environments. It was intended to obtain a sample as diverse as possible in all countries so as to be representative of each reality and social context. Nevertheless, the selection was by convenience in all participating countries. The questionnaire was applied by direct interview only to adult citizens, and each participant answered the questionnaire voluntarily after giving verbal informed consent. Care was taken to include different sectors of the population in terms of age, level of education, sex, and marital status. In addition, geographical area of residence was also considered, so that people from different cities and smaller villages were included from each participating country. All the answers provided by the participants were kept anonymous, and personal data were not collected, so it would be impossible to relate answers to individual participants, thus protecting their privacy. All ethical concerns were taken into account when planning the research, while applying the questionnaire, and when treating the data. The research was previously approved by the ethical committee (REF. 03/2015).

Statistical analysis
The techniques used in this study were factor analysis (FA) and cluster analysis (CA). First, exploratory FA was applied using the principal component analysis (PCA) methodology to observe if there was any kind of aggregation structure between different statements relating to the knowledge about DF. The factors identified were subjected to CA by different methods, some hierarchical and some partitive, to perceive if a cluster structure would come out to classify the participants surveyed.

CA
Five hierarchical methods were applied to the two factors obtained by FA: average linkage (between groups), average linkage (within groups), complete linkage (furthest neighbor), centroid, and Ward. This procedure allowed estimating the most adequate number of clusters to form based on the evaluation of the coefficients obtained in the agglomeration schedule. These solutions were subsequently compared by means of contingency tables to verify potential stability.
After fixing the number of clusters at three, the partitive method of k-means was used, because it is particularly recommended and frequently used in CA (Dolnicar 2002). The k-means method was applied to those solutions that appeared to be more stable, thus eliminating the solution obtained by the complete linkage (furthest neighbor) method. The results showed that the four initial solutions tested all converged to the same final solution, and this was later analyzed for stability by dividing the initial database into two parts for repetition of the CA procedures. According to Dolnicar (2002), this repetition constitutes an easy way to evaluate the confidence in the results of CA.
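The idea of starting k-means from several initial solutions and checking that they converge to the same result can be sketched as follows. This is an illustrative pure-Python version of Lloyd's algorithm, not the SPSS procedure used in the study; the factor scores and both sets of initial centers are hypothetical stand-ins for the centers of two hierarchical solutions.

```python
import math


def kmeans(points, centers, max_iter=25):
    """Lloyd's k-means for 2-D points, started from the given initial centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # an empty cluster keeps its center
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers, labels


# Hypothetical (F1, F2) factor scores forming three loose groups.
points = [
    (1.0, 1.2), (0.8, 1.0), (1.2, 0.9),
    (0.9, -1.0), (1.1, -1.2), (0.7, -0.9),
    (-1.0, -0.8), (-1.2, -1.1), (-0.9, -1.0),
]
# Two hypothetical sets of initial centers.
init_a = [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
init_b = [(0.5, 0.5), (0.5, -0.5), (-0.5, -0.5)]

centers_a, labels_a = kmeans(points, init_a)
centers_b, labels_b = kmeans(points, init_b)
print(sorted(centers_a) == sorted(centers_b))
```

Comparing the sorted centers ignores cluster numbering, which may differ between runs even when the final solutions coincide.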
SPSS software, from IBM Inc., version 22, was used for all the analyses. Figure 1 presents a flowchart explaining schematically the statistical procedures followed in the analyses.

Sample characterization
This study was carried out simultaneously in 10 countries situated in three continents (Europe, America, and Africa), presented in Table 2, which includes a characterization of the sample per country.
The majority of the participants were women (65.7%), with 34.3% of men. The age of the participants varied from 18 to 84 years, being on average 35 ± 14 years, although the average age of the women was slightly lower (34 ± 13 years) when compared to the average age of the men (37 ± 14 years). The results presented in Table 2 further show that, in general, the participants from Egypt were younger (aged 25 ± 9 years), while the participants from Macedonia were older (aged 41 ± 13 years).
Most of the participants had a high level of education (55% had completed a university degree), whereas 42% had completed secondary school and only 3% had completed the lowest level of education (primary school). This trend was observed for most countries, with the exception of Italy, where most of the participants (∼70%) had completed secondary school, followed by Romania, with 47% of participants having completed secondary school.
Most of the participants lived in an urban environment (80.2%), while 19.8% lived in rural areas. In most countries, the majority of the participants were from urban zones, but in the case of Egypt, most of the participants were from rural areas.

Evaluation of adequacy of data
The correlation matrix confirmed that there were some associations between the variables, with 20 values higher than 0.4. The highest value was 0.627, which corresponded to the correlation between the variables V-6 and V-7. The values reflect some important correlations between the variables, thus making it possible to apply the technique of FA. The results of Bartlett's test also confirmed that FA could be applied to this problem because the p-value was significant (p < 0.001), hence leading to the rejection of the null hypothesis H0: "The correlation matrix is equal to the identity matrix." The KMO value was good (0.850) according to the classification proposed by Kaiser and Rice (1974), thus confirming the suitability of the data to be submitted to PCA and FA. The analysis of the anti-image matrix (Table 3) revealed that none of the values of measure of sampling adequacy (MSA) was less than 0.5, which implies that all the variables were adequate to be included in the analysis.
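As an illustration of Bartlett's test of sphericity referred to above, the sketch below computes the usual test statistic, chi2 = -[(n - 1) - (2p + 5)/6] ln|R| with p(p - 1)/2 degrees of freedom, where R is the correlation matrix of p variables measured on n cases. The 3 × 3 correlation matrix and the sample size are hypothetical, and the p-value (obtained from a chi-square distribution) is deliberately left out to keep the example free of external libraries.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom) for H0: corr is the identity."""
    p = len(corr)
    chi2 = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(determinant(corr))
    dof = p * (p - 1) // 2
    return chi2, dof


# Hypothetical correlation matrix for three variables and 200 respondents.
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
chi2, dof = bartlett_sphericity(corr, n_obs=200)
print(round(chi2, 2), dof)
```

The further the correlation matrix is from the identity, the smaller its determinant and the larger the statistic, which is why a significant result supports the use of FA.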

FA solution with Varimax rotation and extraction by PCA
The rotated solution obtained from the analysis by FA with PCA resulted in three components according to the Kaiser criterion of retaining eigenvalues greater than 1 (3.642, 2.328, and 1.828 in the present case), and this was also confirmed by the scree plot (graph not shown). The percentages of total variance explained by the three factors were as follows: F1, 27.9%; F2, 17.8%; F3, 14.0%, with a total variance explained of 59.7%. The variable V-1 had the largest fraction of its variance explained by the solution, corresponding to 79.7%, followed by variable V-4, with 75.1% of the variance explained. Only the variable V-10 had a lower communality (0.388), indicating that only about 40% of its variance was explained by the solution extracted by FA, and all other variables had communalities higher than 0.400.
The rotation algorithm converged in four iterations and produced three factors (Table 4). One factor (F1) was clearly linked to the associations between DF and different benefits for human health; the second factor (F2) was related to the statements about the origin of DF; and the third factor (F3) was associated with the statements that referred to the foods with higher content of DF. In relation to factor 1, all loadings were relatively high, with the lowest being 0.595 for variable V-10 followed by 0.612 for variable V-11, indicating that the answers obtained for the effects of DF on constipation and breast cancer did not contribute as strongly to the definition of this factor as the variables with higher loadings, such as V-7, V-6, and V-9 (with loadings of 0.786, 0.756, and 0.746, respectively). Factor 1 is therefore more strongly associated with the effect of DF on cholesterol, cardiovascular diseases, and obesity. The results presented in Table 4 also reveal that the correlations of the variables most strongly linked to factor 2 are considerably higher (greater than 0.8) than those of the variables most strongly linked to factor 1 (less than 0.8), indicating that people were generally aware of the plant nature of DF. Finally, factor 3 also had variables with high loadings (0.703 and 0.866, corresponding to the content of DF in whole foods and in fruits with the peel, respectively). Since all the variables with loadings higher than 0.4 were encompassed in the solution, this is a satisfactory solution when including all 12 variables (Stevens 2009). Moreover, this solution produced a grouping pattern that can be easily interpreted.

Validation of the solution by Cronbach's alpha
The validation was achieved by calculating Cronbach's alpha (α), which determines the internal consistency
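For reference, Cronbach's alpha for a set of k items is α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The sketch below is a minimal pure-Python version; the item scores are hypothetical and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha.

    items: one list of scores per item, all of the same length
    (one score per respondent).
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical answers of five respondents to three items of one factor.
items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 1, 4],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```

Values close to 1 indicate that the items of a factor vary together; low values signal an inconsistent factor.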

Hierarchical clustering analysis (agglomeration methods)
The CA was applied to the data obtained by FA, but considering only factors 1 and 2, in view of the results obtained for Cronbach's alpha for factor 3, which meant that this factor was not consistent. The CA was applied by different hierarchical methods to determine the most adequate number of clusters: average linkage (within groups), average linkage (between groups), complete linkage (furthest neighbor), centroid, and Ward. Figure 2 presents the coefficients (corresponding to the distances) as a function of the number of groups obtained by two of the methods (centroid and average linkage within groups). The last 20 values obtained in the agglomeration schedule were used since the others were considerably smaller and therefore negligible. Both graphs shown in Figure 2, as well as those corresponding to the other methods (not shown), suggest the formation of three groups because the coefficients beyond that point tend to stabilize, leading to the conclusion that the ideal number of clusters was three. The solutions obtained with the five hierarchical methods for the case of three clusters were subsequently compared by means of contingency tables, and the resulting similarities between the solutions are shown in Table 6. The percentages indicated that the solutions obtained by the centroid and average linkage (between groups) methods were the most similar, with a very high percentage of the cases allocated to the same clusters (96%). The Ward and average linkage (within groups) solutions also presented a high similarity (87%).
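A comparison of two cluster solutions through a contingency table can be sketched as below. The text does not state exactly how the percentages in Table 6 were computed, so this is only one plausible reading: the share of cases that fall in matching clusters under the most favorable relabeling of the second solution. The label vectors are hypothetical.

```python
from collections import Counter
from itertools import permutations


def contingency(labels_a, labels_b):
    """Count how many cases fall in each (cluster in A, cluster in B) cell."""
    return Counter(zip(labels_a, labels_b))


def best_agreement(labels_a, labels_b):
    """Share of cases in matching clusters under the best relabeling of B."""
    table = contingency(labels_a, labels_b)
    groups_a = sorted(set(labels_a))
    groups_b = sorted(set(labels_b))
    if len(groups_a) != len(groups_b):
        raise ValueError("both solutions must have the same number of clusters")
    # Cluster numbers are arbitrary, so try every pairing of A-clusters with B-clusters.
    best = max(
        sum(table[(a, b)] for a, b in zip(groups_a, perm))
        for perm in permutations(groups_b)
    )
    return best / len(labels_a)


# Hypothetical three-cluster solutions from two hierarchical methods.
solution_a = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
solution_b = [2, 2, 2, 3, 3, 1, 1, 1, 1, 1]
print(best_agreement(solution_a, solution_b))
```

Trying all relabelings is feasible here because only three clusters are involved (six permutations).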

K-means clustering analysis
From the obtained results, it was concluded that the advised number of clusters was 3 and the possible initial solutions to use by the k-means method are centroid, ward, average linkage (within groups), and average linkage (between groups) due to the high similarity, indicative of potential stability.
Conveniently, the k-means method applied to the four different initial solutions obtained by the hierarchical methods converged to the same final solution after fewer than 25 iterations (Table 7), with the same cluster centers (Table 8; only the cluster numbering differed). The fact that all four initial solutions converged to the same final solution is indicative of stability. The values of the F statistic in ANOVA are high, thus confirming the resemblance between the cases within the groups and the dissimilarities between groups. The values of F further show that the two factors contribute equally to the discrimination of the groups because they are of the same order of magnitude for both factors (Factor 1: DFH, DF and health; Factor 2: ODF, origin of dietary fiber; Table 7). In the final solution, two of the clusters gather approximately 2,000 members (2,069 and 2,044, more precisely), while the third cluster has slightly fewer members (1,743).

Analysis of stability
To evaluate if the solution was stable, the database was separated into two parts that were then treated separately, with a random selection of cases for each half. The techniques used were similar to the treatment applied to the global data set, but in this case, the number of clusters was already fixed at 3, and only one initial solution was used for the k-means (obtained by the Ward method). Table 9 presents the results obtained for each of the halves together with those for the global solution to allow an easier comparison. Convergence was achieved for both data sets, and the values of F are high in all cases (varying from 1323.8 to 4762.1), being very similar between the two parts. The final solutions resulting from the analysis are also considerably similar, taking into account the group central coordinates and the composition of each group (Table 9). Figure 3 shows the location of the centers of the three clusters for the global data set as well as for both parts, which are basically coincident. Thus, splitting the whole data set led to the same solution, confirming the previously noticed trend for stability.

Interpretation of the results
The results of the final solution (whole sample) are presented in Figure 4. Cluster 1, which corresponds to 35% of the cases, had a high positive value of F2 (related to knowledge about the origin of DF) and a low positive value of F1 (knowledge about the health benefits of DF). This indicates that these individuals have a very good knowledge about the origin of DF (high above the average, corresponding to the origin of the referential) and a reasonable knowledge about the health benefits (slightly above the average). Cluster 2 also includes 35% of the cases and corresponds to a positive F1 but negative F2, thus indicating individuals with a knowledge above average about the health benefits of DF but under the average about its origin. Cluster 3 corresponds to 30% of the cases, and both values for F1 and F2 are negative, indicating a lower than average level of knowledge regarding either the origin or the health effects of DF. Therefore, the groups can be described as follows:
• Cluster 1 – good knowledge both about sources and health effects of DF.
• Cluster 2 – good knowledge about health effects of DF but poor knowledge about the sources of DF.
• Cluster 3 – poor knowledge both about sources and health effects of DF.
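The reading of the cluster centers described above relies on the sign of each coordinate, since zero corresponds to the sample average of a factor score. A small sketch of that rule is shown below; the center coordinates are hypothetical values chosen only to mimic the pattern described, not the values reported in the tables.

```python
def describe_cluster(center_f1, center_f2):
    """Label a cluster by the sign of its center on each factor (0 = sample average)."""
    health = "above average" if center_f1 > 0 else "below average"
    origin = "above average" if center_f2 > 0 else "below average"
    return f"health effects: {health}; origin: {origin}"


# Hypothetical centers as (F1, F2); F1 = health effects, F2 = origin of DF.
centers = {1: (0.2, 1.1), 2: (0.6, -0.7), 3: (-0.9, -0.5)}
for cluster, (f1, f2) in centers.items():
    print(f"Cluster {cluster} -> {describe_cluster(f1, f2)}")
```

The sign alone does not capture magnitude, so the distinction between "very good" and "reasonable" knowledge in cluster 1 still requires looking at the actual coordinates.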

Cluster characterization
Regarding age, cluster 1 had the highest average age of its members (36.5 ± 13.6 years), followed by cluster 2 (34.0 ± 13.5 years) and finally by cluster 3, with the lowest average age (32.7 ± 14.0 years). As for gender, all clusters were mostly composed of women, representing 69.9% in cluster 2, 66.6% in cluster 1, and slightly less in cluster 3 (59.5%).
The association between cluster membership and level of education is presented in Table 10. While clusters 1 and 2 were mainly composed of people with the highest level of education (university degree) (59.3% and 57.8%, respectively), in cluster 3, most of the individuals had a secondary level of education (48.9%), although closely followed by those with a university degree (46.6%).
These results seem to indicate that the level of education influenced the level of knowledge about DF, since the lower knowledge found for the individuals in cluster 3 could be attributed to their lower educational level.
Regarding the living environment, most of the members of the three clusters lived in an urban environment although these percentages were higher for clusters 1 and 2 (85.1% and 80.4%, respectively) compared to cluster 3 (75.1%). Again, the level of knowledge seems to be related to the living place because cluster 3 shows the lowest knowledge and the percentage of people living in rural areas is higher.
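The percentages reported for education and living environment are shares within each cluster. The sketch below shows how such within-cluster percentages can be derived from individual records; the records are hypothetical and only illustrate the calculation.

```python
from collections import Counter, defaultdict


def within_cluster_percentages(pairs):
    """pairs: iterable of (cluster, category); returns percentages within each cluster."""
    counts = defaultdict(Counter)
    for cluster, category in pairs:
        counts[cluster][category] += 1
    return {
        cluster: {
            category: 100 * n / sum(counter.values())
            for category, n in counter.items()
        }
        for cluster, counter in counts.items()
    }


# Hypothetical (cluster, level of education) records.
records = (
    [(1, "university")] * 3
    + [(1, "secondary")] * 1
    + [(3, "university")] * 2
    + [(3, "secondary")] * 2
)
shares = within_cluster_percentages(records)
print(shares)
```

Because each cluster's percentages sum to 100, clusters of different sizes can be compared directly, which is how the education and urban/rural profiles are contrasted in the text.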
The association between country and the cluster membership is presented in Table 11. The results show that the participants from the different countries are

Discussion
The data collected in the present survey were suitable for application of FA based on the correlation matrix, the KMO value, and Bartlett's test of sphericity (Broen et al. 2015). All procedures for FA and CA were followed, and the data were fairly described by the factorial solution according to the sources of DF or its health effects, as indicated by the results presented in Table 5, i.e., high loadings for variables V-5 to V-12 in factor 1 and for variables V-1 and V-2 in factor 2.
Regarding the origin of DF, some people seem not fully aware that DF comes from plant foods and much less what type of compounds it comprises (Martinho et al. 2013). It is difficult for people to know that DF consists of lignin and polysaccharides and resists hydrolysis by enzymes present along the human digestive system (Nieto Calvache et al. 2015; Sumczynski et al. 2015). However, it is less complex for people to identify some typically DF-rich foods such as whole grain cereals, fruits, vegetables, or nuts and seeds. Cereals are undoubtedly the most recognized sources of DF, and people may know the difference between whole and refined cereals. Still, their choices may not always follow their knowledge but rather their preferences and habits (Guiné et al. 2016b). The knowledge about the sources of DF was assessed through statements V-1 to V-4 in Table 1. However, some of the statements were not included in the final solution (Table 5), namely, those regarding the comparison between foods made from whole cereals and from refined cereals or the fact that fruit peels are quite rich in DF. This indicates that the way the participants answered those particular questions was not robust enough, revealing a possible lack of knowledge about those aspects.
The prevention of constipation (Martinho et al. 2013) is one of the better known health benefits related to the consumption of DF. Nevertheless, many more effects can be attributed to DF, namely, preventing diseases affecting the intestine and colon, blood glycaemia, cardiovascular diseases, and serum cholesterol or cancer affecting the gastrointestinal system (Kendall et al. 2010; Kaczmarczyk et al. 2012). The knowledge about these effects is somewhat variable, but still far from high levels, as demonstrated in a previous study undertaken solely in Portugal aimed at investigating the same issues (Martinho et al. 2013). Nevertheless, it is important to notice that it is very difficult to identify what would be a satisfactory level of knowledge expected for the populations because there are no guidelines about it. While FAO and other similar organizations recommend daily intakes of DF, because these are easily assessed through nutrient calculations, the level of knowledge about any subject is far more difficult to measure, and a standard or desired level is even harder to define. When people demonstrate higher knowledge about any subject, they are able to make more sustained choices, and this is valid for many situations, not only eating habits (Dixon and Burton 2014; Hoek 2015; Ghanouni et al. 2016; Salomaa et al. 2016). In the particular case of the knowledge about DF, it is believed that it may have important consequences for public health, given the many benefits associated with its consumption. In this study, a significant part of the participants demonstrated a high level of knowledge about the health benefits of DF (factor 1), corresponding to the members of clusters 1 and 2, who together account for 70% of the participants (Table 9). Hence, and given the wide coverage of the study undertaken, one could infer that in general people are aware of the positive effects of DF for improving human health.
There are many factors that can contribute to the adoption of less healthy lifestyles and eating patterns. The findings from this study suggest that the data were appropriate for CA, and the sample was divided into three clusters according to their level of knowledge on the sources of DF and their health benefits. The CA allowed concluding that the participants in this study could be grouped into clusters revealing a high to low level of knowledge about DF, according to the scale defined in this study. The results showed that while there were very positive indicators about the knowledge concerning the health effects of DF, with 70% of the participants demonstrating a positive value for F1, the factor related to the health effects of DF (clusters 1 and 2), only 35% of the participants revealed a positive value of F2, the factor related to the origin of DF (cluster 1) (Table 9 and Figure 4). These variations could be attributed to the level of education or the living environment more than any other variables evaluated. The people who revealed a higher level of information tend to have higher levels of education and live in urban areas. More educated people tend to be more curious, critical, and concerned about all aspects related to their lives, including information about what they eat and what this brings them (Guiné et al. 2016c). Because the variables influencing the knowledge were identified as living environment and qualifications, future actions to disseminate information about these topics could be designed to target subjects according to those factors. The effectiveness of educational programs intended to improve the quality of life through a better diet based on the knowledge about health benefits of certain foods certainly relies on the ability to reach the target population.
This study may help to reach people more efficiently according to their living environment and level of education, with emphasis on those living in a rural environment and with lower education levels. Still, it is important to plan any interventions according to the particular characteristics of the participating countries, namely, in terms of culture, education, and health policies. This study focused on the knowledge about sources and benefits of DF, and it might provide new insights on how to improve that knowledge, although better knowledge does not necessarily imply a direct or easy change in behavior or practices.
This research involved different countries, and the choice was intended to cover different regions of the globe; however, because it was not possible to include more countries in the survey, the ones included were selected by convenience. Because this study was developed in several countries from different parts of the globe, namely North, Central, and South Europe, North Africa, and Latin America, its conclusions could be understood, to some extent, as having a broad international coverage. A limitation of this study relates to the heterogeneity in the number of cases obtained in each of the participating countries, which led to different representations, although the overall number of participants was high (over 6,000).

Conclusion
The FA permitted concluding that, of the 12 variables initially used to assess the level of knowledge about DF, ten could be grouped into two factors, the first associated with knowledge about health benefits of DF and the second with the knowledge about the sources of DF. This grouping structure of the variables was subsequently subjected to CA, concluding that the participants in this study could be distributed between three groups based on their knowledge about DF. The first group included people with a good level of knowledge both about the sources and the health benefits of DF; the second group included people with a good level of knowledge about the health effects of DF but a poor level of knowledge about the origin of DF; and finally the third group included people with a poor knowledge both about the sources and the health effects of DF. It was further observed that the level of education, the country, and the living environment of the members of the three clusters could be linked to their level of knowledge.