Women’s Perception of Attractiveness of Men’s Faces Inversely Correlates with Men’s Serum Testosterone During the Fertility Phase of the Menstrual Cycle

The attractiveness of the human face may signal the genetic suitability of a mate. The ‘ovulatory shift hypothesis’ postulates that women in the fertile phase of the menstrual cycle prefer faces of masculine men that signal ‘good genes’, whereas in the non-fertile phase they prefer good parental providers. We studied relationships between serum total testosterone and face attractiveness of 77 healthy men (20-29 years, mean±SD 22.44±1.79) as rated by 19 healthy women (20-27 years, mean±SD 22.84±1.96) on day 13 of their menstrual cycle. Using advanced Bayesian multilevel modeling, we showed that the attractiveness of faces is negatively associated with the concentration of serum testosterone in the men, even after taking into account the concentration of serum estrogen in the raters. The average face composited from the 39 faces rated above the pool median attractiveness was slightly narrower than the average face composited from the 38 less attractive faces. Our results challenge the ‘ovulatory shift hypothesis’, as faces of men with higher circulating testosterone were rated as less attractive than faces of men with lower testosterone by women in the fertile phase of the cycle.


Introduction
The attractiveness of the human face may signal the genetic suitability of a potential mate. Several theories of mate selection attempt to explain women's preferences [22]. A recent hypothesis called the "ovulatory shift hypothesis" [23][24][25][26][27][28] postulates that women's preferences change across the menstrual cycle. Specifically, women desire men with traits signaling 'good genes' in the fertile phase, whereas in the non-fertile phase they prefer 'good fathers' with traits signaling willingness to provide parental care. As men with 'good genes' may not be willing to invest in a long-term relationship [29,30], women tend to obtain these 'good genes' through short-term sexual affairs during the fertile phase, regardless of long-term relationships. Such a cyclic shift of preferences is supposed to serve women's desire to obtain the best genes for their children and to ensure long-term access to material resources from 'good fathers'.
The ovulatory shift hypothesis is disputed [31][32][33][34][35]. Very low rates of cuckoldry, in the range of 0.73 to 3% in western societies and historical populations, challenge the main concept of the hypothesis [36,37]. The theory assumes that women can identify 'good genes' based on men's phenotypic testosterone-related features such as facial masculinity and dominance [38]. Testosterone is supposed to suppress the immune system, so that only men with 'good genes', which encode resistance to common pathogens, could afford to maintain a high testosterone concentration and develop masculine features without becoming infected [39,40]. However, dominance [41], facial masculinity [42,43], and higher testosterone are not related to superior actual health or stronger disease resistance [44,45], or to overall mating success [46]. There is little support for claims that health increases mating success in relatively healthy humans [47]. This argues against the role of testosterone as an indicator of 'good genes'.
Given the importance of male testosterone in mating theory, we examined relationships between men's serum concentration of testosterone and women's preferences for men's faces on the fertile day of their menstrual cycle. We also tried to identify preferred facial features by graphically comparing composite images obtained from groups of attractive and unattractive men as rated by women.

Study design
This was a prospective analytical observational study exploring associations between the attractiveness of male faces, rated by female participants on the 13th day of their menstrual cycle, and the men's serum total testosterone (TT), taking into account the raters' serum sex hormone concentrations. The protocol of the study was approved by the Bioethics Committee of the Medical University of Bialystok, Poland. Each participant gave written informed consent to take part in the study and to the publication of identifying images in an online open-access publication. Participants were not paid. All research was performed in accordance with the Declaration of Helsinki.

Population and study participants
We recruited healthy women, as determined by the most recent general medical examination and standard laboratory tests, who had no major health problems in the past, were not pregnant, had not used hormonal contraceptives, had regular, uncomplicated menstrual cycles (25 to 35 days) in the last 6 months, were heterosexual, had never smoked or taken drugs, had not abused alcohol, and were not taking any concomitant medication. Only participants 19 to 31 years old were included, to capture those with the highest reproductive potential. We included men based on self-reported health status and non-smoking status. The women were not acquainted with the men.

Hormonal status
In women, a blood sample for measurement of 17β-estradiol (E2) and progesterone (PG) concentrations was drawn on the 13th day of the menstrual cycle, after 12 hours of fasting and 72 hours of abstinence from vigorous physical exercise and alcohol intake. In men, a blood sample for measurement of TT concentration was taken at the university hospital laboratory between 7 a.m. and 9 a.m., after 12 hours of fasting and 72 hours of abstinence from vigorous physical exercise and alcohol intake. The concentration of TT was measured according to standard clinical practice using a chemiluminescent microparticle immunoassay on an Architect ci100 device (Abbott, USA) with the Alinity 2nd Generation Testosterone Reagent Kit (Abbott, Germany). Intra-assay coefficients of variation (CV) given by the manufacturer were 3.5%, 2.5%, and 2.3% for mean TT concentrations of 0.32, 2.53, and 8.43 nmol/L in control samples, respectively. Abbott reports inter-assay CVs of 8.1%, 3.8%, and 2.6%, respectively, for the analogous low, medium, and high TT-level control samples. Intra- and inter-assay CVs for a human serum panel were calculated at 2.8% and 3.1%, respectively, for a mean TT concentration of 2.30 nmol/L. The assay is linear across the measuring interval of 0.15 to 64.57 nmol/L. We employed ARCHITECT Estradiol and Progesterone (Abbott, Ireland) kits to measure E2 and PG, respectively.

Assessment of face attractiveness
A professional photographer took frontal-view pictures of male faces in standardized studio conditions, against a white background and with identical lighting. Clean-shaven male participants were asked to maintain a neutral facial expression with their mouths closed while seated upright on a chair. They were photographed using a professional digital camera (ILCE-6500, Sony, Japan), with a horizontal and vertical picture resolution of 350x350 DPI and a matrix size of 6000x3376 pixels, with an exposure time of 0.01 s. Flashlights were arranged to yield relatively natural lighting conditions. The distance of each face from the camera was constant at 1.8 m.
We used a program developed locally for this purpose (Inspeerity, Bialystok, Poland) to present pictures of faces on a computer screen and to capture each female rater's assessment of attractiveness on an ordinal scale from 1 to 10 (10 indicating the most attractive). The raters were instructed how to operate the program, and they assessed 10 phantom faces as a warm-up just before starting the trial. Thereafter, pictures of all faces were displayed in one session to each rater independently, in random order and in standardized conditions, between 8:00 a.m. and 12:00 noon. Seven seconds were allotted for each face assessment, and the time flow was displayed graphically at the bottom of each display.

Creation of composite images of faces
We used programs written in the Python programming language (OpenCV, https://opencv.org; and Dlib, http://dlib.net) to combine images of individual faces and generate composite faces for comparison. First, we detected and extracted facial landmarks (salient regions of the face such as the eyes, eyebrows, nose, mouth, and jawline) from every image using the Multi-PIE landmarking scheme of the iBUG 300-W dataset [48,49]. We applied a pre-trained face shape predictor to the face in each image to estimate the locations of 68 (x, y) coordinates that map to facial structures. The predictor was trained on the iBUG 300-W face landmark dataset. The facial landmark detector we used is an implementation of the One Millisecond Face Alignment with an Ensemble of Regression Trees approach [50].
In the second step, we normalized the faces and placed them in the same reference frame. Then we aligned the facial features of all input images. Subsequently, we calculated the average of all landmarks in the output image coordinates by simply averaging the x and y values. Then we used the 68 feature points of each image, as well as points on the boundary of the image, to calculate a Delaunay triangulation, which allows us to break the image into triangles. Then we aligned these regions to the average face. The vertices of the corresponding triangles were used to calculate an affine transform (a composition of transformations such as translations, scaling, and rotations). This affine transform was then used to transform all pixels inside the triangle. This procedure was repeated for every triangle in the input image [51]. To calculate the average composite image, we averaged the pixel intensities of all warped images.
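The landmark-averaging and per-triangle affine steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function names (mean_landmarks, affine_from_triangles, apply_affine) and the three-point toy "faces" are hypothetical, and a real pipeline would typically rely on OpenCV routines such as cv2.getAffineTransform and cv2.warpAffine to warp pixels rather than individual points.

```python
def mean_landmarks(landmark_sets):
    """Average corresponding (x, y) landmarks across several faces."""
    n = len(landmark_sets)
    return [
        (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
        for pts in zip(*landmark_sets)
    ]


def affine_from_triangles(src, dst):
    """Return ((a, b, tx), (c, d, ty)) mapping triangle src onto triangle dst."""
    (x0, y0), (x1, y1), (x2, y2) = src
    # Twice the signed area of the source triangle; zero means it is degenerate.
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("source triangle is degenerate")
    rows = []
    for k in (0, 1):  # k = 0 solves for the x' row, k = 1 for the y' row
        u0, u1, u2 = dst[0][k], dst[1][k], dst[2][k]
        # Cramer's rule on the two edge vectors leaving vertex 0.
        a = ((u1 - u0) * (y2 - y0) - (u2 - u0) * (y1 - y0)) / det
        b = ((x1 - x0) * (u2 - u0) - (x2 - x0) * (u1 - u0)) / det
        t = u0 - a * x0 - b * y0
        rows.append((a, b, t))
    return tuple(rows)


def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


# Toy example: two "faces" with three landmarks each (hypothetical values).
faces = [
    [(0, 0), (2, 0), (0, 2)],
    [(2, 2), (4, 2), (2, 4)],
]
average_shape = mean_landmarks(faces)
warp = affine_from_triangles(faces[0], average_shape)
warped_vertices = [apply_affine(warp, p) for p in faces[0]]
```

In this toy case the first face differs from the average shape by a pure translation, so the recovered matrix has an identity linear part; with real landmarks each Delaunay triangle would get its own matrix.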

Statistical analysis
We used statistical software (SYSTAT 12, Systat Software, USA) to perform basic data analyses.
To rank facial attractiveness, we used the median of all scores assigned to each individual male face by the raters, and then we classified all faces as attractive or non-attractive based on the median of these per-face medians. We calculated the mean, median, and SD of serum hormone concentrations. We used the Shapiro-Wilk test to assess the normality of the continuous data.
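The two-stage median classification can be sketched as follows. This is an illustration only: the function name median_split, the face labels, and the scores are hypothetical, and assigning faces that tie with the pool median to the attractive group is an assumption, since the text does not state how ties were handled.

```python
from statistics import median


def median_split(ratings_by_face):
    """Classify faces by comparing each per-face median with the pool median."""
    # Stage 1: median of all raters' scores for each face.
    face_medians = {face: median(scores) for face, scores in ratings_by_face.items()}
    # Stage 2: median of the per-face medians.
    pool_median = median(face_medians.values())
    # Assumption: ties with the pool median count as "attractive".
    groups = {
        face: "attractive" if m >= pool_median else "non-attractive"
        for face, m in face_medians.items()
    }
    return face_medians, pool_median, groups


# Hypothetical scores from three raters for three faces.
ratings = {"face_A": [7, 8, 6], "face_B": [3, 4, 5], "face_C": [5, 5, 6]}
face_medians, pool_median, groups = median_split(ratings)
```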
We used advanced Bayesian multilevel modeling (BMM) to model the association between the rate, which is the primary outcome (response) variable, and 1) TT, and 2) TT adjusted for the effect of E2, as explanatory variables. The ordinal response is assumed to originate from the categorization of a latent continuous variable. Cases (data points) from the same rater may be correlated. BMM can model data measured on different levels at the same time, thus taking a complex dependence structure into account. In multilevel modeling, the response variable y follows a probability distribution D from a given family, y_i ~ D(f(η_i), θ), where y_i is the value of y for case i, η_i is the linear predictor for case i, f is a transformation of the linear predictor, and θ denotes additional model parameters. The linear predictor can be written as η = Xβ + Zµ. In this equation, β and µ are the coefficients at the population level and the cluster level, respectively. X and Z are the corresponding design matrices, and µ follows a multivariate Gaussian distribution with mean zero and an unknown covariance matrix. The model parameters are (β, µ, θ).
In the first multilevel model, the rate was the response variable with family cumulative and TT was a predictor. We also incorporated an intercept varying by raters. That is, our model was Rate ~ TT + (1|rater). In the second model, we included TT and E2 and incorporated an intercept varying by raters.
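To make the cumulative family concrete, the sketch below computes rating-category probabilities from a linear predictor of the form Rate ~ TT + (1|rater). Several things here are assumptions rather than details from the study: the logit link (the link function is not stated in the text), the nine threshold values, the rater intercept, and the two log10 TT values. Only the TT coefficient of -1.05 is taken from the reported results.

```python
import math


def logistic(x):
    """Standard logistic function, used here as the assumed inverse link."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_predictor(beta_tt, log_tt, rater_intercept=0.0):
    """eta = beta * log10(TT) + rater-specific intercept, as in Rate ~ TT + (1|rater)."""
    return beta_tt * log_tt + rater_intercept


def category_probabilities(thresholds, eta):
    """P(rate = k) = F(tau_k - eta) - F(tau_{k-1} - eta) for ordered thresholds tau."""
    cumulative = [logistic(tau - eta) for tau in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


def expected_rating(probabilities):
    """Mean rating on the 1..K scale implied by the category probabilities."""
    return sum((k + 1) * p for k, p in enumerate(probabilities))


# Hypothetical thresholds separating the 10 rating categories.
thresholds = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
beta_tt = -1.05  # coefficient of TT reported in the Results

probs_low_tt = category_probabilities(thresholds, linear_predictor(beta_tt, 1.0))
probs_high_tt = category_probabilities(thresholds, linear_predictor(beta_tt, 1.5))
```

With a negative coefficient, a higher log10 TT lowers the linear predictor, which shifts probability mass toward the lower rating categories.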
In BMM, inference is based on the posterior distributions, which are generated by Markov Chain Monte Carlo (MCMC) [52]. Relative to non-Bayesian multilevel models, BMM allows inference on any quantity of interest to be drawn from the posterior distributions. Classic MCMC methods converge slowly. To address this limitation, we used Hamiltonian Monte Carlo, which converges quickly because of its ability to avoid random walk behavior [53]. The MCMC parameters were: 4 chains, each with 2000 iterations, of which the first 1000 were warm-up. Histograms (Figure 1) show skewed distributions of rates and TT; we therefore applied a log10 transformation to fulfill the condition of Gaussian distribution in modeling.

Results
We found a significant association of males' serum TT concentrations with attractiveness ratings. The coefficient of TT was -1.05 (estimate error was 0.3, 95% confidence interval (CI) was [-1.64, -0.48]). The higher the TT concentration, the lower the attractiveness score men received from women (Figure 2).
After adjusting for E2, the association of TT with scoring was still significant. For TT, the coefficient was -1.05 (estimate error 0.29, 95% CI: -1.63, -0.50). The effect of E2 was marginal; the coefficient was 0.56 (estimate error 0.52, 95% CI: -0.50, 1.54). Lower scores of face attractiveness were assigned to images of faces of men with higher TT concentration, regardless of the E2 concentration in female raters. To facilitate geometric comparison of composite faces, we measured the width and length of composite faces in relation to the composite face from 77 faces (Figure 4). The more attractive faces were slightly smaller than average, while the less attractive were slightly bigger.

Discussion
We have shown that faces of men with a higher concentration of circulating testosterone were rated as significantly less attractive by young women on the 13th day of their menstrual cycle, within the fertile phase. This result challenges the hypothesis that women in the fertile phase prefer masculine, high-testosterone men [54]. If such men had been preferred by hundreds of generations of ancestral women, as the hypothesis suggests, androgen-associated traits such as a massive jaw and cheekbones, prominent brow ridges, increased musculature, and dominant aggressive behavior would have been amplified by the forces of natural intra-sexual selection [54,55]. In contrast, several archeological and anthropometric studies have provided compelling evidence that the genus Homo, which includes humans, Neanderthals, and other ancestors, has evolved toward smaller faces over time, with Homo sapiens showing the greatest reduction in size [56,57]. Analyses of numerous human fossils suggest a rapidly shrinking face, rapid evolution of mandibular shape and size, and rampant brain growth [57][58][59][60]. Hence the forces of evolution worked against face masculinization and rather favored the development of cognitive abilities.
We showed that the composite attractive male face is slightly smaller than the less attractive one. To the best of our knowledge, we provide for the first time empirical support for the inverse relationship between serum TT in men and their facial attractiveness as scored by young women during their fertile phase. In contrast, Roney et al. reported that women prefer natural faces of men with high saliva testosterone [71]. Their raters, however, scored men's faces on random days of their menstrual cycle, and their fertility window was merely guessed based on the calendar method and on estimates of hormonal concentration from data published in other reports. Although each woman rated faces only once, Roney et al. did not provide crucial information on the distribution of ratings across the cycle. In particular, they did not report how many women were in their fertile days. Also, we studied Caucasian men and women, while Roney et al. studied samples of different ethnic groups.
We measured the serum concentration of TT using a standardized, well-validated clinical methodology between 7 a.m. and 9 a.m., because the concentration can fall by up to 30-35% in the late afternoon [72]. Roney et al. measured salivary 'free' testosterone at various times of the day and regressed transformed testosterone concentrations onto the individual attractiveness ratings of faces. A simple linear regression may not be the right tool to explore the association of ordinal, correlated scores of unknown distribution with testosterone values. Furthermore, they did not explain how the regression coefficients and variances from all 75 individual linear regression analyses were pooled.
A more sophisticated statistical approach is needed to model positive correlation among raters' attractiveness scores of the presented facial images, which could be the result of sequentially dependent attractiveness perception or sequentially dependent response bias [73,74]. For example, if a rater's scoring criterion gradually changes over time (e.g., the rater tends to give higher ratings at the beginning of the experiment and lower ratings at the end of the experiment), then the autocorrelation of the rater's scoring criteria will lead to a positive correlation between current and previous ratings.
Peters et al. also investigated an association between saliva testosterone and the attractiveness and masculinity of the face and body in photographs of 119 young men, rated by two independent groups of 12 females [75]. Most of these women were taking hormonal contraceptives, which could have altered any potential menstrual cycle effects on ratings. Testosterone was not correlated with either attractiveness or masculinity; however, the authors implied that the correlation between testosterone and attractiveness was more likely to be negative. Also, Neave et al., who used natural photographs of 48 men and 36 female raters, did not find an association between salivary testosterone and attractiveness or masculinity [76]. They did not report, however, in which phase of the menstrual cycle their female raters were during scoring.
We measured the concentration of serum TT, while most investigators rely on measurements of 'free' testosterone in saliva [71,75,76]. Despite the speed and ease of saliva collection and the avoidance of the stress of venipuncture, multidisciplinary clinical experts recommend serum TT as the first-line, reliable indicator of the physiological function of the gonads, with low analytical variation (precision about 4-10%) and close correlation with calculated bio-testosterone and free testosterone [77,78]. It is uncertain whether saliva testosterone represents the so-called 'free' fraction, due to its binding to albumin, proline-rich proteins, and steroid hormone-binding globulin [79,80].
We allowed each rater 7 seconds to assess each photograph of a man's face in order to capture the first impression of attractiveness, which is most likely rapid, automatic, and mandatory [87]. Raters could have reliably judged the attractiveness of faces presented for just 13 ms [61], but we selected a longer time and a random display order of images to minimize the systematic bias of perceived face attractiveness toward faces seen up to several seconds before. Xia et al. demonstrated that perceived face attractiveness was pulled toward the attractiveness level of facial images encountered up to 6 s earlier [88]. By giving only one task to our raters, we avoided multi-task cognitive overload, which could have affected the raters' ability to intentionally form attractiveness impressions and the automaticity of impression formation. An additional bias could have been introduced if raters had simultaneously assessed masculinity and a man's attractiveness for a short- or long-term partnership, as in [89].
The theory of shifting women's preferences for facial masculinity across the menstrual cycle continues to be promoted [23,25,33]. We did not assess the masculinity of participants' faces, but the smaller size of the composite face of our attractive group suggests that women prefer rather more feminized male faces. The theory, however, is based on the premise that there is a significant association between testosterone and masculinity [31-33, 35, 44]. Androgenic influence during puberty likely shapes the "high-testosterone" face, which is purported to be an honest indicator of health and male fitness because testosterone enhances sexual signals but suppresses immune function [39]. A recent meta-analysis found little evidence that testosterone suppresses immune function [42], whereas another meta-analysis identified an opposite effect: a strong suppressive effect of experimental immune activation on testosterone [90].
Surprisingly, reports on the association of testosterone with masculinity are contradictory [54,63,89,91]. The association remains an issue because of unresolved discrepancies between structural and masculinity ratings and methodological shortcomings of studies that used abstracted computer-manipulated images of 'high and low testosterone faces' or saliva testosterone measurements [25,75,89,91,92]. Recent findings do not support preferences for male masculinity traits in either low- or high-conception probability groups of women [34,35,44,93,94]. Pound et al. [54] suggest that raters may attribute 'masculine' ratings to faces they find attractive irrespective of the objective sexual dimorphism, due to stereotypical associations between the term 'masculinity' and attractiveness. Furthermore, perceived masculinity may not correlate with attractiveness [46,95], especially in view of an association of the perception of sexual unfaithfulness with face masculinity [4].
Females' preferences for types of male faces could differ among populations due to ethnic, sociocultural, and human development factors [9,96]. Recent reports showed women's preferences for more feminized faces of Caucasian men [9,10,34,97]. Kocnar et al. reported that feminized male faces were preferred over masculinized faces by women in most European populations, especially in countries with a high human development index [9,10]. Perrett et al. [96] reported preferences for more face feminization of Japanese and Scottish participants compared to Caucasian North Americans faces among the raters, whereas Harris et al. found the opposite pattern [35]. Studies comparing preferences among populations provide contradictory results; hence, differences in women's preferences for masculinity or femininity between populations remain an open topic.
The selection of participants of different ages for a research study on physical attraction to the opposite sex could have an impact on the ratings and on the comparability of results [98]. Two studies of large cross-cultural samples found that males prefer females considerably younger than themselves and females prefer males considerably older than themselves [99,100]. Most studies are based on samples of male students 18-20 years old, who may still be developing their ultimate secondary sexual characteristics and face masculinity. In a study in which the mean age of men was 18, female raters favored faces of men with higher testosterone levels [71,101]. Male teenagers are less likely to be rated as attractive masculine men by women in their late twenties or older [96]. This could be a factor in studies where the mean age of female raters was substantially higher than the mean age of the men whose computer-modified or even natural faces were used as stimuli [25,26]. Likewise, female teenagers may give very different rates when assessing masculinity in teenagers' male faces compared to women over 30 years old [89,101]. Individual variation in evaluations of trustworthiness, dominance, and attractiveness is largely shaped by people's personal experiences and the rapidity of their sexual development [98,102]. Thus, including teenage participants may raise doubts as to whether all participants of a study are biologically and psychologically suitable to provide generalizable assessments, especially when taking into account the age disparity in sexual relationships. In our study, we opted for an age-matched sample.
We used natural, non-edited images of male faces to explore associations between men's facial attractiveness and TT. Such an approach links realistic TT measures in men with the individual rates given by women. Many other studies on the preferences of women across the menstrual cycle in the context of men's perceived attractiveness or masculinity opted to use computer simulations, heavily edited pictures, or constructs of the male face [23,25,34,91,92,101,103]. Computer alterations of face images to obtain a certain degree of masculinity or femininity certainly facilitate face ratings and reduce the cost of the study. At the same time, however, the differences between composite images of faces can be unnaturally large, leading to the belief that certain face images are more appealing to women than raw, non-edited images of male faces. Ratings of images of real faces may present a more appropriate assessment of women's preferences than ratings of unrealistic computer-manipulated visuals.
Several studies that used surveys of females to determine the fertile time window reported contradictory results [25,26,34]. The timing of women's fertile window is often unpredictable. Wilcox et al. showed that in only about 30% of women is the fertile window entirely within the days of the menstrual cycle identified by clinical guidelines, that is, between days 10 and 17 [104]. A study in which daily values of sex hormones in female participants were determined across the cycle showed that ovulation occurred as late as 7 days before menses and as early as day 8 of the cycle [105]. We were therefore fortunate to target the fertile window accurately: the female participants were still pre-ovulatory on day 13 of the cycle, as indicated by the low concentration of serum progesterone and the high concentration of estrogen. Although we did not determine whether the women were fertile, we assumed that healthy, young, regularly menstruating women are fertile.
Our sample of female participants was rather small due to the budget constraints of the project. The sample size is comparable to previous reports, however. In contrast to previous studies, our group of raters was more homogeneous and was studied on the same day of the menstrual cycle to reduce sex hormone-related variability. Also, we used state-of-the-art statistical modeling to further reduce the probability of erroneous inferences. Our preliminary report can help to plan more powerful studies.
In summary, we have demonstrated that young healthy women prefer images of natural faces of young men with lower concentrations of total testosterone in serum on day 13 of the menstrual cycle. The size of the preferred male face tends to be smaller.

Declarations
Author contributions
RS, JK, AU and RC formulated the hypotheses for this study. RS, AU, MO, KK, MM and MR were involved in data collection (blood sample collection and assessment of face attractiveness) and literature search. MT and RS generated and analyzed composite images of male faces. RC, RS, JK, EK and MA performed the statistical analysis of the results. RS and AU secured funding for the project. All authors were closely involved in manuscript preparation, manuscript writing, final review of the literature and the provision of critical feedback.

Competing Interests Statement
There is no conflict of interests among the authors.

Figure 2
The graph shows the inverse relationship between logarithmic values of serum total testosterone concentration in 77 men and the rate of their attractiveness as assessed by 19 women.

Figure 3
The images of a composite face of all 77 men (a), of 39 men scored as more attractive than median (b), and of 38 men scored as less attractive than median (c) by 19 female raters. Note that, compared to the attractive one, the less attractive composite face is slightly bigger and was scored on average 1.93 points lower. The men in the attractive group had lower testosterone concentrations than those in the less attractive group.

Figure 4
Drawings of extracted facial landmarks such as eyes, eyebrows, nose, mouth, and jawline from a composite image of faces of all 77 men (a), of 39 men scored as more attractive than median (b), and of 38 men scored as less attractive than median (c) by 19 female raters. Note that, compared to the attractive one, the less attractive composite face is wider by 1.32% at the level of the eyes, 1.76% at the level of the lower ridge of the nose, and 2.30% at the level of the mouth, and shorter by about 0.08% in the midline. The average composite image of 77 men has all the measures set at 100%.