Metabolomics of Dietary Intake of Total, Animal, and Plant Protein: Results from the Atherosclerosis Risk in Communities (ARIC) Study

Background: Dietary consumption has traditionally been studied through food intake questionnaires. Metabolomics can be used to identify blood markers of dietary protein that may complement existing dietary assessment tools.
Objectives: We aimed to identify associations between 3 dietary protein sources (total protein, animal protein, and plant protein) and serum metabolites using data from the Atherosclerosis Risk in Communities Study.
Methods: Participants’ dietary protein intake was derived from a food frequency questionnaire administered by an interviewer, and fasting serum samples were collected at study visit 1 (1987–1989). Untargeted metabolomic profiling was performed in 2 subgroups (subgroup 1: n = 1842; subgroup 2: n = 2072). Multivariable linear regression models were used to assess associations between 3 dietary protein sources and 360 metabolites, adjusting for demographic factors and other participant characteristics. Analyses were performed separately within each subgroup and meta-analyzed with fixed-effects models.
Results: In this study of 3914 middle-aged adults, the mean (SD) age was 54 (6) y, 60% were women, and 61% were Black. We identified 41 metabolites significantly associated with dietary protein intake. Twenty-six metabolite associations overlapped between total protein and animal protein, such as pyroglutamine, creatine, 3-methylhistidine, and 3-carboxy-4-methyl-5-propyl-2-furanpropanoic acid. Plant protein was uniquely associated with 11 metabolites, such as tryptophan betaine, 4-vinylphenol sulfate, N-δ-acetylornithine, and pipecolate.
Conclusions: The results of 17 of the 41 metabolites (41%) were consistent with those of previous nutritional metabolomic studies and specific protein-rich food items. We discovered 24 metabolites that had not been previously associated with dietary protein intake. These results enhance the validity of candidate markers of dietary protein intake and introduce novel metabolomic markers of dietary protein intake.


Introduction
Dietary intake of protein in United States adults is high and continues to increase over time [1]. Recent research has reported associations between dietary intake of total protein, animal protein, and plant protein and clinical outcomes such as chronic kidney disease, ischemic heart disease, stroke, and all-cause mortality [2][3][4][5]. Although a higher intake of animal protein generally has been associated with a higher risk of these outcomes, a higher intake of plant protein has been associated with a lower risk of incident chronic kidney disease, ischemic heart disease, stroke, and all-cause mortality [2][3][4][5]. Because each of these protein types has distinct relationships with clinical outcomes, interest has developed around improving assessment of dietary intake of specific protein sources.
Abbreviations: ARIC, Atherosclerosis Risk in Communities; CMPF, 3-carboxy-4-methyl-5-propyl-2-furanpropanoate; DHA, docosahexaenoate; FFQ, food frequency questionnaire; GPC, glycerophosphocholine.
Traditionally, dietary intake has been assessed through self-reported methods, such as 24-h recalls and food diaries. These assessment methods are prone to misreporting (e.g., recall bias) and measurement error (e.g., underestimation) [6,7]. For dietary intake of protein, 24-h urine nitrogen is an established biomarker [8]. However, it is burdensome for participants to collect urine over a 24-h period. Thus, new blood biomarkers are needed. One new approach is nutritional metabolomics, which studies small molecules in biofluids in relation to dietary intake [9].
To date, however, nutritional metabolomics has been sparsely applied to identify biomarkers of different sources of dietary protein (e.g., animal protein) [10][11][12]. Of the few studies that exist, only 1 study has identified plasma metabolites associated with plant protein intake [11]. More work is needed to identify biomarkers of dietary intake of total protein, animal protein, and plant protein. Broader metabolomic platforms need to be leveraged across larger samples to maximize the potential for biomarker discovery of dietary protein.
The overarching goal of this study was to improve dietary assessment with the discovery of objective biomarkers of dietary protein through the use of serum metabolomic profiling in 2 distinct samples of middle-aged adults. We hypothesized that, compared with animal protein, plant protein would have a distinct metabolomic profile, given that dietary sources of plant protein have been previously reported to have unique metabolic signatures with fewer essential amino acids [13].

Study population and design
The Atherosclerosis Risk in Communities (ARIC) study is a prospective cohort study designed to investigate the causes and clinical outcomes of atherosclerosis. Beginning in 1987, the ARIC study enrolled 15,792 individuals from four United States communities (Forsyth County, NC; Jackson, MS; Minneapolis suburbs, MN; and Washington County, MD). A detailed description of the ARIC study design and methods has been previously published [14]. Institutional review boards (IRBs) at each field center approved the ARIC study, and participants gave written informed consent at each study visit. This study received IRB approval from Johns Hopkins Bloomberg School of Public Health (IRB00012998; IRB00009957; IRB00011012). Procedures complied with the tenets of the Declaration of Helsinki.
Fasting serum samples were collected at visit 1 (1987–1989) and stored at −80 °C until metabolomic profiling could be performed. Profiling was performed in 2 analytic subgroups in 2010 (subgroup 1) and 2014 (subgroup 2). Subgroup 1 included 1977 randomly selected Black participants from the Jackson, Mississippi field center, whereas subgroup 2 comprised a nonoverlapping set of 2055 participants from all 4 field centers. For this analysis, only participants with available metabolomic data were assessed for eligibility (n = 4032). Of the 4032 participants in subgroups 1 and 2, we excluded participants with missing values for covariates: BMI (n = 4), total energy intake (n = 59), smoking status (n = 5), physical activity (n = 15), education status (n = 5), alcohol consumption (n = 30), and specific dietary factors (i.e., total fruit, whole grains, and refined grains) (n = 0) (Supplemental Figure 1). We also examined 2 additional exclusions for participants missing protein intake information or unrealistic energy intake (defined as <600 kcal or >4200 kcal for men and <500 kcal or >3600 kcal for women), but these criteria did not result in any participants being excluded. Finally, 3914 participants were included in this analysis.
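The exclusion counts reported above can be checked with a short calculation. The sketch below is illustrative only; the names `N_PROFILED`, `EXCLUSIONS`, and `analytic_sample_size` are hypothetical, and it assumes that no participant is counted under more than one exclusion criterion.

```python
# Participants with metabolomic data before exclusions (from the text).
N_PROFILED = 4032

# Participants excluded for each missing covariate (from the text).
EXCLUSIONS = {
    "BMI": 4,
    "total energy intake": 59,
    "smoking status": 5,
    "physical activity": 15,
    "education status": 5,
    "alcohol consumption": 30,
    "specific dietary factors": 0,
}


def analytic_sample_size(n_profiled, exclusions):
    """Subtract the exclusions, assuming no participant is counted twice."""
    return n_profiled - sum(exclusions.values())


print(analytic_sample_size(N_PROFILED, EXCLUSIONS))  # 3914
```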

Assessment of dietary protein intake
Dietary protein intake was assessed at visit 1 using a semiquantitative, 66-item food frequency questionnaire (FFQ) adapted from the Willett questionnaire [15]. The 66-item FFQ has been previously found to have high reproducibility and validity in a subset of 418 ARIC study participants [16]. To ascertain dietary intake, interviewers asked participants to report how often they consumed each food item of a specific serving size on average during the last year. Participants were provided with 9 frequency options, ranging from almost never to >6 servings/d. Protein-related FFQ items were used to create 3 exposure categories: total protein, animal protein, and plant protein (in grams). Similar to previous analyses in the ARIC study, plant protein was defined as the difference between total protein and animal protein [17,18].

Assessment of metabolites
Fasting serum samples from visit 1 were analyzed by Metabolon, Inc., using an untargeted, gas chromatography/mass spectrometry and liquid chromatography/mass spectrometry-based protocol [19,20]. Metabolites were identified using a 2-tiered verification system. Tier 1 metabolites were compared with known reference standards and shared 2 orthogonal measurements with the standard. Tier 2 metabolites did not have a known reference available, but they were identified based on physicochemical properties or spectral similarities. Metabolites in tier 2 are represented with an asterisk in tables and figures.
Metabolomic assessment and data cleaning were performed using the same methodology as previous ARIC metabolomic studies [21,22]. Metabolites were rescaled in each sample to a median of 1 and log2-transformed. Metabolites were excluded if log-scale variance was low (i.e., <0.01), and values were capped at 5 SDs above the mean. After participant exclusion criteria were applied, metabolites were excluded if more than 80% of the analytic sample had missing values. For the primary analysis, metabolites not available in both samples (n = 367) were excluded. A total of 360 known metabolites were included in the primary analysis. All known nondrug metabolites with missing values were imputed to the minimum value of each metabolite. Secondary analyses of metabolites that met inclusion criteria in only one subgroup of participants included 2 metabolites (4-acetaminophen sulfate and hydrochlorothiazide) in subgroup 1 and 365 metabolites in subgroup 2. The discrepancy derives from improved metabolite identification that occurred by the time serum metabolites were measured in subgroup 2.
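The cleaning steps described above can be sketched for a single metabolite as follows. This is a minimal illustration, not the ARIC pipeline: the function name `clean_metabolite`, the input format (a list of raw values with `None` marking missing measurements), and the order in which the steps are applied are all assumptions.

```python
import math
import statistics


def clean_metabolite(raw, max_missing=0.80, min_log_var=0.01, cap_sd=5.0):
    """Return cleaned log2 values for one metabolite, or None if it is excluded.

    `raw` is a list of positive measurements, with None marking missing values.
    """
    observed = [v for v in raw if v is not None]
    # Exclude metabolites missing in more than 80% of participants.
    if not observed or 1 - len(observed) / len(raw) > max_missing:
        return None
    # Rescale to a median of 1, then log2-transform.
    median = statistics.median(observed)
    logged = [math.log2(v / median) for v in observed]
    # Exclude metabolites with low log-scale variance.
    if len(logged) < 2 or statistics.variance(logged) < min_log_var:
        return None
    # Upper cap at 5 SDs above the mean of the log-scale values.
    cap = statistics.mean(logged) + cap_sd * statistics.stdev(logged)
    floor = min(logged)
    cleaned = []
    for v in raw:
        if v is None:
            # Impute missing values to the metabolite's minimum observed value.
            cleaned.append(floor)
        else:
            cleaned.append(min(math.log2(v / median), cap))
    return cleaned
```

In this sketch the 80% missingness filter and the variance filter return `None` for an excluded metabolite, so a caller can drop it before regression modeling.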

Assessment of covariates
At visit 1, demographic and background characteristics were collected from interviewer-administered questionnaires. Age, total energy intake, physical activity, alcohol consumption, and specific dietary factors (total fruit, whole grains, and refined grains) were modeled as continuous variables. Physical activity was calculated as a score from 1 to 5, reflecting sport during leisure time, which incorporated intensity, time, proportion of year, and frequency, in addition to activity relative to peers and sweat frequency. Alcohol consumption included amount of beer, wine, and hard liquor in grams per week. To identify metabolites specific to total protein, animal protein, and plant protein, we calculated intake of several dietary factors and adjusted for these dietary factors as covariates. Total fruit intake (e.g., apples, pears, orange, peach, apricot, plum, bananas, and grapefruit), whole grains (e.g., grain bread and hot cereal), and refined grains (e.g., pie, donut, biscuit, pastry, cake, cookie, white bread, rice, and cold cereal) were estimated from the FFQ in units of servings per day. Sex, race, study center, cigarette smoking status, and education were modeled as categorical variables.
BMI was calculated from weight measured using a calibrated scale and height. Creatinine concentration was measured in serum at visit 1 with the modified kinetic Jaffe method. Then, the estimated glomerular filtration rate was calculated using the 2021 Chronic Kidney Disease Epidemiology race-free equation based on creatinine concentration [23].
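For reference, the 2021 race-free creatinine equation can be written as a short function. This is a sketch rather than the study's analytic code; the function and parameter names are hypothetical, and serum creatinine is assumed to be expressed in mg/dL.

```python
def egfr_ckd_epi_2021(serum_creatinine_mg_dl, age_years, female):
    """Estimated GFR (mL/min/1.73 m^2), 2021 CKD-EPI creatinine equation."""
    # Sex-specific constants of the published equation.
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    ratio = serum_creatinine_mg_dl / kappa
    egfr = (
        142.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.200
        * 0.9938 ** age_years
    )
    if female:
        egfr *= 1.012
    return egfr
```

The equation takes only creatinine, age, and sex as inputs, which is why the regression models described in this paper can include race as a separate covariate in subgroup 2 without entering it twice.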

Statistical analysis
Descriptive statistics were used to describe the analytic sample overall and according to subgroup, and differences between subgroups were tested using χ² tests for categorical variables and t tests for continuous variables. We used multivariable linear regression models to estimate cross-sectional associations between dietary protein sources and serum metabolites. Our model adjusted for age, sex, BMI, total energy intake, estimated glomerular filtration rate, smoking status, physical activity, education, alcohol consumption, total fruit intake, whole grains, and refined grains. We adjusted for alcohol consumption based on a previous study that demonstrated an effect of alcohol on the metabolome [24]. We adjusted for dietary intake of fruit, whole grains, and refined grains because they represent the other major nonprotein food groups. For subgroup 2, the model included two additional covariates, race and study center, because multiple race groups and study centers were represented in subgroup 2. Estimates were generated within each subgroup and meta-analyzed with a fixed-effects model. Bonferroni correction was used to correct for multiple comparisons. For the primary analysis of metabolites in both samples, our statistical significance threshold was P = 0.05/(360 metabolites × 3 protein sources) = 4.6 × 10⁻⁵. For the secondary analyses, our significance threshold in subgroup 1 was P = 0.05/(2 metabolites measured in sample 1 only × 3 protein sources) = 0.008 and in subgroup 2 was P = 0.05/(365 metabolites measured in sample 2 only × 3 protein sources) = 4.6 × 10⁻⁵. Analyses were conducted using Stata version 17 (StataCorp) and R version 4.1.2.
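The Bonferroni thresholds and the pooling step can be illustrated with a short sketch. The analyses themselves were run in Stata and R; the Python below is not that code. The function names are hypothetical, and the inverse-variance weighting is an assumption, being the standard form of a fixed-effects meta-analysis.

```python
import math


def bonferroni_threshold(n_metabolites, n_exposures=3, alpha=0.05):
    """Significance threshold for n_metabolites x n_exposures tests."""
    return alpha / (n_metabolites * n_exposures)


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted pooled estimate across subgroups.

    Returns (pooled beta, pooled standard error, two-sided P value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


# Primary analysis threshold: 0.05 / (360 x 3), about 4.63e-05.
primary = bonferroni_threshold(360)

# Hypothetical subgroup estimates for one metabolite (not study results).
beta, se, p = fixed_effects_meta(betas=[0.10, 0.20], ses=[0.10, 0.10])
significant = p < primary
```

With equal standard errors, as in the hypothetical example, the pooled estimate is the simple average of the two subgroup coefficients. In practice the subgroup with the smaller standard error receives more weight.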

Results
In the overall sample of 3914 participants, the mean age was 54 y, 60% were women, and 61% were Black (Table 1). One-third of the participants reported some college education. Subgroup 1 consisted exclusively of Black participants, whereas 27% of participants in subgroup 2 were Black (P < 0.001). There were slightly more women in subgroup 1 (64%) than those in subgroup 2 (57%) (P < 0.001). Total protein accounted for 18% of total energy intake, and approximately one-quarter of total protein intake was derived from plant protein sources. The average dietary intake of total, animal, and plant protein was similar across the two subgroups. There were 67 significant protein-metabolite associations in the primary analysis of 360 metabolites in both samples, and 41 unique metabolites were significantly associated with dietary protein intake (Figure 1). Plant protein was uniquely associated with 11 metabolites: tryptophan betaine, 4-vinylphenol sulfate, N-δ-acetylornithine, catechol sulfate, stearoyl sphingomyelin, pipecolate, hippurate, linoleate [18:2n-6 (ω-6)], heptanoate (7:0), myo-inositol, and 2-hydroxyoctanoate (Figure 2). Tryptophan betaine and 4-vinylphenol sulfate were the most strongly associated with plant protein. With the exception of stearoyl sphingomyelin, all these metabolites were positively associated with plant protein. The plant protein-related metabolites represented several metabolic superpathways: amino acids (n = 3), xenobiotics (n = 3), and lipids (n = 5) (Table 2). The most common subpathway represented among the plant protein-associated metabolites was benzoate metabolism (n = 3).
For the secondary analysis of metabolites available in only one subgroup, there were 17 significant protein-metabolite associations (total protein, n = 5; animal protein, n = 4; plant protein, n = 8) (Supplemental Table).

Discussion
In 3914 middle-aged adults, we identified 41 serum metabolites significantly associated with dietary protein intake. Plant protein was significantly associated with seven metabolites (i.e., tryptophan betaine, 4-vinylphenol sulfate, catechol sulfate, hippurate, N-δ-acetylornithine, pipecolate, and linoleate) that have been previously associated with dietary sources of plant protein.
Our metabolomic findings were specific for plant protein and distinct from the animal protein-related metabolites. Tryptophan betaine and 4-vinylphenol sulfate were the metabolites most strongly associated with plant protein. Tryptophan betaine has been detected in legumes and is the precursor to indolylacrylic acid found in lentil seedlings [25][26][27]. Tryptophan betaine and 4-vinylphenol sulfate were associated with peanuts, a rich source of plant protein [28]. Plant protein was also associated with two xenobiotics involved in benzoate metabolism, catechol sulfate and hippurate. Benzoic acid is naturally present in fruits and fermented products, and benzoate is an antimicrobial additive used in fruits and vegetables [29]. These two xenobiotics were inversely related to dietary acid load in a previous ARIC analysis [30]. Diets high in base-producing foods (e.g., fruits and vegetables) result in lower dietary acid load. Thus, our plant protein-metabolite findings are consistent with knowledge on food metabolism and previous metabolomic findings.
Plant protein was associated with two additional amino acids, N-δ-acetylornithine and pipecolate, and five lipids. N-δ-acetylornithine was quantified in oyster mushrooms, and pipecolate was identified in beans [31,32]. After dry bean consumption, serum levels of pipecolate were elevated in human and animal studies [33]. One of the lipids, linoleate, has been identified in canola and sunflower oils [34,35]. The other 4 lipids associated with plant protein (heptanoate, myo-inositol, 2-hydroxyoctanoate, and stearoyl sphingomyelin) have not been previously linked to dietary sources. Altogether, seven plant protein-related metabolites were consistent with previous studies, and four lipid associations were novel findings, which may serve as an impetus to better characterize the effect of dietary intake of plant protein on lipid metabolism.
Animal and total protein shared 26 of the 30 metabolite associations, which was unsurprising given that nearly three-quarters of total dietary protein intake came from animal protein sources, which is typical of United States diets [36]. In this study, pyroglutamine and creatine were associated with total protein and animal protein. These two metabolites have been previously associated with meat consumption [37]. In addition, creatine concentration was associated with animal protein intake in PREDIMED and with an animal protein diet pattern in the MASALA study [11,12]. The MASALA study also identified an association between the animal protein diet pattern and a lysophospholipid, 1-docosahexaenoyl-GPC (22:6), which was positively associated with animal protein in our study. We also found a significant association between animal protein and 3-methylhistidine. 3-Methylhistidine differentiated between animal protein and soy protein in a feeding trial [10]. Finally, we identified three metabolites (β-hydroxyisovaleroylcarnitine, tigyl carnitine, and 3-hydroxyisobutyrate) from the leucine, isoleucine, and valine metabolism subpathway, which were associated with animal protein. These essential amino acids are detected in high quantities in animal protein sources [38]. β-Hydroxyisovaleroylcarnitine was identified as one of the top differentiating metabolites between animal protein-based and soy protein-based diets in a feeding trial [10].
FIGURE 2. The associations of serum metabolites with intake of plant protein in the Atherosclerosis Risk in Communities (ARIC) study. Linear regression models adjusted for age, sex, race (in subgroup 2), study center (in subgroup 2), BMI, total energy intake, estimated glomerular filtration rate based on creatinine, smoking status, physical activity, education, alcohol consumption, total fruit intake, whole grains intake, and refined grains intake. The red-dashed horizontal line represents the statistical significance threshold after accounting for multiple comparisons using the Bonferroni method [y = −ln(0.05/(360 metabolites shared across both subgroups × 3 protein sources)) = 9.98]. The red-dashed vertical line represents the null value of β = 0. Associations were meta-analyzed across the two subgroups using fixed-effects regression models. ARIC, Atherosclerosis Risk in Communities; SE, standard error.
1 Linear regression models adjusted for age, sex, race (in subgroup 2), study center (in subgroup 2), BMI, total energy intake, estimated glomerular filtration rate based on creatinine concentration, smoking status, physical activity, education, alcohol consumption, total fruit intake, whole grains intake, and refined grains intake. Bonferroni-adjusted P value = 0.05/(360 metabolites × 3 protein sources) = 4.63 × 10⁻⁵. Associations were meta-analyzed across the two subgroups using fixed-effects regression models.
Our study had several limitations. Our data inherited biases from self-reported dietary information (e.g., recall bias, social desirability bias, and portion size misestimation), although the questionnaire had high reproducibility [16]. There is a need for additional biomarker discovery research as an alternative approach to dietary assessment through self-reporting. This study lays the groundwork for future feeding trials to validate our findings. Due to the observational study design, we cannot rule out the possibility of residual confounding, although it was minimized by having trained interviewers administer standardized questionnaires and by including multiple covariates in multivariable regression analyses. Biospecimens were stored for over two decades before metabolomic profiling. Degradation of compounds would be nondifferential by the level of dietary protein, thereby resulting in the attenuation of estimates. The stability of compounds in long-term stored specimens has been demonstrated by moderate correlation (Pearson correlation coefficients ≥0.65) between urea, glucose, and creatinine measured using standard clinical measures in 1989 compared with the levels of these compounds quantified using metabolomic profiling [44]. Metabolomic profiling was conducted in two subgroups at two time points. There was a high correlation (median Pearson correlation coefficient = 0.71) for 285 metabolites measured in 97 participants in 2010 and 2014, and our analyses were conducted separately within each subgroup and meta-analyzed [45]. Metabolites were measured in blood collected at one point in time. More work is needed to understand how the food metabolome changes over time.
This study has several strengths. We analyzed metabolomic data from a large, biracial cohort study that was geographically diverse. To our knowledge, this is the largest untargeted metabolomic study to report metabolites associated with dietary intake of plant protein [11] and one of the first studies to report metabolites associated with dietary intake of animal protein [10][11][12]. Our untargeted approach was advantageous in that it provided a comprehensive profile of the metabolome, allowing us not only to confirm previously observed findings but also to discover new markers of dietary protein intake. The metabolomic platform provided coverage of food-derived compounds labeled as xenobiotics, which enriched our findings as three xenobiotics (4-vinylphenol sulfate, catechol sulfate, and hippurate) were significantly associated with plant protein. We studied 2 demographically distinct subgroups from the ARIC study, which allowed us to robustly assess the replicability of our findings.
In conclusion, we discovered 41 serum metabolites significantly associated with dietary protein in 3914 Black and White men and women. Seventeen of the 41 (41%) significant metabolites were consistent with prior metabolomic results. Thus, these metabolites are candidate markers of dietary protein intake. We also identified 24 new biomarkers of dietary protein, such as lipids related to plant protein intake. With external validation, these metabolites may eventually be used for objectively assessing dietary protein intake.

Acknowledgments
We thank the staff and participants of the Atherosclerosis Risk in Communities (ARIC) study for their important contributions.
The authors' responsibilities were as follows. LB: drafted the first version of the manuscript, statistical plan, data interpretation; JC: performed the statistical analysis; HK, LMS: interpreted the data and provided feedback on the final manuscript; KEW: generated the data; BY, EB: performed data acquisition; CMR: conceived and designed the study, interpreted the data, drafted and revised the manuscript, and supervised the study; and all authors: read and approved the final version of the manuscript.

Data Availability
The data described in the manuscript, code book, and analytic code will be made available on request pending application to the National Heart, Lung, and Blood Institute (NHLBI) Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) or to the Atherosclerosis Risk in Communities (ARIC) Study Publications Committee.

Funding
This research project was funded by the National Institute of Diabetes and Digestive and Kidney Diseases (R03 DK128386). The Atherosclerosis Risk in Communities Study is performed as a collaborative study supported by the National Heart, Lung, and Blood Institute contracts (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I, and HHSN268201700005I). Metabolomics measurements were sponsored by the National Human Genome Research Institute (3U01HG004402-02S1). Additional support was provided by the National Heart, Lung, and Blood Institute (R01 HL153178, to PI: CMR).
FIGURE 3. The associations of serum metabolites with intake of animal protein in the Atherosclerosis Risk in Communities (ARIC) study. Linear regression models adjusted for age, sex, race (in subgroup 2), study center (in subgroup 2), BMI, total energy intake, estimated glomerular filtration rate based on creatinine, smoking status, physical activity, education, alcohol consumption, total fruit intake, whole grains intake, and refined grains intake. The red-dashed horizontal line represents the statistical significance threshold after accounting for multiple comparisons using the Bonferroni method [y = −ln(0.05/(360 metabolites shared across both subgroups × 3 protein sources)) = 9.98]. The red-dashed vertical line represents the null value of β = 0. Associations were meta-analyzed across the two subgroups using fixed-effects regression models.