Differences in the gut microbiome of young adults with schizophrenia spectrum disorder: using machine learning to distinguish cases from controls

While an association between the gut microbiome and schizophrenia spectrum disorders (SSD) has been suggested


Introduction
Schizophrenia spectrum disorders (SSD) are severe mental disorders characterized by loss of contact with reality and cognitive impairment (McCutcheon et al., 2020; van Rooijen et al., 2018). The onset of SSD is often in early adulthood, with an estimated lifetime prevalence of 7.5 per 1000 (Moreno-Küstner et al., 2018). Schizophrenia (SCZ) is one of the most common diagnoses within SSD, but the SCZ diagnosis is rarely made within the first years of symptoms, i.e., the early phase of SSD.
Individuals with a diagnosis of SCZ have, on average, a life expectancy shortened by 20 years, mainly due to cardiovascular comorbidities, and around 20% of the individuals experiencing an acute psychosis recover during their lifetime (Nielsen et al., 2021; Taylor & Jauhar, 2019). Low-grade inflammation (Fraguas et al., 2019; Lavebratt et al., 2021) and cardiometabolic and cardiovascular comorbidities (Barcones et al., 2018; Carney et al., 2016) have been observed already at early stages of SSD. Antipsychotic medications, acting on dopaminergic and/or serotonergic systems, are commonly associated with obesity and cardiovascular risk (McCutcheon et al., 2020; Rognoni et al., 2021). While several factors, such as genetic variants, environmental factors, altered immune activity and atypical neurodevelopment, have been associated with the development of SSD, the exact mechanisms that contribute to its onset and progression remain unclear (McCutcheon et al., 2020).
Emerging research has suggested a link between the gut microbiome and several psychiatric disorders, including SSD (Borkent et al., 2022; Clemente et al., 2012; Nguyen et al., 2021; Zhu et al., 2020b). The gut microbiome refers to the complex community of microbes colonizing the gastrointestinal tract, and it plays a role in the regulation of various physiological processes, such as immune function, nutrition, gut barrier integrity and metabolic regulation (Carroll et al., 2009; Cryan et al., 2019). Studies have observed increased gastrointestinal inflammation and gut permeability in SSD, possibly leading to a dysfunctional translocation of commensal microbiota (Severance et al., 2015). Rodent studies, aiming to examine causal mechanisms by transplanting feces from SCZ patients into mice, observed changes in both behavior and brain biochemistry relevant for SCZ (Zheng et al., 2019; Zhu et al., 2020a). Both pre-clinical and clinical studies have suggested an impact of antipsychotic medication on the gut microbiome, and an in vitro study demonstrated that antipsychotic drugs had anti-commensal activity on gut bacterial strains (Maier et al., 2018; Singh et al., 2022; Yuan et al., 2021). Moreover, it has been reported that physical activity can impact the gut microbiome in adults (Tzemah Shahar et al., 2020).
According to a systematic review of the fecal microbiome in SCZ, including 15 studies, there was a difference in bacterial composition (β-diversity) between SCZ patients and healthy controls in 79% of the studies (McGuinness et al., 2022). When investigating the underlying bacterial genera, findings were conflicting. Nevertheless, a third of the studies observed higher abundance of Prevotella and lower abundance of Haemophilus in SCZ. Regarding functionality of the microbiome, the findings are inconclusive, because few studies exist and they used different methods to generate the functional data (bacterial genes) (McGuinness et al., 2022). Only one study used metagenomic shotgun sequencing, a method which enables a comprehensive insight into the bacterial genes and hence a deeper understanding of the functional profile of the microbiome (Weinstock, 2012). Studies have shown that shotgun sequencing, compared to the more traditional 16S rRNA sequencing, gives a broader and deeper insight into the bacterial community, because it can identify more biologically relevant low-abundance genera as well as the individual genes within the bacterial genomes (Campanaro et al., 2018; Durazzi et al., 2021). In addition, characterizing the functional profile of the microbiome can lead to a more accurate and biologically relevant understanding of its potential influence on host physiology (Gibbons, 2017).
Importantly, few studies have been conducted on the microbiome in the early stages of SSD, with only one study using shotgun sequencing (Zhu et al., 2020b); see the review by Tsamakis et al. (2022). Interestingly, Zhu et al. found that bacterial species commonly present in the oral cavity were more abundant in SSD patients, and transplantation of an SSD-enriched Streptococcus into mice induced deficits in social behaviors and altered neurotransmitter levels in peripheral tissues. The patient group studied by Zhu et al. was from China and included both first-episode patients and acutely relapsed patients, with 10% being younger than 18 years. Studies of the early stage of SSD have the potential to provide markers of the disorder that are less influenced by long-term exposure to various medications, a less healthy lifestyle and metabolic comorbidity. Understanding the pathophysiology and signs of the disorder in the early stage of SSD is important, as early effective treatment can shape the long-term prognosis.
To this end, we conducted a study investigating the gut microbiome in a more homogeneous group, young adults with early SSD in Sweden, using fecal shallow shotgun metagenomic sequencing. We hypothesized that the fecal bacterial microbiome in patients with early SSD would differ compared to healthy controls. Further, two exploratory studies were conducted with the hypotheses that the fecal bacterial microbiome in early SSD would (i) differ between those on antipsychotic medication and those not, and (ii) be affected by physical exercise. Our analysis encompassed bacterial taxonomy and bacterial gene data from 52 young SSD patients and 52 controls, the vast majority being non-obese. We focused on baseline microbial diversity, evaluated the microbiome's diagnostic potential using classification models, identified distinctive bacterial features between the two groups, and explored the impact of antipsychotic medication on the microbiome. We also performed an exploratory analysis of the effect of physical exercise using follow-up samples from 24 of these patients who had participated in weekly group exercise for 12 weeks.

Participants
The patient cohort has been previously described (Forsell et al., 2015; Lavebratt et al., 2021). In brief, the patient cohort is part of a study that was designed in 2015 to investigate the effects of physical exercise on behavioral and biological outcomes in young adults with SSD in Stockholm, Sweden. Individuals with a diagnosis of SSD (F20-F29, according to ICD-10) who were receiving specialized care for a first documented psychosis episode within the previous five years were considered eligible. The study received ethical approval from the Regional Ethics Review Board in Stockholm (number 2015/808-31/2) and was prospectively registered in the German Clinical Trial Register (DRKS00008991). Ninety-one young SSD patients were enrolled in the study and provided baseline data. Of those, 52 patients provided fecal samples at baseline, and 24 patients who had participated in physical exercise also provided fecal samples at follow-up 12 weeks later (number of attended sessions, median (IQR) = 19 (5.25-36.50)); these are the samples that passed the quality control of the shotgun sequencing data. For the 52 baseline participants the median age was 30 years (IQR = 24-38), with a majority being male (57.7%). Table S1 shows characteristics of patients with fecal samples and patients with no fecal samples. In addition to fecal samples, metabolic blood markers including HDL (high-density lipoprotein), LDL (low-density lipoprotein) and cholesterol were analyzed from young SSD patients. The participants, with the help of nurses, also provided information about their body mass index (BMI), blood pressure, antibiotic and psychotropic drug usage, year of first psychiatric care contact, number of forced hospitalizations, alcohol use and illicit drug use, and responded to the Camberwell assessment of need & support questionnaire and a food frequency questionnaire, see Fig. 1A. Based on the food frequency questionnaire, a principal component analysis (PCA) was performed on the 9 different food groups. The first three principal component axes (PC1, PC2 & PC3) were then used as diet intake variables in statistical analyses.
The 52 adult controls included in this study were originally collected as controls for a study of the gut microbiome in ADHD patients (Skott et al., 2020; Stiernborg et al., 2023). The median age of the controls was 40 years (IQR = 32.8-43.2), with 40.4% males. More extensive information about the participants, questionnaires and metabolic data can be found in the supplementary methods S2.1, S2.2 and S2.3.

Fecal collection, DNA extraction and sequencing
The individuals self-collected their fecal samples at home using the OMNIgene gut kit (DNAGenotek, Ottawa, Canada). This kit is specifically designed to stabilize microbial DNA in fecal samples. Once collected, the samples were handed over to research nurses within a week and stored at -80 °C until further analyses, including one freeze-thaw cycle. If fecal samples had clumps > 1 cm in diameter after being vortexed vigorously for 45-60 s, a total of 400 μl of OM-LQR/P-190 liquefaction (DNAGenotek, Ottawa, Canada) was added to the tube. Diversigen Inc. (Minneapolis, USA) conducted the DNA extraction, DNA sequencing, and taxonomic/functional annotation for this study using their Boostershot® shallow shotgun metagenomic sequencing service. The DNA extraction was performed using the PowerSoil Pro extraction kit (Qiagen, Germany), and the sequencing was carried out on a NextSeq 1x150 flow cell (Illumina, USA). The samples were sequenced in two separate rounds (1st round: June 2019 with 81 young SSD patient samples; 2nd round: December 2019 with 52 control samples). The sequencing depth was greater in the 2nd round than in the 1st round (Wilcoxon p = 2.2e-16; median (IQR) 1st round = 2.5 M (2.1 M-2.9 M), median (IQR) 2nd round = 4.1 M (3.7 M-5.0 M)). Thus, in all statistical analyses, we adjusted for the sample library size (total count). More specifically, 'species total count' was used as the adjustment in linear models on taxonomic data, and 'KOs total count' in linear models on functional data. Two samples from the 1st round were also sequenced in the 2nd round; each clustered with its own replicate on a Bray-Curtis PCoA plot explaining 99.7% of the variance, suggesting that the two rounds gave similar sequencing results.

Figure 1. A) The experimental design. In total, 104 individuals were included in the analyses: young adults with SSD, n=52, and adult healthy controls, n=52. Shallow shotgun sequencing was used to sequence the fecal samples, resulting in taxonomic (bacterial species) and functional (bacterial genes encoding KEGG modules, GBMs and KEGG enzymes) data. In addition, psychiatric and diet-scale scores were collected. In addition to the 52 baseline samples, 24 patients participated in a 12-week intervention with weekly group exercise sessions; these analyses can be found in the supplementary results S3.6. B) α-diversities: taxonomic species Shannon diversity, number of KEGG modules, GBMs Shannon diversity and number of KEGG enzymes in young SSD patients and controls. The y-axes represent the unadjusted levels. Differences between young SSD patients and controls were tested using linear regression adjusted for age, sex, BMI and read depth. C) β-diversities: taxonomic species measured by Aitchison distance, functional modules measured by Jaccard distance, functional GBMs measured by Aitchison distance and functional enzymes measured by Jaccard distance. Differences were tested using Adonis2 adjusted for age, sex, BMI and read depth. D) Machine learning models using SVM, RF, Ranger RF and PLR for case vs control classification based on rarefied clr-transformed taxonomic and functional data. The heatmap shows the accuracy on the test-set (n=15 young SSD, n=15 controls) from the models built on the training-set (n=37 young SSD, n=37 controls). All the models were built with leave-one-out cross-validation. E) AUROC of the test-set with the best performing model for each feature category. The performance was based on the accuracy score. Data are presented in box plots with boxes showing the 25th, 50th and 75th percentiles and the whiskers extending to the largest/smallest values not beyond 1.5 * inter-quartile range from the hinge.

Data processing

Taxonomic annotation
DNA sequences were aligned in 2022 to Diversigen's Venti database, a curated database created in 2017 containing all representative genomes in RefSeq for bacteria with additional manually curated strains (in total n=19 840). Alignments were made at 97% identity against all reference genomes. Every input sequence was compared to every reference sequence in the Venti database using fully gapped alignment with BURST (Al-Ghalith & Knights, 2020). Ties were broken by minimizing the overall number of unique Operational Taxonomic Units (OTUs). For taxonomy assignment, each input sequence was assigned the lowest common ancestor that was consistent across at least 80% of all reference sequences tied for best hit. OTUs accounting for less than one millionth of all strain-level markers and those with less than 0.01% of their unique genome regions covered (and <0.1% of the whole genome) at the species level were discarded. The number of counts for each OTU was normalized to the OTU's genome length.

Functional annotation
After alignment to the Venti database, Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthologs (KOs) were observed directly using alignment at 97% identity against the genes in the Venti database using the KEGG database (https://www.genome.jp/kegg/ko.html). Ortholog and paralog groups based on sequence similarity scores were grouped into KOs, linking these genes to high-level functions (Kanehisa, 2017; Kanehisa et al., 2016). In this study, KOs that represent functional enzyme orthologs are used and referred to as 'bacterial genes encoding enzymes', 'enzyme-encoding bacterial genes' or just 'enzymes'. Furthermore, KOs were collapsed into KEGG modules, which in this study are termed 'bacterial KEGG modules' or just 'modules'. Gut-brain modules (GBMs) are manually curated modules, each representing a single neuroactive compound production or degradation process (Darzi et al., 2016; Valles-Colomer et al., 2019). In our study, GBMs were generated from KOs using the R-package omixer-rpmR.

Pre-processing of annotated data
Prior to analysis, the bacterial and functional count tables were filtered. Species, enzymes, KEGG modules and GBMs that were (i) present in less than 10% of samples in either controls or young SSD patients, or (ii) had a median count below 7 (the median was computed over the samples in which the feature was present), were removed from the count tables. Additionally, three samples were excluded from the analyses: one baseline sample due to inadequate sequencing depth (<22 000), and two follow-up samples due to having no corresponding baseline samples. This left 52 baseline SSD samples, 52 control samples and 24 follow-up SSD samples for final analysis.
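The two filtering criteria can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function name `filter_features`, the data layout and the toy counts are hypothetical, and the sketch assumes that a feature is dropped when its prevalence falls below 10% in at least one of the two groups.

```python
from statistics import median

def filter_features(counts, groups, min_prevalence=0.10, min_median=7):
    """Keep features that pass both filters.

    counts: dict mapping feature name -> list of per-sample counts
    groups: list of group labels, one per sample (same order as the counts)
    Assumption: a feature is removed if its prevalence is below
    min_prevalence in ANY group, or if the median of its non-zero counts
    is below min_median.
    """
    kept = {}
    for feature, values in counts.items():
        nonzero = [v for v in values if v > 0]
        # Median over the samples in which the feature is present
        if not nonzero or median(nonzero) < min_median:
            continue
        passes = True
        for label in set(groups):
            in_group = [v for v, g in zip(values, groups) if g == label]
            prevalence = sum(v > 0 for v in in_group) / len(in_group)
            if prevalence < min_prevalence:
                passes = False
                break
        if passes:
            kept[feature] = values
    return kept

# Toy example with hypothetical counts (4 patients, 4 controls)
groups = ["SSD"] * 4 + ["CTRL"] * 4
counts = {
    "sp_A": [10, 12, 0, 9, 8, 0, 11, 15],  # prevalent and abundant
    "sp_B": [0, 0, 0, 0, 20, 30, 0, 0],    # absent in one group
    "sp_C": [1, 2, 0, 3, 2, 1, 0, 4],      # median of non-zero counts is 2
}
filtered = filter_features(counts, groups)
```

In this toy table only `sp_A` survives: `sp_B` fails the prevalence criterion in the patient group and `sp_C` fails the median-count criterion.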
The count tables were then rarefied to the lowest sample depth: 472 000 counts for the taxonomic data and, for the functional data, 162 000 counts for KOs, 154 766 for KEGG modules, 2 808 for GBMs and 163 000 for enzymes. All samples had an original sequence depth above the rarefaction depth. The rarefaction and rarefaction curves were generated using Quantitative Insights Into Microbial Ecology (QIIME 2) version 2020.6 (Bolyen et al., 2019). An additional filtering step was performed before machine learning analysis, where features with a rarefied median count below 2 were filtered out from the count tables.
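As a toy illustration of what rarefying to a fixed depth means, the following Python sketch subsamples one sample's counts without replacement. It is an assumption-laden sketch, not the QIIME 2 implementation used in the study; the function name `rarefy`, the seed and the example counts are hypothetical.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a count vector without replacement to exactly `depth` reads."""
    total = sum(counts)
    if total < depth:
        raise ValueError("sample depth is below the rarefaction depth")
    rng = random.Random(seed)
    # One entry per read, labelled with the index of its feature.
    # (Fine for a toy example; memory-hungry for real sequencing depths.)
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for i in rng.sample(reads, depth):
        rarefied[i] += 1
    return rarefied

original = [50, 30, 20, 0]     # hypothetical counts for four features
subsampled = rarefy(original, 40)
```

After rarefaction every sample has the same total count, so differences in sequencing depth between samples no longer drive richness estimates; the cost is that reads above the chosen depth are discarded.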

Statistical methods
Due to the different read depths in the two sequencing rounds, the total count of species/KOs for each individual (library size, microbial load) was used as a covariate in all statistical analyses. Statistical significance was set at α = 0.05 and was corrected, when appropriate, for multiple comparisons using the Benjamini-Hochberg (BH) false discovery rate (FDR). The p-value is given when p<0.1. When analyzing α-diversity and differential abundance, a linear model was applied. If the data were not normally distributed, a linear model with permutation tests from the R-package lmPerm (Wheeler & Torchiano, 2016) was used. All statistical analyses were performed in RStudio version 1.4.1106 using the packages vegan (Oksanen et al., 2008), lmPerm (Wheeler & Torchiano, 2016), qiime2R (Bisanz, 2018), dplyr (Wickham et al., 2023) and ggplot2 (Wickham, 2016).
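For readers unfamiliar with the BH procedure, a minimal Python sketch of the adjustment is given here. The study itself used R; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone: adj(rank) = min over ranks >= rank of p * m / rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.20]   # hypothetical raw p-values
qvals = benjamini_hochberg(pvals)
```

A feature is then called FDR-significant when its adjusted value is below the chosen α (0.05 in this study).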

Diversity analysis
The α-diversity metrics 'observed number of features', Shannon, and Pielou's evenness and the β-diversity distances Jaccard, Bray-Curtis and Aitchison were generated using QIIME 2 2020.6 (Bolyen et al., 2019). To assess differences in β-diversity metrics, Adonis2 was used, which is a non-parametric Permutational Multivariate Analysis of Variance Using Distance Matrices (Anderson, 2001). As previously noted, total count was always adjusted for in these analyses, including the comparison between young SSD patients and controls. To analyze the difference, the Adonis2 function from the R package vegan (Oksanen et al., 2008) was used with adjustments for age, sex, BMI and diet (PC1, PC2 & PC3). The adjustments were marginal, meaning the model assessed the marginal effects of the terms. The number of permutations used in Adonis2 was 999.
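The definitions behind several of these metrics can be written out in a short Python sketch. This is only an illustration on hypothetical count vectors, not the QIIME 2 code; the Shannon index is computed with log base 2 here, which is one common convention, and the Aitchison distance (the Euclidean distance between clr-transformed samples) is left out for brevity.

```python
import math

def observed_features(counts):
    """Number of features with a non-zero count."""
    return sum(c > 0 for c in counts)

def shannon(counts):
    """Shannon diversity, log base 2 (assumed convention for this sketch)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Shannon diversity divided by its maximum, log2(observed features)."""
    s = observed_features(counts)
    return shannon(counts) / math.log2(s) if s > 1 else 0.0

def jaccard_distance(a, b):
    """1 - shared features / features present in either sample."""
    pa = {i for i, c in enumerate(a) if c > 0}
    pb = {i for i, c in enumerate(b) if c > 0}
    union = pa | pb
    return 1 - len(pa & pb) / len(union) if union else 0.0

def bray_curtis(a, b):
    """Sum of absolute count differences over the sum of all counts."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

even_sample = [10, 10, 10, 10]   # hypothetical, perfectly even community
sample_a = [1, 0, 3]
sample_b = [0, 0, 2]
```

Jaccard uses presence/absence only, whereas Bray-Curtis also weighs abundances, which is one reason the two distances can disagree on the same data.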

Predicting diagnosis with microbial features
To evaluate the possibility of classifying samples into diagnostic groups based on the fecal microbiome (controls/young SSD), we used several machine learning classifier algorithms, including support vector machines (SVM), random forest (RF), Ranger random forest (Ranger RF) and penalized logistic regression (PLR), using the R package caret (Classification And REgression Training) (Kuhn, 2008). Using the createDataPartition() function, 70% of the samples were randomly selected as the training-set, with equal proportions from the young SSD group (n=37) and the control group (n=37). The remaining 30% of the samples were designated as the test-set, again with equal proportions from the young SSD group (n=15) and the control group (n=15). We employed leave-one-out cross-validation.
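The partitioning scheme can be sketched in Python as follows. This is not caret's createDataPartition(); the function names are hypothetical, and rounding the per-group training size upwards is an assumption made here because it reproduces the reported 37/15 split for groups of 52.

```python
import math
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train/test, keeping group proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == label]
        rng.shuffle(idx)
        # Assumption: round the training size up within each group
        n_train = math.ceil(train_fraction * len(idx))
        train += idx[:n_train]
        test += idx[n_train:]
    return sorted(train), sorted(test)

def leave_one_out(indices):
    """Yield (training indices, held-out index) for each sample in turn."""
    for held_out in indices:
        yield [i for i in indices if i != held_out], held_out

labels = ["SSD"] * 52 + ["CTRL"] * 52
train_idx, test_idx = stratified_split(labels)
folds = list(leave_one_out(train_idx))
```

With 74 training samples, leave-one-out cross-validation fits each candidate model 74 times, each time on 73 samples, while the 30 test samples are never used for tuning.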
Taxonomic species and functional profiles (KEGG modules, GBMs and KEGG enzymes) were used as predictors in the classification models. Both the taxonomic and functional profiles were rarefied and centered log ratio (clr)-transformed. To avoid taking the logarithm of zero, an offset of 1 was added to all values prior to transformation. In brief, two main types of analyses were performed. First, classification models were built on each feature category (taxonomic species and functional profiles) using all available features. This encompassed 474 species, 279 KEGG modules, 1580 KEGG enzymes, or 33 GBMs. Second, a subset of features was selected for classification using the varImp function from the caret package. This function identified a subset of informative features based on their variable importance. Specifically, 11 species, 16 KEGG modules, 7 GBMs, and 11 KEGG enzymes were selected as the informative subset for classification.
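The clr transformation with an offset of 1 can be written as a short Python sketch; the function name and example counts are hypothetical, and the study itself performed this step in R.

```python
import math

def clr(counts, offset=1):
    """Centered log-ratio transform of one sample's counts.

    Each value becomes log(count + offset) minus the mean of those logs,
    i.e. the log of the value relative to the sample's geometric mean.
    """
    logs = [math.log(c + offset) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

sample = [0, 9, 99]          # hypothetical rarefied counts
transformed = clr(sample)    # approximately [-ln 10, 0, ln 10]
```

The transformed values of a sample always sum to zero, which removes the constant-sum constraint of compositional count data before the values are passed to the classifiers.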
The performance of a model was evaluated by the accuracy (the number of correct predictions divided by the total number of predictions on the test set) and the area under the receiver operating characteristic curve (AUROC). Accuracy ranges from 0 to 1, and the higher the accuracy, the better the performance of the model. The models with the highest pointwise accuracy are bolded in Figs. 1D and 2A. All AUROC and accuracy values can be found in supplementary file 1. Detailed information is found in supplementary methods S2.4.
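Both performance measures can be computed from test-set predictions as in the following Python sketch. The labels and scores are hypothetical, and the AUROC is computed through its rank interpretation (the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half) rather than by integrating a curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auroc(y_true, scores, positive=1):
    """AUROC via pairwise comparison of case and control scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]                 # 1 = case, 0 = control
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]     # hypothetical predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)   # 4 of 6 correct
auc = auroc(y_true, scores)      # 8 of 9 case-control pairs ranked correctly
```

Accuracy depends on the chosen probability threshold, whereas AUROC summarizes the ranking of samples over all thresholds, so the two measures can favor different models.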

Feature selection
The selection of features (here, specific taxonomic species and functional units) was based on the variable importance of each feature, assessed using the varImp function from the caret package (Kuhn, 2008). The feature selection process involved the model-independent filterVarImp for SVM and PLR, and a model-specific varImp method for RF and Ranger RF. The features that appeared in the final list of important features from both the RF-specific varImp and the model-independent filterVarImp were selected. These selected features (i.e., 11 species, 16 KEGG modules, 7 GBMs, and 11 KEGG enzymes) were then used in the second analysis, where the classification of young SSD patients was performed using only the subset of the most important features. More extensive information about the feature selection can be found in supplementary methods S2.5. As the classifier XGB (XGBoost) did not perform well in the models based on all of the features, XGB was not run on selected features, see Supplementary Fig. S1.

Differential abundance
The abundances of selected features used for the classification models were compared between controls and young SSD patients using clr-transformed rarefied values and adjusting for age, sex and BMI. We assessed the difference in abundance between young SSD patients and controls using linear regression. The p-values were corrected for multiple testing by using the BH false discovery rate (FDR). In addition, MaAsLin2 (Mallick et al., 2021) was used as a complementary method to determine multivariable association between microbiome features and clinical phenotypes/exposures. Specifically, we assessed associations of the species with the use of antidepressants, the use of antipsychotics and diet diversity in the young SSD patients, using rarefied count data as input and the investigated variable as fixed effect. More extensive information can be found in supplementary results S3.1.

Sensitivity analyses
A sensitivity analysis was carried out comparing young SSD patients with controls in a subgroup with no significant difference in BMI [patients: n = 38, median (IQR) = 25.3 (22.8-27.6); controls: n = 38, median (IQR) = 23.7 (22.7-26.1)], obtained by removing the patients with BMI above their 75th percentile and the controls with BMI below their 25th percentile (Supplementary methods S2.6).
An additional sensitivity analysis was performed concerning the three young SSD patients who had been on antibiotic medication at any time within the 4 weeks prior to sampling. The case-control differences in diversity remained when these three patients were excluded. Of note, no participant was on antibiotic medication during the study. These analyses can be found in supplementary results S3.2, Table S2 and Fig. S2.

Subgroup analysis
The association between antipsychotic medication and the microbiome was assessed by comparing the diversity of young SSD patients on antipsychotics with that of SSD patients not on antipsychotic medication, adjusted for age, sex, BMI and diet. α-diversity differences were assessed using linear models, and differences in β-diversity metrics were assessed using Adonis2 (Anderson, 2001).
Additional statistical analyses are described in supplementary methods S2.7. In addition, the presence of fecal eukaryotic and archaeal data in young SSD patients can be found in supplementary results S3.3 and supplementary methods S2.8.

Baseline characteristics
In this study, fecal samples were collected and statistically analyzed from 52 young adults with schizophrenia spectrum disorder (SSD) and 52 adult healthy controls. The age of the young SSD patients (median = 30 years) was significantly lower compared to the control group (median = 40 years). Young SSD patients exhibited significantly higher BMI (median = 26.5 kg/m²) compared to controls (median = 23 kg/m²), see Table 1. There was a significant difference in diet intake between young SSD patients and controls, with diet being measured by the first three principal component axes (PC1, PC2 & PC3) from a dietary PCA based on frequency of intake from 9 different food groups (Supplementary results S3.4 and Table 1). All the young SSD patients had a cholesterol ratio value within the clinical reference interval. All the young SSD patients but one had a HbA1c value within the normal range (clinical reference value for HbA1c: 42 mmol/mol; the one patient had a borderline higher HbA1c level of 43 mmol/mol).
Figure 2. Microbial species, enzymes, modules and gut-brain modules (GBMs) differentially abundant in young SSD patients compared to controls. A) Machine learning models using SVM, RF, Ranger RF and PLR for case vs control classification based on the taxonomic and functional features that had the most importance for the overall model (number of selected features: species n = 11, modules n = 16, GBMs n = 7, enzymes n = 11). The heatmap shows the accuracy on the test-set (n=15 young SSD, n=15 controls) from the models built on the training-set (n=37 young SSD, n=37 controls). All the models were run with leave-one-out cross-validation. B) AUROCs of the test-set for the best performing model for each feature category based on the selected features. The performance was based on the accuracy score shown in panel A. The difference in abundance between controls and young SSD patients of the selected features used in the classification models is shown in C) for species, D) KEGG modules, E) GBMs and F) KEGG enzymes as forest plots with estimates and 95% CI. The differences between controls and young SSD participants were assessed using linear regression on clr-transformed rarefied count data adjusting for age, sex, BMI and read depth.

The microbial diversity in fecal samples from young SSD patients differs compared to controls
In this study, we examined the fecal bacterial species diversity in young SSD patients compared to controls. To account for potential confounding factors known to influence the fecal microbiome, we adjusted for age, sex, BMI, and diet intake. We did not adjust for antipsychotic medication in our case-control analyses, as the number of patients not on antipsychotics was small (n=9). However, comparisons based on antipsychotic medication use (yes/no) were performed within the group of young SSD patients and are presented below in 3.5. A summary of the results can be found in supplementary table S3.
When investigating species α-diversity using linear models adjusted for age, sex, and BMI, young SSD patients had a significantly lower observed number of species (p=0.0050), evenness (p=0.0010), and Shannon diversity (p=4.0e-06) compared to the control group (Fig. 1B, leftmost panel). The subset of the cohort with information on diet (i.e., 37 controls and 24 young SSD patients) was further analyzed adjusting for diet in addition to age, sex and BMI. All three metrics remained significantly lower in SSD patients compared to controls (observed species p=0.00020, evenness p=0.0056, Shannon p=0.0020). The fecal bacterial species β-diversity differences between young SSD patients and controls were assessed using Adonis2. When adjusting for age, sex, and BMI, we observed significant differences in the three distance metrics Jaccard (R²=2.2%, p=0.0024), Aitchison (R²=1.9%, p=0.00090), and Bray-Curtis (R²=2.5%, p=0.0043), see Fig. 1C, leftmost panel. In the sub-cohort analysis, considering diet intake in addition to age, sex, and BMI, both the Jaccard and Aitchison metrics remained significant (Jaccard: R²=2.5%, p=0.049; Aitchison: R²=2.2%, p=0.039). Among the young SSD patients (n=52), no clinical characteristic was FDR-significantly associated with either species α- or β-diversity (Supplementary Fig. S2 & S3). The association test results for α-diversity and β-diversity with clinical characteristics in young SSD patients, controls, and young SSD patients combined with controls can be found in the supplementary results S3.5 (Fig. S3 & S4). It is noteworthy that the β-diversity remained significantly different in SSD patients compared to controls when we performed a sensitivity analysis using a sub-cohort with similar BMI and age distributions in cases and controls (Supplementary results S3.6).
We also examined the functional diversity, represented by bacterial genes encoding 1) KEGG modules, 2) gut-brain modules (GBMs), and 3) KEGG enzymes. KEGG modules: Looking at α-diversity using linear models, we found that young SSD patients had a significantly lower observed number of KEGG modules (p<0.00010) compared to controls adjusted for age, sex and BMI (Fig. 1B, second panel from the left). The difference in observed number of KEGG modules remained significant (p<0.00010) after additional adjustment for diet in the sub-cohort (n=37 controls, n=24 young SSD). When examining the β-diversity of KEGG modules, the Adonis2 test revealed a significant difference in the Aitchison distances between young SSD patients and controls adjusted for age, sex and BMI (R²=1.7%, p=0.050; Fig. 1C, second panel from the left). The Aitchison difference did not remain significant in the sub-cohort with diet data when adjusting for diet intake. No significant difference was observed in the other diversity metrics. GBMs: When investigating GBMs, we detected no significant difference in either α- or β-diversity between young SSD patients and controls. KEGG enzymes: Investigating α-diversity, young SSD patients exhibited a significantly lower observed number of KEGG enzymes (p=0.0014) compared to controls adjusted for age, sex, and BMI (Fig. 1B, rightmost panel). The difference in observed number of enzymes remained significant (p<0.00010) after additional adjustment for diet in the subgroup. In terms of β-diversity, there was a significant difference in the Jaccard distance metric between young SSD patients and controls when adjusting for age, sex, and BMI (R²=1.6%, p=0.035), see Fig. 1C, rightmost panel. The Jaccard difference did not remain significant in the sub-cohort (n=37 controls, n=24 young SSD) when adjusting for diet intake (p=0.054). No significant difference was observed between young SSD patients and controls for either Aitchison distance (p=0.062) or Bray-Curtis distance (p=0.087).

The fecal microbiome may classify the young SSD patients and controls
To evaluate the potential of distinguishing SSD patients from controls based solely on data from the fecal microbiome, we applied several classifier algorithms: SVM, RF, Ranger RF and PLR. As predictors, we used either taxonomic species or functional profiles (KEGG modules, GBMs and KEGG enzymes).
We observed that the PLR classifier outperformed other classifier models when utilizing taxonomic species and functional profiles with high data dimensionality (more than 100 features). For the GBMs, which had fewer features (n=33), the random forest (RF) classifier exhibited the highest accuracy, see Fig. 1D. The PLR classifier based on enzymes showed the highest AUROC value (AUROC = 0.88) in differentiating between young SSD patients and controls in the test set, see Fig. 1E.
To explore the discriminatory potential of a subset of features in distinguishing between patients and controls, a second classification analysis was performed.The selected features were chosen based on their importance and robustness, as described in the methods section.
The number of selected features was n=11 for species, n=16 for KEGG modules, n=7 for GBMs and n=11 for KEGG enzymes.
When using only the selected features, most feature categories retained their best accuracy score. The best obtained accuracy score of the GBMs models slightly improved, while that of the KEGG modules models slightly decreased. In general, RF and Ranger RF outperformed the other classifiers when using only the selected features, see Fig. 2A. Similar to the full models, the model based on KEGG enzymes had the highest AUROC. Interestingly, the model based on the selected species performed just as well as the model using all species as input, as shown in Fig. 2B. This suggests that a smaller subset of species features was sufficient to achieve comparable classification accuracy. However, when using the selected subset of KEGG modules, both the accuracy and AUROC of the model were lower compared to the model using the full set of modules, see Fig. 2A & B. This indicates that the selected subset of KEGG modules may not provide enough discriminatory information on its own. Extended results can be found in supplementary results S3.7.

Species and functional profiles differ between young SSD patients and controls
To evaluate the difference in abundance of the selected features in the classification models between young SSD patients and controls, we analyzed the rarefied clr-transformed values, which were adjusted for potential confounders age, sex and BMI.
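The centered log-ratio (clr) transformation referred to above can be sketched as follows. This is a minimal illustration and not the code used in this study; the pseudocount of 0.5 used to handle zero counts is an assumption, as the zero-handling strategy is not stated in this section, and the example counts are hypothetical.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's feature counts.

    Each value is the log of the (pseudocount-shifted) count minus the mean
    of the logs across all features in the sample, so the output sums to zero.
    The pseudocount is an assumed way of avoiding log(0) for absent features.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical counts for four features in one sample.
transformed = clr([120, 0, 35, 7])
```

Because clr values express each feature relative to the geometric mean of the sample, they can be compared between groups with standard linear models, which is what makes covariate adjustment (here age, sex and BMI) straightforward.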
Among the selected bacterial species, seven of the 11 species showed FDR-corrected significant differences between young SSD patients and controls. Of these species, four exhibited higher abundance in young SSD patients, while three displayed lower abundance, see supplementary file 2. Interestingly, of the four species with increased abundance in young SSD patients, two belong to the genus Streptococcus (Streptococcus pasteurianus, Streptococcus parasanguinis), commonly inhabiting the oral cavity, and one belongs to the genus Actinomyces (Actinomyces sp. oral taxon 448), see Fig. 2C. Additionally, Akkermansia muciniphila was more abundant in the control group compared to young SSD patients. Since previous literature has demonstrated an association between obesity and lower abundance of A. muciniphila (Roshanravan et al., 2023), and since SSD patients had higher BMI compared to the control group in our study, we wanted to ensure that the difference remained in a subgroup with no significant difference in BMI [patients: n = 38, median (IQR) = 25.3 (22.8-27.6); controls: n = 38, median (IQR) = 23.7 (22.7-26.1)]. In this subgroup, A. muciniphila was still more abundant in controls compared to patients (p=0.0238; Estimate = 1.047, CI = 0.187-1.906). In addition, A. muciniphila has been suggested to be more sensitive to antipsychotic drugs than other gut bacterial strains (Maier et al., 2018). However, in our cohort we did not find any difference in the abundance of A. muciniphila between patients on antipsychotics and those not on antipsychotics, adjusted for age, sex and BMI (p=0.354), see supplementary Fig. S5.
For bacterial GBMs, four of the seven selected GBMs were FDR-significantly different between controls and young SSD patients adjusted for age, sex and BMI, see Fig. 2E. MGB005 (Tryptophan synthesis) was more abundant in young SSD patients, while MGB047 (Acetate degradation), MGB052 (Butyrate synthesis I) and MGB053 (Butyrate synthesis II) were less abundant in young SSD patients compared to controls.
For KEGG enzymes, all 11 selected enzymes were significantly different between young SSD patients and controls adjusted for age, sex and BMI, see Fig. 2F, with 10 of them being more abundant and one being less abundant in young SSD patients compared to controls. Six of the 11 enzymes were transferases, while there was one hydrolase and one oxidoreductase.
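The FDR correction applied to the per-feature tests above can be illustrated with the Benjamini-Hochberg step-up procedure. Note that this section does not name the specific FDR method used, so Benjamini-Hochberg is an assumption here, and the p-values in the example are hypothetical rather than results from this study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each adjusted value is p * m / rank (rank taken in ascending order of p),
    made monotone by taking the running minimum from the largest p downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four features.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in qvalues]
```

A feature is then called FDR-significant when its adjusted p-value falls below the chosen threshold (0.05 in this sketch).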

Antipsychotic medication is associated with higher microbial α-diversity
Among the young SSD patients, 83% were on antipsychotic medication. There was no FDR-adjusted association between antipsychotic medication and diversity when no patient characteristics were controlled for (Suppl. Fig. S3 & S4). To explore the potential impact of antipsychotic medication on fecal microbial diversity, we compared individuals who were not on antipsychotics (n=9) to those who were on antipsychotic medication (n=43), controlling for age, sex and BMI. No significant difference in β-diversity was observed between the two groups. However, when investigating taxonomical α-diversity, we observed significantly higher species evenness (p=0.023) and Shannon diversity (p=0.024) in the group that used antipsychotic drugs, adjusted for age, sex and BMI, see Fig. 3A. Similarly, when assessing functional profiles, enzymes and GBMs were found to have higher evenness (p=0.012 for enzymes, p=0.012 for GBMs) and Shannon diversity (p=0.032 for enzymes, p=0.022 for GBMs) in those on antipsychotic medication, adjusted for age, sex and BMI, see Fig. 3B-C. There was no difference in diversity regarding antidepressant use (p>0.050).
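For readers less familiar with the α-diversity measures compared above, the sketch below computes Shannon diversity and an evenness index from a vector of counts. The evenness index shown is Pielou's evenness, which is an assumption, since this section does not specify which evenness measure was used; the function names and example counts are hypothetical.

```python
import math


def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over features with count > 0."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S), where S is the observed richness.

    Returns 0.0 when fewer than two features are present, where J is undefined.
    """
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        return 0.0
    return shannon(counts) / math.log(richness)


# Hypothetical samples with equal richness but different evenness.
even_sample = [25, 25, 25, 25]
skewed_sample = [85, 5, 5, 5]
```

Both samples contain four features, but the skewed sample is dominated by one of them and therefore has lower Shannon diversity and lower evenness than the even sample.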

Discussion
This study aimed to investigate the fecal microbiome in young adults with SSD in comparison to that in healthy controls by (i) examining baseline microbial diversity, (ii) assessing the potential of the microbiome data to distinguish between young SSD patients and controls by using classification models, and (iii) identifying features that are differentially abundant between young SSD patients and controls. Secondary aims were to explore (i) the association between microbial diversity and antipsychotic medication, as well as (ii) the bacterial temporal changes in young SSD patients over a three-month exercise intervention; the latter is reported and discussed in Supplementary Information (methods S2.9, results S3.8, S3.9, Table S4 and Fig. S6).

Case-control microbiome diversity
Firstly, we aimed to investigate differences in bacterial species diversity between young SSD patients and controls. The bacterial species α-diversity was lower in terms of both evenness and richness in young SSD patients compared to controls, adjusting for age, sex, BMI and diet intake. When investigating the functional diversity, young SSD patients had a significantly lower number of KEGG modules and enzymes, adjusted for age, sex, BMI and diet intake. The few studies that have investigated the microbiome in an early stage of SSD reported conflicting results regarding α-diversity, with studies showing no difference (Ma et al., 2020; Zhang et al., 2020) or an increase (Zhu et al., 2020b). However, it is difficult to compare our findings to these studies since they were small (n<30) and/or used 16S rRNA sequencing or quantitative real-time PCR on DNA, with one of them not presenting diversity measures (Schwarz et al., 2018) and one only measuring five bacteria (Yuan et al., 2018). Only Zhu et al. had shotgun data and an adequate sample size (n=90); however, their early-stage SSD patients also included patients who had acutely relapsed, and all the patients were free of antipsychotic medication. Moreover, studies investigating the fecal microbiome in chronic SSD have also shown conflicting findings (McGuinness et al., 2022). When adjusting for age, sex and BMI, all three β-diversity metrics were significantly different between young SSD patients and controls, with Jaccard and Aitchison still being significantly different when adjusting for diet intake. Furthermore, KEGG enzyme Jaccard β-diversity and KEGG module Aitchison β-diversity were significantly different between young SSD patients and controls. The difference in β-diversity between young SSD patients and controls was in accordance with the other study using shotgun data in early SSD/relapsed SSD (Zhu et al., 2020b). The difference in β-diversity between young SSD patients and controls remained significant when performing a sensitivity analysis in a sub-cohort with similar BMI distribution in cases and controls. Thus, there was a lower α- and a different β-diversity of bacterial species and genes encoding functional units (modules/GBMs/enzymes) in young SSD patients compared to controls, which was not explained by age, sex, BMI or diet.
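Two of the β-diversity metrics named above, Jaccard and Aitchison, can be sketched for a pair of samples as follows. This is an illustrative implementation under stated assumptions and not the code used in this study: the Jaccard distance is computed on presence/absence, the Aitchison distance is taken as the Euclidean distance between clr-transformed samples, and the pseudocount of 0.5 and the example counts are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(present_a & present_b) / len(union)


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform; the pseudocount is an assumed zero fix."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(a, b, pseudocount=0.5):
    """Euclidean distance between the clr-transformed samples."""
    clr_a = clr(a, pseudocount)
    clr_b = clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr_a, clr_b)))


# Hypothetical counts for three features in two samples.
sample_1 = [1, 0, 3]
sample_2 = [0, 2, 5]
d_jaccard = jaccard_distance(sample_1, sample_2)
d_aitchison = aitchison_distance(sample_1, sample_2)
```

Jaccard captures differences in which features are present, while Aitchison captures differences in their relative abundances, so the two metrics can disagree on the same pair of samples.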

Case-control classification and differentially abundant microbiome features
Using machine learning classifiers based on the taxonomical species data, we were able to distinguish between young SSD patients and controls with an accuracy of 76%. Similarly, the classifier based on KEGG enzymes had a prediction accuracy of 76%. Feature importance scores from our models showed that no single bacterial feature was responsible for the prediction performance, but rather the combination of taxa in relation to each other. This highlights the need for analyzing and viewing the features in combination, rather than individually. We also found that no classifier algorithm was superior for all the different feature categories. This reinforces the importance of using several classifiers to obtain the best possible fit on microbiome data.
The differential abundance of the selected features obtained from the classification models was also assessed. Three species were more abundant in controls compared to young SSD patients, one of them being Akkermansia muciniphila. A. muciniphila is a mucin-degrading bacterium that plays a role in gut health by maintaining the epithelial barrier (Luo et al., 2022). A. muciniphila has been viewed as a symbiont and a promising next-generation probiotic with anti-inflammatory effects. However, there is literature suggesting that A. muciniphila may not always have a beneficial role in the gut (Luo et al., 2022). For example, studies in rodents have shown that an excess of A. muciniphila may lead to exacerbation of intestinal inflammation in certain gut environments, such as during infection (Ganesh et al., 2013; Seregin et al., 2017). An increase in A. muciniphila has also been observed in patients with multiple sclerosis (MS) and Parkinson's disease (Fang et al., 2021). However, elevated A. muciniphila in MS is speculated to be a compensatory beneficial response (Cox et al., 2021). Relevant to our cohort, A. muciniphila has been suggested to be beneficial for metabolic-related disorders such as obesity (Fang et al., 2021), and the fecal abundance of A. muciniphila has been observed to be decreased in obese and diabetic individuals (Roshanravan et al., 2023). Interestingly, two placebo-controlled trials including A. muciniphila have been conducted, showing no adverse effects and an improvement of plasma levels of cholesterol, insulinemia and insulin sensitivity (Depommier et al., 2019; Fanny et al., 2020).
Notably, in our study the difference in abundance of A. muciniphila between SSD patients and controls remained significant when performing a sensitivity analysis in a sub-cohort with similar BMI distribution in cases and controls. In contrast to our findings, A. muciniphila was identified to be significantly enriched in early SSD/relapsed SSD (Zhu et al., 2020b). Interestingly, an in vitro study showed that A. muciniphila was more sensitive to antipsychotic drugs than other gut bacterial strains were (Maier et al., 2018). Nonetheless, in our cohort we did not find any difference in the abundance of A. muciniphila between SSD patients on antipsychotic medication and SSD patients without antipsychotics. We cannot exclude that this negative finding on A. muciniphila and antipsychotic medication is false, as there were only 9 patients not on antipsychotic medication.
Interestingly, we found that three of the four species with higher abundance in young SSD patients were bacteria belonging to the genera Streptococcus and Actinomyces, commonly found in the oral cavity, with at least two of the species being facultative anaerobes. In accordance with our findings, Zhu et al. found a higher abundance of bacteria often present in the oral cavity in early SSD/relapsed SSD (Zhu et al., 2020b). Previous research on the intestinal microbiome has demonstrated a higher relative abundance of bacteria more commonly found in the oral cavity in patients with colorectal cancer (Debelius et al., 2023), IBD (Gevers et al., 2014), and chronic liver disease (Park et al., 2021). The translocation of oral microbiota into the gut microbiota, with barriers being physical distance and chemical hurdles, e.g. gastric and bile acids, remains poorly understood. Kitamoto & Kamada hypothesized that translocation of oral microbes into the gut happens when there is a dysbiosis in both the oral and gut microbiome and when the colonization resistance of the gut is impaired (Kitamoto & Kamada, 2022). Rodent studies show that translocation of oral pathobionts can lead to gut inflammation (Kitamoto et al., 2020). Drugs that lower gastric acidity, such as proton pump inhibitors (PPI), have been shown to increase gut colonization by oral bacteria (Atarashi et al., 2017; Bruno et al., 2019). Unfortunately, there is no information about PPI usage in our cohort, which is a study limitation.
We also found that six of the 11 bacterial KEGG modules with higher abundance in young SSD patients are involved in the biosynthesis of amino acids. The amino acids implicated in these modules were ornithine, histidine, serine, methionine, proline and tryptophan, all with unbranched side chains. These are metabolized not only by the gut microbiome but also by the large intestine and embedded neural cells belonging to the enteric nervous system, and they can also translocate into the circulation. These amino acids are involved in various metabolic pathways, such as the generation of histamine, glycine, cysteine, phospholipids, the methyl donor S-adenosylmethionine (SAM), collagen, serotonin and kynurenine metabolites (van der Wielen et al., 2017). Furthermore, pathways of proline and ornithine synthesis were increased in fecal samples from patients with pulmonary arterial hypertension (Kim et al., 2020). The field of metagenomics is emerging, and fecal-bacteria-generated branched-chain amino acids have been associated with metabolic disorders (Allison et al., 2021); however, much less is known about the role of fecal-bacteria-generated unbranched amino acids, and further research is warranted.
Notably, in both KEGG modules and GBMs, tryptophan biosynthesis was higher in young patients with SSD than in controls. Previous studies have observed an increased kynurenine/tryptophan ratio in blood and the central nervous system (CNS), indicating an ongoing increased tryptophan degradation in the blood and CNS of patients with SCZ (Almulla et al., 2022; Chiappelli et al., 2016). Zhu et al. found a higher fecal GBM tryptophan degradation in patients compared to controls; however, they only evaluated the GBMs from a selected number of species found to be associated with early SSD/relapsed SSD (Zhu et al., 2020b). Furthermore, the abundances of the genes encoding the Butyrate synthesis I and II modules and the Acetate degradation module were observed to be lower in young SSD patients compared to controls. Butyrate and acetate are short-chain fatty acids (SCFAs), small molecules that are important for intestinal cell homeostasis and are produced predominantly by anaerobic gut bacteria during fermentation of dietary fibers. They pass to the circulation, and primarily butyrate has been shown to be immunomodulatory and to affect central processes in the microbiota-gut-brain axis, such as maintenance of the barrier function both in the gut and at the blood-brain barrier, as well as maturation and activation of microglia cells in the brain (Braniste et al., 2014; Caspani & Swann, 2019). Notably, a lower abundance of butyrate-producing gut microbes has been associated with an increased risk of metabolic disorders (Coppola et al., 2021).

Antipsychotic use and microbiome diversity
Previous in vitro and rodent studies have suggested an impact of antipsychotic drugs on the gut microbiome (Davey et al., 2012; Maier et al., 2018; Singh et al., 2022; Skonieczna-Żydecka et al., 2019). Specifically, an in vitro study showed that several antipsychotic drugs have an anti-commensal activity on a number of gut bacterial strains (Maier et al., 2018). Nevertheless, the association between antipsychotic use and the gut microbiota remains poorly understood and inconclusive. In our cohort, only nine patients were not on any antipsychotic medication. Even with such a small group (n=9), a significantly lower evenness was detected in young SSD patients not taking antipsychotics. In our cohort there was no detected difference in β-diversity. Further studies with larger sample sizes are warranted to confirm whether and how antipsychotics influence the gut microbiome.

Limitations
Our study has several additional limitations that need to be acknowledged: (i) The use of different food-frequency questionnaires between controls and young SSD patients. The young SSD patients completed a 9 food-unit questionnaire, while the controls completed a questionnaire including 57 common food units, where data were reformatted to the former 9 food units. Furthermore, self-reported food questionnaires do not fully capture the food intake, see results S3.4 and Fig. S7. Nevertheless, when including diet as a covariable the results were not heavily affected. (ii) In our study only DNA was analyzed. Thus, in this article we are only presenting the potential of the genes to be expressed. The activity levels of, e.g., enzymes still need to be confirmed using meta-transcriptomics. (iii) A major limitation is the difference in age and BMI between the controls and young SSD patients. However, a sensitivity analysis, including 23 controls and 25 young SSD patients from the original cohort, selecting individuals so that age and BMI were similar between patients and controls, confirmed the findings from the total cohort (see supplementary result section). (iv) We did not have reliable information on smoking. (v) The samples of this cohort were sequenced in two different batches, with the same sequencing pipelines and platforms. However, there was an increased sequencing depth in the second batch. This was adjusted for in the statistical analyses. Preferably all samples should be sequenced in the same round, or they should be randomized into rounds. (vi) The nature of this study did not make it feasible to investigate causality. (vii) The case-control analysis was not adjusted for psychotropic medication, as only antipsychotic medication was associated with diversity (and that was only evenness), only a few patients did not take antipsychotics, and none of the controls had any medication. (viii) We cannot exclude that negative findings are false, especially considering long-term effects of the physical exercise, reported in Supplementary Information (methods S2.9, results S3.8, S3.9), because only a smaller patient group (n=24) had both baseline and follow-up fecal samples.

Conclusion
We conducted a shotgun metagenomics study investigating the fecal bacterial microbiome in young patients with SSD, which is one of two studies using fecal shotgun sequencing data in early SSD. Compared to controls, young SSD patients had significantly lower α-diversity and different β-diversity regarding both bacterial and functional data. Taxonomic and functional data classified young SSD individuals with an accuracy of ≥ 70% and with an AUROC of ≥ 0.75. A majority of the species with higher abundance in young SSD patients had their natural habitat in the oral cavity, and many of the modules with higher abundance in young SSD patients were amino acid biosynthesis modules. We provide novel findings of an association between antipsychotic medication and evenness, with lower evenness in patients not taking antipsychotics. Our findings continue to support the presence of gut microbiome alterations in SSD. Additional studies are warranted to replicate our findings and to provide mechanistic insights into the putative functional role of gut microbiome changes in the pathophysiology of SSD. Understanding the pathophysiology in the early stage of SSD is important, as early effective treatment can shape the long-term prognosis.

Figure 1. The fecal microbiome in young SSD patients compared to controls. A) The experimental design. In total, 104 individuals were included in the analyses: young adults with SSD, n=52, and adult healthy controls, n=52. Shallow shotgun sequencing was used to sequence the fecal samples, resulting in taxonomic (bacterial species) and functional (bacterial genes encoding KEGG modules, GBMs and KEGG enzymes) data. In addition, psychiatric and diet-scale scores were collected. In addition to the 52 baseline samples, 24 patients participated in a 12-week intervention with weekly group exercise sessions; these analyses can be found

Figure 3. α-diversity in young SSD patients on antipsychotic medication compared to young SSD patients not on antipsychotic medication. A) Species, B) GBMs, and C) Enzyme Shannon α-diversity in controls, n=52, young SSD patients not taking antipsychotic medication, n=9, and those on antipsychotic medication, n=43. The non-adjusted Shannon values are visualized in the plots. Differences were tested using linear regression adjusted for age, sex, BMI and read depth. Data are presented in box plots with boxes showing the 25th, 50th and 75th percentiles and the whiskers extending to the largest/smallest values no further than 1.5 × the inter-quartile range from the hinge.

Table 1
Baseline demographic and clinical characteristics of the participants included in the downstream analyses
Measured in hospitals in Stockholm according to the normal clinical chem-
a = During the fecal sampling none of the participants had an ongoing antibiotic treatment.
b = analysis (PCA) based on the food questionnaire data to capture diet intake. This PCA was done on all individuals included in Table 1.
Drinking alcohol = based on The Alcohol Use Disorders Identification Test (AUDIT-C). AUDIT-C score = 0 is Not drinking, AUDIT-C score > 0 is Drinking.
HbA1c = Glycosylated hemoglobin
IQR = interquartile range
NA = Not available
SSD = schizophrenia spectrum disorders
M. Stiernborg et al.