Cellular and transcriptional diversity over the course of human lactation

Human breast milk is a dynamic fluid that contains millions of cells, but their identities and phenotypic properties are poorly understood. We used single-cell RNA-seq (scRNA-seq) to characterize the transcriptomes of cells from human breast milk (hBM) across lactational time from 3 to 632 days postpartum in 15 donors. We find that the majority of cells in human breast milk are lactocytes, a specialized epithelial subset, and cell type frequencies shift over the course of lactation yielding greater epithelial diversity at later points. Analysis of lactocytes reveals a continuum of cell states characterized by transcriptional changes in hormone, growth factor, and milk production related pathways. Generalized additive models suggest that one sub-cluster, LALBAlow epithelial cells, increase as a function of time postpartum, daycare attendance, and the use of hormonal birth control. We identify several sub-clusters of macrophages in hBM that are enriched for tolerogenic functions, possibly playing a role in protecting the mammary gland during lactation. Our description of the cellular components of breast milk, their association with maternal-infant dyad metadata and quantification of alterations at the gene and pathways levels provides the first detailed longitudinal picture of human breast milk cells across lactational time. This work paves the way for future investigations of how a potential division of cellular labor and differential hormone regulation might be leveraged therapeutically to support healthy lactation and potentially aid in milk production.


INTRODUCTION 44
Human breast milk (hBM) is the nutritional food source evolved specifically to meet 45 the needs of infants. 1 Feeding exclusively with hBM is currently recommended for the first 46 six months of life, and this is one of the strongest preventative measures against mortality 47 in children under 5 years old. 2 In addition, breastfeeding has been linked to long-term 48 health benefits for both infants and nursing mothers. 1,3,4 Breastfed infants have decreased 49 infections 5 , improved gut and intestinal development 6 , and improved regulation of weight 50 long after termination of breastfeeding. 7 Additionally, nursing mothers have a decreased 51 risk of ovarian and breast cancers. [8][9][10] Given that lactation and nursing provide 52 unprecedented health benefits to mothers and infants, there is a need to better 53 understand the molecular and cellular features of hBM, and broadly, how these may 54 correlate with maternal and infant lifestyles and health. 55 The stages of lactation are canonically described as colostrum (0-5 days 56 postpartum), transitional (6-14 days postpartum), and mature (>15 days postpartum) 57 followed by involution, which begins within hours of the cessation of lactation. 11,12 During 58 pregnancy, lactation and involution, the human mammary gland undergoes drastic 59 remodeling that requires coordinated shifts in tissue architecture and cellular composition 60 guided by hormonal cues. 13,14 During lactation, the cells of the mammary gland are 61 responsible for synthesizing and transporting the diverse components of hBM as well as 62 responding to tightly regulated and highly responsive signals maintaining lactational 63 viability. A mechanistic understanding of the cellular composition, activities, and 64 regulation of the human mammary gland in the period between the establishment of 65 lactation and involution is essential for understanding environmental factors that impact 66 milk production, the responsiveness of the breast to the changing nutritional needs of the 67 infant, and the mechanisms of long-term lactation. However, given the unique nature of 68 this tissue niche, it is challenging to study lactating tissue directly in humans. To date, several studies have used either bulk 12,25-28 or single-cell RNA-90 sequencing (scRNA-seq) 16,29 to study the transcriptome of hBM in small cohorts. These 91 studies have revealed subsets of epithelial cells in hBM, as well as progenitor luminal 92 cells, and genes that change in bulk over the course of lactation. Bulk analysis, however, 93 limits our ability to delineate key cell states and uncover specialized cell phenotypes. 30,31 94 scRNA-seq analyses to date have also been limited by low sample numbers and small 95 donor pools, thereby decreasing the ability to characterize the cross-donor heterogeneity 96 of breast milk longintudinally. 1,28,32 Longitudinal studies of other factors in milk 97 composition have characterized dynamic shifts in hormone concentrations 33-35 , cytokine 98 content 36-38 , and overall protein content 39 up to 3 months postpartum suggesting that 99 most components decrease in concentration early in lactation. However, no 100 transcriptomic studies to date have captured the full range of lactation across time. How 101 these dynamic milk changes relate to lactocytes and immune cell function, are also not 102 well understood. 11 103 In order to better understand cellular dynamics and longitudinal lactational 104 heterogeneity, we sought to characterize the transcriptomics of hBM-derived cells using 105 scRNA-seq on longitudinal samples. hBM was collected longitudinally from 15 human 106 donors across various stages of lactation (Supplemental Table 1, Figure 1A). For each 107 sample, we collected a rich set of information about the mother-infant dyad, including 108 vaccine history, illness, and daycare status. To our knowledge, we have generated the 109 first single-cell analysis of hBM-resident cells over the course of lactation, with a dataset 110 comprised of over 48,478 cells from 50 samples. We identify key cell subsets, including 111 immune cells and epithelial cells at each lactation stage. We further identify several 112 factors that are associated with alterations in cell frequencies over lactation, including the 113 use of hormonal birth control and the start of daycare. We also nominate many pathways 114 and genes that are altered in epithelial subsets over the course of lactation, including 115 To identify marker genes for celltype clusters whose specificity to Leiden clusters or cell 194 subgroups was consistent across donors and samples, we utilized pseudobulk marker 195 gene identification [47][48][49] . Raw gene expression counts were pooled by sample and cluster 196 such that one pseudobulk population was created for each cluster found in each sample. 197 Psuedobulk groups were filtered to include only sample-subcluster pairs containing at 198 least 10 cells. Differential expression between clusters of one celltype and all other 199 clusters was executed using a Wald test in DESeq2 50 with the design formula "~donor + 200 is.thiscelltype" where the factor 'is.thiscelltype' is set to TRUE for pseudobulk populations 201 from the cluster of interest and FALSE for other clusters. These pseudobulk marker genes 202 were filtered for adjusted p value < 0.05, percent expression of single cells in the cluster 203 > 30%, and DESeq2-calculated log2 fold change > 0.4. Pseudobulk marker genes of all 204 cell types (Supplemental Table 2) and epithelial cell groups (Supplemental Table 3) and 205 top marker genes sorted by difference in percent of cells expressing in-cluster compared 206 to out-of-cluster are visualized in Figure S7E and Figure 3D, respectively. 207 208 Epithelial cell sub-clustering. 209 Epithelial sub-clustering was performed on combined cells from all samples to identify 210 major cell states within the data and characterize their changes in gene expression over 211 the course of lactation. To enable these analyses, we identified cell groups which were 212 either distinct enough to be robust to clustering parameter selection, or, for groups of cells 213 whose core identifying gene expression profiles could not be defined with respect to other 214 clusters, similar clusters were merged and further analysis identified genes changing over 215 time. Sub-clustering proceeded by re-discovering the top 3,000 variable genes on the 216 epithelial subset, re-running PCA on these genes, and clustering with Leiden clustering 217 with resolution 0.5 and 10 neighbors on 22 principal components ( Figure S4A). Clusters 218 0, 1, 2, and 3 were merged into the secretory lactocyte cluster due to shared expression 219 of various canonical lactation-related genes ( Figure S7F). Despite many shared functions 220 with clusters 0, 1, 2, and 3, cluster 5 was left as its own cluster due to high mitochondrial 221 gene percentage ( Figure S7G). Clusters 9, 6 and 8 shared a distinct transcriptional 222 signature and were merged into the LALBA low epithelial cluster. Clusters 4 and 11 were 223 merged into a single KRT high 1 cluster due to cluster 11's specificity to a. single donor, 224 and cluster 7 remained as a single KRT high 2 cluster. Additionally, these clusters were 225 robust to leave one out clustering. 226 227 Immune cell sub-clustering. Immune cells were sub-clustered separately and re-filtered 228 to remove additional doublets. To accomplish the latter, immune cells were clustered with 229 a known subset of secretory epithelial cells from our epithelial cell data. This allowed us to generate a gene signature derived of PC1-specific genes to define lactocytes or 231 monocytes with high confidence (Supplemental Table 4). We performed module scoring 232 with these in R (v3.6.2) with Seurat (V3), allowing us to stringently filter for immune cells 233 that scored highly for lactocyte gene expression (>2.5 standard deviations above the 234 mean lactocyte module score) 51 . Finally, we identified any additional doublets based on 235 dual expression of key lineage markers as described above. We performed sub-clustering 236 analyses by re-normalizing the data, finding the top 2000 variable genes, re-scaling the 237 data, running PCA, then performing additional UMAP visualization with the first 15 238 principal components. Supervised marker gene identification was performed across cell 239 types using Seurat's Wilcoxon rank-sum test. We also performed sub-clustering analyses 240 on the monocytes and macrophages as these were the most abundant immune cell type. 241 These cells were re-normalized, the top 2,000 variable genes were identified, and the 242 data was clustered across several resolutions to identify resolutions that produced non-243 redundant clusters (resolution = 0.2) as determined by marker gene identification using 244 Seurat's Wilcoxon rank-sum test. 245

Identification of time-varying genes. 246
Time-associated genes were identified for each cluster using pseudo-bulk analysis. First 247 the raw counts all cells in each sample in each cluster were summed to create sample 248 and cluster specific pseudobulk data. Then DESeq2 was used to identify genes varying 249 over the course of lactation in each sub-cluster using a likelihood ratio test between the 250 design formula "~ 0 + donor + days_post_partum" over "~0 + donor". Samples with a 251 minimum of 10 cells in a cluster were included in the analysis, and samples from more  Table 5A,B). 263 We classified universal epithelial cell time varying genes as genes associated with time 264 and changing in the same direction in both LALBA low epithelial and secretory epithelial 265 subsets (Supplemental Table 5C,D). Time varying genes in opposite directions in the 266 LALBA low epithelial and secretory epithelial subsets were also identified (Supplemental 267   Table 5E). 268

Identification of metadata associated cellular populations 269
Associations between collected covariates and cellular population proportions were  Table 6). In 281 cases where multiple covariates were significantly associated with one celltype 282 proportion, a model including both was run. Specifically, LALBA low epithelial cell 283 proportion was modeled as 'LALBA low _proportion ~ donor + daycare + 284 hormonal_birthcontrol + s(time_post_partum_days, k=7)'. Full model results are shown 285 in Supplemental Table 6. 286 287

Functional enrichment analysis on epithelial cells. 288
Functional enrichment analysis on top marker genes was performed using Enrichr using 289 the gseapy package with the gene set GO_Biological_Processes_2021. 53,54 Due to the 290 hierarchical structure of the GO database and the overlapping functions of many of the 291 marker genes of the epithelial cell subclusters, representative GO terms were identified 292 through a series of filtering and curating steps. For each subcluster, significantly enriched 293 terms were grouped based on shared marker genes found to be overlapping with the GO 294 term. These grouped terms were further grouped between subclusters based on shared 295 term ID or shared genes. The mean gene set score was calculated for each epithelial cell 296 group and enriched GO term using the scanpy function "score_genes". For each group 297 of GO terms, the terms with the highest variance of mean gene scores across epithelial 298 subgroup was chosen such that each epithelial subgroup had between 7 and 15 GO terms 299 for which they had the maximum mean gene score. To avoid redundant terms, GO terms 300 were also merged based on high overlap of genes in the full reference GO term gene list. 301 Heatmap visualizations display per-subset mean gene set score for all genes in the GO 302 term z-scored across subsets. Time-dependent enriched GO terms were identified for 303 genes positively and negatively associated with time postpartum separately and for both 304 LALBA low epithelial and secretory lactocyte clusters. These GO terms were similarly 305 curated with an additional filtering step of correlation of the gene set scores over time 306 postpartum in the same direction as the set of differential genes used (e.g. positive 307 correlation for GO terms enriched in the gene list increasing with time). GO terms 308 identified to be changing in the same direction in both the LALBA low epithelial and 309 secretory lactocyte clusters were considered epithelial cell-wide time-varying processes.

We identify major cell types of the breast epithelium and immune cells in 325
human breast milk over the course of lactation. We first optimized a process for 326 generating scRNA-seq data from cells in hBM. Previous studies characterize how sample 327 handling, as well as methods used for cell isolation, can significantly impact the 328 transcriptomes of isolated cells. 55, 56 We compared several workflows for upstream 329 handing of collected hBM -including: fresh isolation of cells, holding at 4°C overnight 330 until cell isolation, and a single freeze thaw of whole milk before isolating cells, as well as 331 several methods for isolating cells, including: sorting live cells, live cell enrichment with a 332 bead-based kit, or centrifugal isolation of fresh cells as previously described 333 (Supplemental Figure 1). We found that for each method, except for freezing, quality 334 control metrics were comparable and we identified expected cell types in milk, including 335 epithelial and immune cell subsets (Supplemental Figure 1B and C). Fresh processing, 336 sorted cells, or live-enriched cells clustered together in PC space, suggesting little gain 337 by additional processing prior to performing scRNA-seq. Additionally, we found that in 338 one donor, fresh but not frozen processing allowed us to retain macrophages 339 (Supplemental Figure 1D). In agreement with previous studies, we found that isolation of 340 cells from fresh milk resulted in the highest quality data and we therefor used this method 341 for our samples analysis. hBM-derived cells, and the overall composition of our cohort metadata, we plotted total 383 cell counts and cell type frequencies over time for each sample in our cohort (Figure 2A). 384 We found that the total cell counts per milliliter of milk decreased over the course of 385 lactation, agreeing with previous literature showing a decrease in total cell counts in 386 mature milk (Supplemental Figure 3). 25 We also found that the majority of our cohort were 387 directly breastfeeding, with 5 donors (9 samples) additionally supplementing with formula 388 and six donors (9 samples) reporting supplementation with solid foods. Several donors 389 reported breast soreness periodically over the course of the study, with only one donor 390 reporting mastitis at sample collection (Supplemental Table 7). Additionally, none of our 391 donors reported menstruating at the time of sample donation and four were on hormonal 392 birth controls or other reported medications. Finally, we had three donors that had begun 393 weaning and six whose children had started daycare during our study. Globally, the 394 variability in reported metadata allowed us to determine how cellular composition may be 395 impacted by shifts in time, lifestyle and maternal and/or infant health status. 396 We tested the association between the abundance of identified cell types with any 397 reported metadata using generalized additive models (Supplemental Table 6). While we 398 found that nothing was significantly associated following correction for multiple 399 hypotheses, we did find some associations indicating potential heterogeneity. We found 400 that macrophage 1 proportion associated with formula supplementation, LALBA low 401 epithelial cell proportion positively associated with daycare attendance and with use of 402 hormonal birth control, and dendritic cell proportion negatively associated with use of hormonal birth control and with infant vaccinations (Figure 2A). We noted that a 404 substantial amount of variability in these cell compositions can be attributed to individual 405 donors with a single donor consistently showing substantially larger macrophage 406 proportions (BM05) and all of the fibroblast cells coming from two donors (BM16, BM17). 407 We acknowledge that given our study design, often donor is conflated with certain 408 metadata features. 409 We next sought to refine our understanding of which cell types were correlated were macrophages, agreeing with previous literature. 60 We next wanted to better 437 understand the potential functions and phenotypes of macrophages in hBM given that 438 their percentages were altered in response to formula supplementation. We performed 439 sub-clustering analyses and functional enrichment of marker genes that were identified 440 for each sub-cluster ( Figure 3A and 3B, Supplemental Table 8). We found five sub-441 clusters of macrophages that span lactation stage, where macrophage sub-cluster 0 is 442 predominantly from early milk stages, and macrophage sub-cluster 3 is predominantly 443 from later stages and donor BM16. Macrophage sub-clusters were defined by distinct 444 gene signatures and pathway enrichment results ( Figure 3B and C). Macrophage sub-445 clusters 0, 1, and 4 were defined by pathways related to interactions with T cells, 446 neutrophils, and immune tolerance, including IL-10 and PD-1 related pathways. These 447 enrichments were driven by unique sets of genes present in each sub-cluster 448 (Supplemental Table 11). Interestingly, macrophage sub-cluster 0 was defined by several Finally, we determined if three meta-data variables of interest, including infant 473 medical events, weaning status, and daycare status, had any compositional variation 474 across sub-clusters ( Figure 3E). We found that sub-cluster 0 had the highest proportion 475 of reported medical events, which includes both vaccines and illness. Second, we found 476 that weaning-derived macrophages were predominantly found in sub-cluster 4 (  The cycling epithelial sub-cluster was defined by the expression of cell-cycle genes 518 (STMN1, TOP2A) and was enriched for cell-cycle related processes as well as several 519 metabolic processes, tetrahydrofolate metabolic process and purine nucleobase 520 biosynthetic process. This sub-cluster is also composed entirely of cells whose cell-cycle 521 score indicated they were in the G2M and S phases (Supplemental Figure 4C). The MT-522 high cluster was defined by similar gene expression to the secretory epithelial cells but 523 with higher mitochondrial gene proportion (Supplemental Figure 4G). While mitochondrial 524 RNA percentage is often used as a metric for dead or dying cells in scRNA-seq analysis, 525 we maintained this cluster in the dataset because it met our very conservative threshold 526 for mitochondrial RNA percentage, showed an interesting trend of increasing proportion 527 over time, and may relate to altered metabolic activity in these cells. 68 528 The KRT high lactocyte 1 cluster was defined by expression of cytoskeleton and 529 structural genes (S100A9, KRT15, KRT8, VIM) as well as immune response genes 530 DEB1, IFITM3, CD74, HLA-B). This sub-cluster is enriched in genes in the actin 531 polymerization or depolymerization pathway, positive regulation of defense response, 532 positive regulation of translation pathways as well as several signaling pathways. The 533 KRT high lactocyte 2 sub-cluster was enriched for similar pathways to the KRT high 534 lactocyte 1 group, but this sub-cluster shares fewer high-scoring pathways with the 535 LALBA low lactocyte sub-cluster suggesting more of a supporting role in milk production. 536 Finally, we determined how these sub-clusters were changing in proportion as a 537 function of lactation stage ( Figure 4E). Globally, we found that the cellular composition of 538 later lactational timepoints was more diverse as compared to earlier time points, where in hBM. So we next asked which genes and pathways also changed over the course of 565 lactation in epithelial cells. To accomplish this, we performed differential expression with 566 pseudo-bulk populations across time postpartum within each epithelial sub-cluster (see 567 Methods). We found that there were many genes that were differentially expressed over 568 time across all epithelial cells, including several that decrease over time such as APP, 569 KRT15, and FTH1, and several that increase over lactational time, such as LYZ and TCN1 570 ( Figure 5A). Lysozyme, encoded by the transcript LYZ, one of the most abundant 571 bioactive components of milk, has previously been shown to increase in later stages of 572 lactation. 69 573 Many genes were altered over the course of lactation that were unique to each 574 identified epithelial sub-cluster, with the majority of differentially expressed genes 575 identified in LALBA low epithelial and secretory lactocyte sub-clusters (Supplemental Table  576 5A-B). Enrichment analyses of these differentially expressed genes (see Methods), 577 identified both shared and distinct pathways that changed with time in both cell sub-578 clusters ( Figure 5B, C and Supplemental Figure 6A). Several pathways change in 579 expression over the course of lactation in secretory lactocytes, including a decrease in 580 gluconeogenesis and oxidative phosphorylation over time, and an increase in the 581 regulation of secretion and lipid metabolic processes ( Figure 5B). The cholesterol 582 biosynthesis pathway is enriched in both cell sub-clusters, but increases over time in 583 secretory lactocytes and decreases over the course of lactation in the LALBA low epithelial 584 sub-cluster ( Figure 5C). Additionally, over the course of lactation, pathway scores for 585 TGF-beta signaling, chromatin remodeling factors, cytoskeletal transport, vesicle 586 mediated transport, and apoptosis all increase in LALBA low epithelial cells with time 587 postpartum. Taken together, we identified many pathways that are differentially altered 588 with lactation time in the major sub-clusters of epithelial cells. 589 In order to nominate key genes and factors that might be responsible for pathway-590 level changes in these two sub-clusters, we looked at the expression of key regulators 591 that were differentially expressed with time postpartum, including those important for 592 hormone signaling, growth factor signaling, AP-1 signaling, factors involved in STAT5 593 signaling, and several milk production component genes ( Figure 5D with less potential bias. Given our limited sample processing (e.g. no staining or sorting), 666 we may have also recovered more macrophages than previous studies. To our 667 knowledge, our study is also the first to correlate maternal-infant dyad metadata with cell 668 proportions over the full course of lactation. We found that the proportion of LALBA low 669 epithelial cells and B cells were associated with time postpartum using generalized 670 additive models; however, we acknowledge that the overall frequency of B cells in our 671 final dataset was low and precluded more in-depth analyses. Given that B cells are critical 672 to the production of antibodies and these in turn shape early immune system We provide, in great detail, the epithelial cell sub-clusters in which key genes are 699 changing across both time and many donors, allowing us to gain insights into potential 700 alterations in milk transport, synthesis, and production. The LALBA low epithelial cell 701 cluster, whose marker genes are enriched for genes involved in tight junctions, increases 702 in abundance over the course of lactation while we see a decrease in the proportional 703 abundance of the secretory lactocyte sub-cluster whose core enriched functions involve 704 milk component synthesis and secretion. We also see a decrease in milk component 705 synthesis related genes (UGP2, CHRDL2) ( Figure 5A) and a decrease in the GO terms 706 gluconeogenesis, hexose biosynthetic process, glucose metabolic process over time in 707 both clusters ( Figure S6). This might suggest a decrease in transcription of milk 708 component related genes over the course of lactation. Previous studies have shown a 709 linear decrease in overall protein concentration in milk over the course of lactation as well as decrease in concentrations of proteins involved in lactose and HMO synthesis. [76][77][78] In 711 addition, due to our long follow up study, we were able to capture late stages of mature 712 milk (late 2-4), when usually complimentary food are presented to babies, and milk 713 demand and production decreased over time. Increased cellular specialization and 714 altered abundance of epithelial sub-clusters that we describe may provide mechanistic 715 insights into changes in the maintenance of milk secretion over the course of lactation. 716 Future work should specifically seek to understand how this relates to milk component 717 production as synthesis in the mammary gland, transport from maternal serum, or milk 718 volume production. We thank the study participants and their families for enabling this research, Nancy 784 Tran and other members of the Shalek and Berger labs for thoughtful discussions and 785 feedback. We thank the Single Cell Portal, Terra, and Cumulus teams at the Broad 786 Institute for support on data processing pipelines and data sharing. This work was