Extensive phylogenies of human development inferred from somatic mutations

Starting from the zygote, all cells in the human body continuously acquire mutations. Mutations shared between different cells imply a common progenitor and are thus naturally occurring markers for lineage tracing1,2. Here we reconstruct extensive phylogenies of normal tissues from three adult individuals using whole-genome sequencing of 511 laser capture microdissections. Reconstructed embryonic progenitors in the same generation of a phylogeny often contribute to different extents to the adult body. The degree of this asymmetry varies between individuals, with ratios between the two reconstructed daughter cells of the zygote ranging from 60:40 to 93:7. Asymmetries pervade subsequent generations and can differ between tissues in the same individual. The phylogenies resolve the spatial embryonic patterning of tissues, revealing contiguous patches of, on average, 301 crypts in the adult colonic epithelium derived from a most recent embryonic cell and also a spatial effect in brain development. Using data from ten additional men, we investigated the developmental split between soma and germline, with results suggesting an extraembryonic contribution to primordial germ cells. This research demonstrates that, despite reaching the same ultimate tissue patterns, early bottlenecks and lineage commitments lead to substantial variation in embryonic patterns both within and between individuals. Somatic mutations obtained from laser microdissected biopsies of human tissues are used to reconstruct the developmental phylogenies of these tissues back to the zygote.

Starting from the zygote, all cells in the human body continuously acquire mutations. Mutations shared between different cells imply a common progenitor and are thus naturally occurring markers for lineage tracing 1,2 . Here we reconstruct extensive phylogenies of normal tissues from three adult individuals using whole-genome sequencing of 511 laser capture microdissections. Reconstructed embryonic progenitors in the same generation of a phylogeny often contribute to different extents to the adult body. The degree of this asymmetry varies between individuals, with ratios between the two reconstructed daughter cells of the zygote ranging from 60:40 to 93:7. Asymmetries pervade subsequent generations and can differ between tissues in the same individual. The phylogenies resolve the spatial embryonic patterning of tissues, revealing contiguous patches of, on average, 301 crypts in the adult colonic epithelium derived from a most recent embryonic cell and also a spatial effect in brain development. Using data from ten additional men, we investigated the developmental split between soma and germline, with results suggesting an extraembryonic contribution to primordial germ cells. This research demonstrates that, despite reaching the same ultimate tissue patterns, early bottlenecks and lineage commitments lead to substantial variation in embryonic patterns both within and between individuals.
All cells in the adult human derive from a single fertilized egg following a well-orchestrated procession of cell division, migration and differentiation during early development that continues throughout life. Tracing cellular lineages can elucidate these fundamental developmental processes and has been widely used in model organisms. Early lineage tracing experiments relied on light microscopy 3 , an approach that is still used for studying early embryogenesis 4,5 but is impractical for long periods of time and complex organisms. Genetic tagging approaches 6,7 allow detection of introduced markers at later points, but rely on a discrete period of genome scarring. Their invasiveness precludes their application to human development.
From fertilization onwards, cells of the human body continuously experience genomic damage, either from intrinsic causes or exogenous exposure to mutagens 8 . Although almost all DNA damage is repaired and the genome is replicated with extremely high fidelity, cells steadily acquire somatic mutations throughout life. Some mutations that are present in a cell are shared with other cells. Because identical mutations rarely happen by chance, shared mutations often indicate a shared developmental history. Once gained, mutations are rarely lost. Thus, past developmental relationships from embryonic, fetal, childhood and adult phases of life can be identified from mutations in adult cells 1,2, [9][10][11][12][13] . Direct reconstruction of developmental phylogenies requires readouts of somatic mutations from single cells. This is possible through single-cell genomics 14,15 , but this approach can suffer from genomic dropout and artefacts introduced during whole-genome amplification. Alternatively, single stem cells can be expanded in vitro and subjected to standard whole-genome sequencing 7,10 . However, the spatial relationship between cells or clones is usually lost. Recently, low-input whole-genome sequencing following laser capture microdissection has allowed reliable calling of somatic mutations in single cell-derived physiological units, such as colonic crypts 16 , while retaining microscopic spatial information. Here we reconstruct large-scale phylogenies of cells from three individuals. The use of laser capture microdissection allows us to combine genetic and spatial information to elucidate the embryonic patterning of tissues (Fig. 1a). 25 different tissues from PD28690, 66 from 12 tissues from PD43850 and 110 from 12 tissues from PD43851 were subjected to low-input whole-genome sequencing (Fig. 1a, Supplementary Table 1).

Article
To reconstruct phylogenies, only microdissections with clear single cell-derived clones were included. While the clonal organization of most tissue units sequenced was unknown, the distribution of variant allele frequencies (VAFs) of acquired single-nucleotide variants (SNVs) in each microdissection informs on the clonality of the cell population. For instance, cells in a colonic crypt all derive from a single stem cell, translating into VAFs distributed around 0.5 (ref. 16 ). We estimated the number of clones and their prevalence in each microdissection (Extended Data Fig. 1; Methods). To include a microdissection in the reconstruction of phylogenies, a clone needed to account for the majority of cells in the sample.
Using sufficiently clonal samples, we constructed phylogenies through maximum parsimony ( Fig. 1; Methods). In total, we used 187 samples from PD28690, 62 samples from PD43850 and 106 samples from PD43851. All phylogenies exhibit a 'comb-like' structure, meaning that most genetic sharing occurs in a short amount of time at the top of the tree. Terminal branch lengths convey the mutation burden of the most recent common ancestors, generating individual microdissected cell populations. Intestinal crypts exhibit a high burden, whereas the burden in seminiferous tubules is much lower 17 .

Embryonic dynamics and asymmetric contributions
The topologies of the phylogenies reveal the pattern of cell division during development, mostly during the first days of embryogenesis ( Fig. 2a-c). There are important qualifications in the interpretation of these trees. First, while every internal node depicts an ancestral cell, not all ancestral cells are represented as nodes in the phylogeny. Owing to cell death or extraembryonic allocation, early cells may be cryptically present in branches or absent from the phylogeny altogether. Second, not all cell divisions can be resolved into bifurcations. If no mutation occurs during a division, it becomes undetectable and results in a multifurcation (polytomy).
In all cases, the tree starts in a bifurcating manner for roughly two generations before preferentially generating large multifurcations. This is consistent with a higher mutation rate in the first few divisions followed by a sharp decrease, possibly due to activation of the zygotic genome and consequent acquisition of the full capacity for DNA repair 18 . During the first two divisions, we estimate a mutation rate of 2.4 per cell per generation (95% CI: 1.7-3.2), dropping to 0.7 per cell per generation (95% CI: 0.51-0.77) during subsequent cell divisions (Methods; Extended Data Fig. 2), consistent with previous estimates 12,18,19 . However, this pattern may be partly due to more cell death or extraembryonic contribution during the earliest divisions, thus including more cryptic cell divisions.
In all phylogenies, the most recent common ancestor for all individual tissue types is the root of the tree, which plausibly (but not necessarily) represents the zygote 13 (Extended Data Fig. 3). The commonality of the most recent common ancestor indicates that no tissue studied has a monophyletic origin beyond the cell that generates the entire embryo. This polyphyletic tissue origin is consistent with lineage commitment at later stages of embryogenesis and the mixing of cells in the epiblast.
Previously, an asymmetric contribution of the two reconstructed daughter cells of the zygote was observed, with one contributing twice as much to the adult body as the other 1,2,10,13 . This is possibly caused by a bottleneck of cells allocated to the inner cell mass (ICM), from which the embryo proper arises. The contribution of embryonic progenitor cells on a population level can be measured through the VAFs of the mutations that define them in large ('bulk') pieces of tissue.
For PD28690, 33 bulk samples from 18 different organs were used to assess the body-wide contribution of embryonic progenitors. Mutations in the two reconstructed daughter cells exhibit different mean  (Fig. 2a).
As the sum of VAFs of these two lineages is 0.5, no appreciable lineages are missing from the first bifurcation (Extended Data Fig. 3). This asymmetry is reflected in most individual bulk samples (Fig. 2d). Using approximate Bayesian computation (Methods), we estimate a bottleneck consistent with picking three cells at the eight-cell stage to establish the ICM (Fig. 2e, Extended Data Fig. 2). Further asymmetries are present in subsequent generations of PD28690. The major lineage splits into two unequally contributing lineages with mean VAFs of 0.27 and 0.12, while the minor lineage splits into mean VAFs of 0.15 and 0.02. A single bottleneck of cell allocation to the trophectoderm or the ICM would not account for these later asymmetries. Therefore, this suggests that multiple, successive bottlenecks shape the quantitative contribution of individual embryonic progenitors to the adult body.
For PD43851, bulk brain and colon samples were available. In the primarily ectoderm-derived brain sample, the stark asymmetry (93:7, 95% CI: 87.9-96.1) indicates that it was almost entirely derived from one of the two reconstructed first cells (Fig. 2c). However, in the primarily endoderm-derived colon sample, the asymmetry is 81:19 (95% CI: 74.3-87.3), which is significantly different from the asymmetry in the brain (P = 0.004, likelihood ratio test) (Methods). Besides the brain, the only other ectodermal tissue sampled was the epidermis, microdissections of which were exclusively derived from the major lineage. By contrast, endoderm-derived microdissections are distributed on the phylogeny (81: 19) in line with the asymmetry in bulk colon (P = 0.86, two-sided binomial test), but not the brain (P = 0.005, two-sided binomial test), confirming the difference in asymmetry. The minor lineage is ancestral to the majority of seminiferous tubule microdissections, most from a distinct daughter lineage. Their distribution (major lineage:minor lineage 35:65) significantly deviates from the asymmetry in the brain and colon (P < 10 −8 for both comparisons, two-sided binomial test). These differences in asymmetry between tissue types further indicate that multiple bottlenecks shape the contribution of embryonic progenitors to different tissues.
Taken together, the degree of asymmetric contribution of the first two cells is highly variable between individuals. In one individual (PD28690), these asymmetric contributions are similar across all tissues, whereas in another individual (PD43851), there are substantial differences between tissues, demonstrating the existence of strong bottlenecks beyond blastulation.

Clonal patterns and microscopic mosaicism
Next, we combined genetic information with histological data to reconstruct spatial cellular patterns laid out in development. The intestinal epithelium is a two-dimensional array of crypts oriented perpendicular to  Article the luminal surface. We compared the physical distance and mutational sharing between any two crypts from the same biopsy in our dataset and from 36 other individuals sequenced similarly 16 to identify patches of contiguous gut epithelium derived from single embryonic cells (Fig. 3a, b). Although many crypt pairs shared zero or fewer than 15 SNVs, compatible with shared origins in very early embryogenesis (Extended Data Fig. 4), there was a second peak of sharing with a median of 27 SNVs, which is compatible with shared origins later in development. Assuming a circular patch, we simulated embryonic patches and estimate a patch size of 301 crypts derived from a single most recent embryonic progenitor cell (95% CI: 251-368) (Methods). The shared mutation burden reflects when these ancestor cells existed. In comparison with mutation burdens at known time points from studies of the fetal intestine 20 and blood 19 , these patches are seeded at approximately 8 weeks after conception.
Clonal expansions with a high mutation burden, thus arising later in life, are generally confined to a few events in intestinal crypts (Extended Data Fig. 5). These expansions are accompanied by driver mutations in canonical cancer genes, such as APC, BRAF or GNAS. The notable exception is an extensive expansion in the prostate of PD28690, which is consistent with benign prostatic hyperplasia, but without a discernible driver mutation.
Although tissues with distinct microscopic structures, such as intestinal epithelium, lend themselves to this analysis, echoes of embryonic patterning are present in other tissues. For example, epidermal microdissections often yielded polyclonal samples, which can be decomposed into embryonic lineages using VAFs of branch-defining mutations. VAF patterns in epidermal microdissections reveal that some are aggregates of multiple embryonic clones, that is, a mix of different seeding events (Extended Data Fig. 6). Others manifest as a single clonal lineage in the early embryo, indicating the existence of a single developmental founder cell, the progeny of which dominates this sample.

Organ-wide mosaicism in the brain
The VAF profile of embryonic mutations in an individual bulk sample is a summation of the developmental history of every cell in the sample. It represents an aggregation of the founder cell types and developmental bottlenecks, recent and in the more distant past of the population, even if the mutations are acquired before lineage commitment.
To assess the broad spatial patterning laid out in development, 86 different bulk samples from PD28690 (including 40 samples from various brain regions) were subjected to high-depth targeted resequencing of embryonic variants. Samples were clustered on the basis of the soft cosine similarity of VAF profiles of embryonic mutations (Methods). Brain samples fell into six different clusters (Fig. 3d). Adjacent biopsies from the cerebrum and midbrain often belonged to the same cluster. This may be a consequence of a shared developmental path and the seeding of those areas by similar populations of cells. Moreover, different brain regions show a heterogeneity in the prevalence of different clusters, with one cluster prominent in the midbrain. The correlation between the spatial and genetic distance was significant in the cerebrum but not in the midbrain (P = 0.039 and P = 0.27, Mantel test, respectively). Hence, despite these samples being taken millimetres apart and consisting of many different cell types, there is greater sharing of specific VAF patterns of pregastrulation mutations between neighbouring regions of the cerebrum than between distant regions. This indicates that developmental bottlenecks and lineage commitments generate a spatial effect in the mosaicism of the cerebrum.

Separation of PGCs from somatic lineages
The human primordial germ cell (PGC) lineage is thought to be the first to completely segregate from other embryonic lineages in the pregastrulation embryo 21 . Embryonic mutations in descendants of PGCs, such as seminiferous tubules of the testis, acquired before this segregation may therefore be seen in tissues derived from the three germ layers, unlike those arising later.
In PD28690 and PD43851, microdissections of seminiferous tubules shared on average 7.0 and 8.7 SNVs, respectively, with any other tissue microdissection. The observed segregation in PD28690 was confirmed by the absence of spermatogonia-specific variants in bulk samples (Extended Data Fig. 7). Further sequencing data from 162 microdissections of seminiferous tubules from 10 individuals (Supplementary Table 1) showed that, on average, seminiferous tubules shared at least 4.5 mutations with matched bulk samples (Fig. 4a).
In 5 of 12 individuals, some seminiferous tubules shared no SNVs with matched bulk samples. For example, in PD40745, the phylogeny splits into three seminiferous tubule clades with only two making a detectable contribution to bulk colon (Fig. 4b). Similar patterns of early segregation have been observed in tissues with an extraembryonic origin or contribution, such as the placenta 13 or fetal blood 19 , because they experience different developmental bottlenecks to the embryo proper.
It has been hypothesized that human PGCs originate from the posterior epiblast or nascent amnion 21 , which develops soon after the formation of the ICM and embryonic implantation (Extended Data Fig. 8). The late post-implantation epiblast gives rise to the three germ layers, whereas the amnion does not. The pattern of segregation suggests that   a subset of PGCs develop from cells that do not eventually contribute to the embryo proper. As a result of multiple, successive bottlenecks, these cells sometimes share no mutations with cells that generate the three germ layers (Extended Data Fig. 8). Thus, the results here are consistent with an extraembryonic contribution to human PGCs, possibly arising from the amnion. This could also explain the strong deviation in asymmetry in seminiferous tubules from PD43851 compared with the colon and brain.

Patterns of mutations and recurrent LOY
Embryonic SNVs exhibit mutational signatures ubiquitous in normal tissues (Extended Data Fig. 9; Methods). In PD42034, two different SNVs at the same genomic site (C>A and C>T) were present in two neighbouring clades of seminiferous tubules and detectable in blood (Fig. 4c). This may represent an unrepaired DNA lesion passed on during the first observed cell division and subsequently repaired differently in the two daughter lineages, constituting an in vivo observation of DNA lesion segregation 22 in normal tissues.
Although we observed mitochondrial SNVs shared between samples, their patterns were consistent with zygotic heteroplasmy, recurrent mutations, clonal mixing or sharing of late mutations. Therefore, they did not provide further phylogenetic information on embryonic development (Extended Data Fig. 9).
Other types of somatic mutations (insertions, deletions, copy number and structural variants) occur at a much lower rate than SNVs, limiting their use for lineage tracing. The only shared copy number variant was loss of the Y chromosome (LOY), which affected 6% of microdissections from PD28690. The distribution of LOY on the phylogeny suggested 12 independent recurrent, probably late events rather than a single early event (Extended Data Fig. 10). LOY affected multiple samples of the bladder urothelium, bile duct, adrenal gland and renal distal tubules, suggesting a tissue-specific propensity mirroring that reported among cancer types 23 . LOY has been associated with various age-related disorders and genomic instability 24,25 .
In contrast to reported widespread aneuploidy in pre-implantation human embryos 26,27 , we observed no early aneuploidy in any individual studied. This suggests that aneuploid cells are purified from the embryo proper, either through cell death or allocation to extraembryonic lineages.

Discussion
This research exemplifies somatic mutations as lineage tracing markers of human development. By constructing phylogenies from many normal tissue samples, it is possible to deduce the life history of a cell from its inception as a zygote, through embryogenesis and early development to its final destination as a differentiated adult cell.
The variability between individuals and tissues in the asymmetric contribution of embryonic cells to adult tissues suggests that the precise lineage commitment of cells is not fixed. Unobserved effects, such as the spatial positioning of individual cells in the early embryo, might be at the heart of seemingly stochastic patterns of asymmetries, which, nevertheless, ultimately result in the same developmental outcome. Whereas the early segregation of the trophectoderm and the ICM might explain the global asymmetry in PD28690, later bottlenecks may tune the asymmetry in individual tissues, as seen in PD43851. To be detectable at the limited sequencing coverage of these bulk samples, these bottlenecks must be tight, that is, many lineages derive from small populations of cells during development. After blastulation, the most notable developmental bottlenecks are the differentiation between the epiblast and the hypoblast, amniogenesis, gastrulation and organogenesis. These later bottlenecks may have resulted in the strong spatial effect on genetic mosaicism in the cerebrum of PD28690. The observed early segregation of PGC lineages suggests an extraembryonic contribution to these cells and is plausibly the consequence of a bottleneck between the amnion and the epiblast. In addition to these bottlenecks, cell competition and selection, especially when involving aneuploidies, probably have a strong role in early embryogenesis 28 .
This research demonstrates the variability of human embryogenesis in the shaping of different lineages and tissues through developmental bottlenecks. Understanding patterns of normal human development can contextualize disorders of development, especially when caused by post-zygotic genomic aberrations, such as early drivers of childhood cancers or mosaic overgrowth syndromes 8 . Further studies using somatic mutations as natural lineage markers may help to address fundamental questions of human ontogeny.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-021-03790-y.

Ethics statement and sample collection
Metadata on all samples can be found in Supplementary Table 1. For donor PD28690, multiple samples from 25 macroscopically normal tissues and organs were collected from a 78-year-old man during a rapid autopsy (rapid autopsy is defined as an autopsy with a post-mortem time interval of less than 6 h). This donor was a non-smoker who died of a metastatic oesophageal adenocarcinoma for which he had received a short course of palliative chemotherapy (5-6 weeks of oxaliplatin 7 weeks before his death). He had no other comorbidities. The samples were obtained with informed consent and collected in line with the protocols approved by the NRES Committee East of England (NHS National Research Ethics Service reference 13/ EE/0043). Every sampled tissue was photographed and the biopsy sites were documented. Once collected, all tissue biopsies were snap frozen in liquid nitrogen and subsequently stored at −80 °C.
For donors PD43850 and PD43851, multiple biopsies from 12 different tissues were collected from a 54-year-old woman (PD43850) and a 47-year-old man (PD43851); both individuals died of causes not related to cancer (traumatic injuries and acute coronary syndrome, respectively). Similarly, all samples were obtained within less than 6 h of death (1 h and 3 h, respectively), were snap frozen in liquid nitrogen and subsequently stored at −80 °C. The samples were obtained with informed consent and the use of these tissues was approved by the London, Surrey Research Ethics Committee (REC reference 17/LO/1801, 26 October 2017).
For additional testis and colon samples (from AmsBio, a commercial supplier), samples for PD40744, PD40745, PD40746, PD42563, PD42566 and PD42569 were obtained at autopsy from individuals who died from causes not related to cancer. The use of these tissues was approved by the London, Surrey Research Ethics Committee (REC reference 17/LO/1801, 26 October 2017). Samples for PD42036, PD42034, PD43727 and PD43726 were obtained from individuals who had non-testicular problems such as abdominal chronic pain and tissue distortion. The samples were obtained with informed consent and the use of these samples was approved by North East Newcastle & North Tyneside Research Ethics Committee (REC reference 12/NE/0395, 22 September 2009). As above, all collected samples were snap frozen in liquid nitrogen and subsequently stored at −80 °C.
Bulk DNA sequencing DNA was extracted from bulk biopsies, short insert (450 bp) genomic libraries were constructed, flow cells prepared and 150 bp paired-end sequencing clusters were generated on the Illumina HiSeq X Ten platform according to Illumina PCR library protocols. An overview of samples, including the average sequence coverage, is shown in Supplementary Table 1.

Laser capture microdissection and low-input DNA sequencing
Tissues were prepared for microdissection and libraries were constructed using enzymatic fragmentation as previously described 13,16,[29][30][31][32][33] and subsequently submitted for whole-genome sequencing on the Illumina HiSeq X Ten platform. An overview of samples, including the average sequence coverage, is shown in Supplementary Table 1.

DNA sequence alignment
All DNA sequences were aligned to the GRCh37d5 reference genome by the Burrows-Wheeler algorithm (BWA-MEM) 34 .

Somatic variant calling
SNVs were called using the CaVEMan (cancer variants through expectation maximisation) algorithm 35 against an in silico unmatched normal .bam file-generated human reference genome (GRCh37d5). In addition to the default CaVEMan filters, putative SNVs were required to have a median alignment score (ASMD) of at least 140 and fewer than half supporting reads being clipped (median number of soft-clipped bases, CLPM = 0). Where whole-genome sequencing data were obtained from laser capture microdissection (LCM) samples through the low-input pipeline, an additional set of filters was used to remove specific artefacts, as previously described 32 . These artefacts include spurious variant calls due to the formation of cruciform DNA and the double counting of variants due to a negative insert size.
Short insertions and deletions (indels) were called using the Pindel algorithm 36 , with calling and filtering strategies the same as for SNVs. Copy number variants (CNVs) were called using the Allele-Specific Copy number Analysis of Tumours (ASCAT) 37 and the Battenberg algorithms 38 . Since both ASCAT and Battenberg rely on the sharing of SNPs between query and normal sample, they cannot be run against the in silico-generated unmatched normal sample. To detect potential early embryonic CNVs, both ASCAT and Battenberg were run on all LCM samples against bulk samples and, alternatively, against an LCM sample known to be derived from the other branch of the first split in the phylogeny, and hence genetically unrelated on a post-zygotic level. No early embryonic CNVs were present in any of the phylogenies.
To detect structural variants, the BRASS (breakpoint analysis) algorithm 38 was used, which relies on discovery through discordantly mapped reads and confirmation by local reassembly of the breakpoints of the structural variant. The strategy for matching samples was identical to the strategy described for ASCAT and Battenberg. Again, no early structural variants were identified in any of the phylogenies.

Filtering germline mutations
Given that inherited variants are expected to be present at least at a heterozygous level in all cells, a one-sided exact binomial test can be used on the aggregated counts of reads supporting the variant and the total depth at that site 8,11,13,[29][30][31][32][33] . This tests whether the observed variant counts are likely to have come from a germline distribution or from a distribution with a lower true VAF (probably somatic). For sex chromosomes in male individuals, the binomial probability (true VAF) for comparison was set to 0.95 rather than 0.5. The resulting P values are corrected for multiple testing using a Benjamini-Hochberg correction 39 . Any variant with a corrected P value less than 10 −5 was categorized as a putative somatic variant.

Filtering shared artefacts
To filter out recurrent artefacts, a beta-binomial distribution was fitted to the number of variant-supporting reads and the total number of reads across samples from the same individual 8,11,13,[29][30][31][32][33]40 . For every somatic SNV, the maximum likelihood overdispersion parameter (ρ) was determined in a grid-based way (ranging the value of ρ from 10 −6 to 10 −0.05 ). A low overdispersion captures artefactual variants, as they appear seemingly randomly across samples and can be modelled as drawn from a binomial distribution. By contrast, true somatic variants will be present at a high VAF in some, but not all, genomes, and are thus best represented by a beta-binomial distribution with a high overdispersion. To distinguish artefacts from true variants, a threshold was set at ρ = 0.1, below which variants were considered to be artefacts. The code for this filtering approach is an adaptation of the Shearwater variant caller 41 .

Estimating clonality by truncated binomial mixture model
To estimate the clonal structure within a sample, we used a truncated binomial mixture model 8,11,13 . The truncated distribution is used to reflect the minimum number of supporting reads (n = 4) that is required by CaVEMan. The model will try to separate the overall SNVs into the clones that they could have arisen from, each with their own probability (VAF) and proportion (the amount of variants a clone contributes). The proportion of cells that inhabit a clone can be approximated by twice the estimated VAF of the clone. As no two clones with a VAF higher than 0.25 can coexist independently, but must be an adequate readout of a single lineage, only samples containing a clone with an estimated peak VAF higher than 0.25 were used for tree building.

Mitochondrial SNV calling
We extracted the reads mapped to the mitochondrial genome from the bam files aligned to the GRCh37d5 reference genome and generated a pile-up (using the bam2R function in the deepSNV R package; https://github.com/gerstung-lab/deepSNV) for the entire mitochondrial genome for all samples. For this purpose, a mapping quality ≥ 30 was accepted. ShearwaterML 41 was then used to detect mitochondrial variants, leveraging polyclonal normal samples and bulks from this study and previous publications 29,33 to estimate a site-specific error rate. Overdispersion values were estimated from the normal panel with the lower and upper bound set to [2 × 10 −4 , 1× 10 −0.5 ]. The resulting P values per position call were subjected to Benjamini-Hochberg correction 39 and a q-value cut-off of 0.01 was used to call somatic mutations. Homoplasmic germline sites were marked as uninformative in the normal panel and removed from the Shearwater calls before the false discovery rate filtering. Finally, these variants were mapped to the nuclear phylogenies.

Phylogeny reconstruction and mutation mapping
Phylogenies of microdissections were generated from the filtered substitutions using a maximum parsimony algorithm: MPBoot 42 . MPBoot was run with default parameters (command line: mpboot -s <alignment file> -bb 1000, in which 'bb' denotes 1,000 bootstrap iterations) on concatenated nucleotide sequences of variant sites. We use the VAF to inform the state of the mutations: '1' (present) for VAF ≥ 0.3, '0' (absent) for VAF ≤ 0.1 and '?' (unknown) for 0.1 < VAF < 0.3 to allow uncertainty due to noise in the VAF. To ensure the proper anchoring of the root of the phylogeny, an artificial nucleotide sequence composed of the reference genome bases at variant sites was included to simulate the ancestral state at the zygote. MPBoot reports a majority rule extended consensus of equally parsimonious trees, which we then used as an input for mutation mapping.
A considerable source of uncertainty in our data was introduced by the depth of sequencing, which can be variable across samples and variant sites. In using the VAF to determine the character state of a variant site, we lost the information of the number of counts supporting the variant and the read depth at the variant loci. Therefore, the mapping of mutations to branches was done by calculating the likelihood of the observed read counts of a mutated site for all tree branches, given a binomial probability of 0.5 if the mutation was present and 0 if absent. The likelihoods of the mutation belonging to a branch were then normalized and the maximum likelihood branch was determined. This approach relies on the infinite sites assumption. Rather than mutations being 'soft' assigned with a probability on each branch, mutations were 'hard' assigned to the most likely branch. This approach is implemented in the 'treemut' R package (https://github.com/NickWilliamsSanger/ treemut). Any branch with a length of 0 after SNV fitting was collapsed into a polytomy. See Supplementary Tables 2-4 for the individual trees  with branch lengths. This method also outputs a 'p_else_where', defined as 1 minus the probability of the mutation belonging on the maximum likelihood branch. If there are two (or more) branches with similar probabilities of containing the mutation, the value will be large, whereas it will be vanishingly small if the mutation fits well on only one specific branch. Hence, this metric can be cautiously used to assess the branch fit of individual mutations and identify remaining recurrent artefacts and potential homoplasies. The homoplasies listed in Supplementary  Table 5 were flagged in this way (p_else_where > 0.01), with the additional requirement that the samples sharing this mutation were derived from different biopsies. This was to prevent the sharing being due to shared contamination or imperfect clonal decomposition.
In addition to the algorithms described above, the ape and ggtree packages were used for analysis and visualization of phylogenetic trees in R.
Further validation of the phylogenies, including a comparison against an alternative tree building algorithm (IQ-TREE) 43 and further testing of the monophyly of tissues, can be found in the Supplementary Methods.

Mutation rate in early embryogenesis
To estimate mutation rates in the early parts of the tree, we calculated the mean mutation burden per branch in the first three generations, in accordance with the burdens being Poisson distributed, an assumption consistent with the data. For later generations, we used the prevalence of polytomies to estimate the mutation rate. For more detail, please see the Supplementary Methods.

Estimating asymmetry and the likelihood ratio test
The number of variant-supporting reads (N V ) and total reads (N R ) for mutations on the major branch were modelled with a binomial distribution, with underlying probability p, while those on the minor branch have probability 1 − p. The likelihood of the model was then given as the binomial probability of seeing N V successes out of N R trials, given p (or 1 − p). A grid-based approach was used to calculate the maximum likelihood estimates of the binomial probabilities in each scenario. Confidence intervals around this estimate were calculated based on the profile likelihood.
We used a likelihood ratio test to calculate whether two bulk samples have a significantly different asymmetry. The null model was that the underlying binomial probability (asymmetry) is the same in the two bulk samples (p 1 = p 2 ; joint maximum likelihood estimate for the samples), while the alternative model was that these are different (p 1 ≠ p 2 ; separate maximum likelihood estimate for each sample).

Modelling of the first bottleneck in early embryogenesis
To assess the first bottleneck in early embryogenesis, we used a simple model of the segregation of the trophectoderm and the ICM and approximate Bayesian computation to estimate the most likely parameter values for each of the three individuals.
The parameters of the simulation were as follows: (1) µ 1 , the mutation rate per cell per division before zygotic genome activation (ZGA), which we assume to take place at the four-cell stage. µ 1 was drawn from a uniform distribution between 1 and 6. This assumes a clock-like (that is, Poisson distributed) acquisition of mutations. (2) µ 2 , the mutation rate per cell per division after ZGA, drawn from a uniform distribution between 0.1 and 1.5. This assumes a clock-like (that is, Poisson distributed) acquisition of mutations. (3) s, the stage at which cells are selected to contribute to the ICM. The number of cells in the embryo corresponds to 2 s . s varied from 2 (4-cell stage) to 6 (64-cell stage).
(4) n, the number of cells forming the ICM. n varied from 2 to 2 s−1 , the latter with a maximum of 20 cells.
The values of s and n were drawn uniformly from a table with all their possible combinations, such that each possible pair was represented equally.
Using these parameters, we simulated an embryo with a total of 1,024 cells (corresponding to stage 10). The branch lengths for this tree were drawn from a Poisson distribution with a rate parameter of µ 1 (until the four-cell stage) or µ 2 (beyond the four-cell stage). Subsequently, n cells (nodes of the phylogeny) at stage s were selected and all other elements of the phylogeny were trimmed. The following summary statistics were then extracted: the number of lineages descending from the root of the phylogeny (for example, 2 representing a bifurcation); the observed asymmetry in contribution of daughter lineages, taken as the fraction of tips belonging to the most populous clade; the mutation burdens of the first two branches; and the degree of multifurcation before a mutation burden of five SNVs, defined as the number of branches without any length divided by the total number of branches.
We ran a total of 2,000,000 simulations. Using the 'abc' package in R, for each individual, the posterior distributions for the parameter estimates were generated from the 1% best simulations as determined by the Euclidean distance of their summary statistics to the one observed for the individual (Extended Data Fig. 2).
To assess the possible effect of cell death on these estimates, we repeated this analysis with an additional parameter of cell death rate, varied between 0 and 0.1. For a given asymmetry, the most observed bottlenecks are the same with and without incorporating cell death, but many more solutions are now possible and observed (Extended Data Fig. 2).

Calculation of embryonic patch size in colonic crypts
To assess what generally happens in the development of colonic crypts, we supplemented the data from our individuals with a published dataset 16 (326 crypts from 36 patients), which uses the same sampling and sequencing strategy. We subjected this data to the exact same variant calling procedures. Physical distance was quantified in the number of crypts, where a distance of '1' denotes neighbouring crypts. Genetic sharedness was quantified as the number of SNVs in shared ancestral branches between a pair of crypts.
To avoid incorporating later crypt fission events, we excluded pairs of crypts that shared more than 200 SNVs. We then looked at the distribution of physical distances and genetic sharedness (Extended Data Fig. 5a). Many pairs of crypts from the same section shared none or fewer than 15 SNVs. These crypts derived from different daughter cells of the zygote or from progenitors that have shared only a few post-zygotic cell divisions. This sharing will have occurred in the very early embryo, before organogenesis, and is therefore uninformative for the estimation of embryonic patch sizes. There was a distribution of shared mutations above 15 SNVs, which reflects the sharing as a likely consequence of embryonic patch formation (Extended Data Fig. 5b).
To estimate the crypt patch size from our data, we simulated spatial sampling and used approximate Bayesian computation to evaluate the posterior distribution of radii of embryonic patches, which we assume to be circular in shape. In brief, the model works as follows: (1) sample an embryonic patch size radius (r) from a uniform distribution between 2 and 20 crypts (in steps of 0.01).
(2) Uniformly sample an integer point within a circle with radius r, centred around the origin. (3) Sample an integer distance from that point. To account for our bias in sampling, this distance was drawn from a distribution with the same distance frequencies as our data. (4) Evaluate whether the second point falls within the circle of radius r. (5) Repeat this as many times as there were observations in the original dataset. For each distance evaluated, return the counts of crypt pairs belonging to the same patch, which will act as the summary statistics. To avoid noisy observations, we included distances for which we had more than 10 observations, corresponding to the range of distances of 1-17 crypts.
This simulation was run 50,000 times. We approximate the posterior distribution of the patch size radius by accepting the 5% simulations with the lowest Euclidean distance to the original sharing counts per distance. This was followed by a neural network regression method, as implemented in R package 'abc', to gain a more precise estimate (Extended Data Fig. 5c). The posterior distribution estimated from rejection alone had a mean radius of 8.68 (95% confidence interval: 6.03-11.64), which translates to an embryonic patch size of 237 (95% confidence interval: 114-425). The weighted posterior distribution estimated from the neural network regression method had a mean radius of 9.79 (95% confidence interval: 8.94-10.82), which indicates an embryonic patch size of 301 (95% confidence interval: 251-368). The estimated size of the embryonic patches in the colon that we obtained from our sequencing data is consistent with the estimate obtained through X-inactivation studies 44 .

Targeted sequencing of embryonic variants
Following the identification of early embryonic variants in PD28690, a custom bait set for the Agilent SureSelectXT platform was designed using an online tool by Agilent. Repetitive regions were masked using the criterion of least stringency, as defined by Agilent. Each embryonic variant was covered by one tile. In addition to early variants, testis-specific somatic mutations, as well as heterozygous single-nucleotide polymorphisms, were included in the bait set. The testis-specific somatic mutations were included to assess the split between the germline and the somatic tissues, while the heterozygous single-nucleotide polymorphisms were used to assess any biases in the performance of the pulldown. DNA libraries from 86 bulk samples were made and subsequently hybridized with the baits. The sequencing was performed on the Illumina HiSeq 2500 platform.

Soft cosine similarity
To compare the similarity between two vectors of VAFs of variants positioned on a phylogenetic tree, a soft cosine similarity was calculated, which includes a specific similarity term to incorporate the dependence of observations 19 . That is, variants on the same branch of the phylogeny will convey the same information, whereas variants of different branches of the tree will provide parallel information on lineage composition. As such, an interaction term s i,j is introduced into the following definition of the soft cosine similarity: In this relation, a and b refer to two bulk samples and their vector of VAFs of embryonic mutations of length N, which are indexed by i and j. These VAFs are log-transformed. The similarity metric s i,j is defined as follows: To use the matrix of soft cosine similarities as a basis for clustering, it was converted to a soft cosine distance matrix by subtraction from 1 (that is, if the soft cosine similarity between a and b is 0.7, the distance will be 0.3). We then performed hierarchical clustering using the nearest neighbour complete linkage method.
Significant correlations between the spatial and genetic distance matrices were assessed using the Mantel test, which is independent of the clustering.

Embryonic SNVs and signature fitting
For the three individuals subjected to a rapid autopsy, an embryonic SNV was defined as one present on a branch that was ancestral to two different tissue types. In the ten men of whom primarily seminiferous tubules were samples, an embryonic SNV was defined as one shared between a seminiferous tubule and the matched bulk sample. Embryonic SNVs are listed in Supplementary Table 6.
Mutational signatures (COSMIC v.3.1) were fitted to the trinucleotide mutation counts using SigFit 45 in the same manner as previously applied 46 .

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
The DNA sequencing data are deposited in the European Genome-Phenome Archive (EGA) with the accession codes EGAD00001006641 (whole-genome sequencing) and EGAD00001006643 (targeted sequencing). Fig. 8 | Early embryogenesis and bottlenecks. a, Overview of lineage commitments in the early human embryo, up until gastrulation and early organogenesis. The blue arrows indicate the putative contribution of extraembryonic cells to embryonic lineages (for yolk sac haematopoiesis and intercalation of the endoderm) or lineages with an unknown origin (primordial germ cells). b, Schematic of the possible influence of multiple, successive bottlenecks on the eradication of a specific lineage in a certain population of cells. The two daughter lineages of the zygote are coloured in red and blue. Note that this is a toy example merely for illustration and the relative cell numbers or size of the bottlenecks need not represent reality. Fig. 9 | Patterns of mitochondrial and nuclear SNVs. a-d, Phylogenies of nuclear SNVs with the VAF of mitochondrial mutations overlaid on them, showing a late shared SNV (a), an SNV that was heteroplasmic in the zygote (b), an SNV that is consistent with a shared subclone or stromal contamination (c) and an SNV recurrently acquired in samples from different tissues (d). e, Mutational spectrum and decomposition of early embryonic nuclear SNVs.

Corresponding author(s): Michael R Stratton
Last updated by author(s): Jun 22, 2021 Reporting Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code Data collection No software was required to collect data.