Time-resolved comparative molecular evolution of oxygenic photosynthesis

Oxygenic photosynthesis starts with the oxidation of water to O2, a light-driven reaction catalysed by photosystem II. Cyanobacteria are the only prokaryotes capable of water oxidation and therefore, it is assumed that the origin of oxygenic photosynthesis is a late innovation relative to the origin of life and bioenergetics. However, when exactly water oxidation originated remains an unanswered question. Here we use phylogenetic analysis to study a gene duplication event that is unique to photosystem II: the duplication that led to the evolution of the core antenna subunits CP43 and CP47. We compare the changes in the rates of evolution of this duplication with those of some of the oldest well-described events in the history of life: namely, the duplication leading to the Alpha and Beta subunits of the catalytic head of ATP synthase, and the divergence of archaeal and bacterial RNA polymerases and ribosomes. We also compare it with more recent events such as the duplication of Cyanobacteria-specific FtsH metalloprotease subunits and the radiation leading to Margulisbacteria, Sericytochromatia, Vampirovibrionia, and other clades containing anoxygenic phototrophs. We demonstrate that the ancestral core duplication of photosystem II exhibits patterns in the rates of protein evolution through geological time that are nearly identical to those of the ATP synthase, RNA polymerase, or the ribosome. Furthermore, we use ancestral sequence reconstruction in combination with comparative structural biology of photosystem subunits, to provide additional evidence supporting the premise that water oxidation had originated before the ancestral core duplications. Our work suggests that photosynthetic water oxidation originated closer to the origin of life and bioenergetics than can be documented based on phylogenetic or phylogenomic species


Evolution of Cyanobacteria
The origin of oxygenic photosynthesis is considered a turning point in the history of life, marking the transition from the ancient world of anaerobes into a productive aerobic world that permitted the emergence of complex life [1]. Oxygenic photosynthesis starts with photosystem II (PSII), the water-oxidizing and O 2 -evolving enzyme of Cyanobacteria and photosynthetic eukaryotes. PSII is a highly conserved, multicomponent, membrane protein complex, which was inherited by the most recent common ancestor (MRCA) of Cyanobacteria in a form that is structurally and functionally similar to that found in all extant species [2]. Thus, the origin of oxygenic photosynthesis antedates the MRCA of Cyanobacteria by an undetermined amount of time.
Cyanobacteria's closest living relatives are the clades known as Vampirovibrionia (formerly Melainabacteria) [3,4], followed by Sericytochromatia [5] and Margulisbacteria [6]. Described Margulisbacteria include intracellular symbionts of placozoans [7] or are found as part of a chain of complex symbiotic interactions in the gut of termites [8]. Vampirovibrionia also include gut symbionts as well as specialist predators of green algae [9]. Less is known of the Sericytochromatia [4]. Currently, no photosynthetic strains have been described in these groups of uncultured bacteria and this has led to the hypothesis that oxygenic photosynthesis arose during the time spanning the divergence of Vampirovibrionia and the MRCA of Cyanobacteria, starting from an entirely non-photosynthetic ancestral state [5,10]. Recent molecular clock studies have suggested that the span of time between the divergence of Cyanobacteria and Vampirovibrionia could be of several hundred million years [11,12]. However, it is still unclear from molecular clock analyses and the microfossil record when exactly the MRCA of Cyanobacteria occurred [13].

Evolution of photosystem II
The heart of PSII is made up of a heterodimeric reaction centre (RC) core coupled to a core antenna. The two subunits of the RC core of PSII are known as D1 and D2, and these are associated respectively with the antenna subunits known as CP43 and CP47. D1 and CP43 make up one monomeric half of the RC, and D2 and CP47, the other half. Water oxidation is catalysed by a Mn 4 CaO 5 cluster coordinated by ligands from both D1 and CP43 [14,15]. The cluster is functionally coupled to a redox active tyrosine-histidine pair (Y Z -H190) also located in D1, which relays electrons from Mn to the oxidized chlorophyll pigments of the RC during charge separation [16]. In a cycle of four consecutive light-driven charge separation events, O 2 is released in the decomposition of two water molecules.
Photosystems evolved first as homodimers [17,18]: therefore, the core and the antenna of PSII originated from ancestral gene duplication events that antedated the MRCA of Cyanobacteria. In this way, CP43/D1 retain sequence and structural identity with CP47/D2. The conserved structural and functional traits between CP43/D1 and CP47/D2 suggest that the ancestral PSII homodimer-prior to the duplication events-was not only structurally similar to heterodimeric PSII, but also that it could have split water and evolved protective mechanisms against the formation of reactive oxygen species (ROS) [18][19][20].
In a previous study, we attempted to measure the span of time between the duplication that led to D1 and D2 (dD0) and a point that approximated the MRCA of Cyanobacteria: a period of time that we called ΔT [19]. We observed that the magnitude of ΔT can be very large, well over a billion years. Such a large ΔT suggested that the origin of a water-oxidizing PSII substantially antedated the MRCA of Cyanobacteria and highlighted that the evolution of the process, prior to this ancestor, was likely marked by loss of photosynthesis and extinction events. This implies that despite their specialized heterotrophic lifestyles, an ancestral photosynthetic state at the divergence of Margulisbacteria, Sericytochromatia, and Vampirovibrionia (MSV) cannot be ruled out. However, our study neither provided an absolute age for the MRCA of Cyanobacteria nor the duplication event itself, as we simulated a comprehensive range of scenarios. Instead, we showed that even when the time-span of ΔT is over a billion years, the rate of protein evolution at the duplication point (dD0) needed to be over 40 times greater than any rate ever observed for D1 and D2, which decreased exponentially during the Archean and stabilized at current rates during the Proterozoic. Thus, the smaller the value of ΔT, the faster the rate at dD0, with the rate increasing following a power law function and reaching unrealistic values even when ΔT is still in the order of several hundred million years [19]. It was unclear then if such patterns of molecular evolution were unique to the core subunits of PSII or whether other systems have experienced similar evolutionary trajectories.
Here, to help in understanding the evolution of oxygenic photosynthesis, Cyanobacteria and MSV as a function of time, we compared the age of duplication leading to the RC antenna subunits, CP43 and CP47, with several well-defined ancient and more recent events. These include the duplication of the core catalytic subunits of ATP synthase, a very ancient event generally accepted to have occurred before the last universal common ancestor (LUCA) [21][22][23][24][25]; and the evolution of RNA polymerase catalytic subunit β (RpoB) and ribosomal proteins, which are universally conserved and widely accepted to have originated before the LUCA [26][27][28][29]. We further constrain our analysis using in silico ancestral sequence reconstruction of PSII and through strict structural and functional rationales. We show that the core subunits of PSII show patterns of molecular evolution that are usually associated with some of the oldest transitions in the evolution of life. We also show that all events leading from a primordial homodimeric photosystem to Cyanobacteria's heterodimeric PSII can be logically reconstructed.

Phylogenetic overview
The phylogenies of CP43 and CP47 show that there is a much greater diversity of CP43 and CP43-derived subunits than CP47 (Fig. 1). This difference is the result of a greater number of gene duplication in CP43 than CP47 and mirrors the evolution of D1 compared with D2 [2], in which D1 has undergone more duplication events than D2. CP43 can be divided into two major groups: those that are assembled into PSII and can bind the Mn 4 CaO 5 cluster, and those which have evolved to be used only as light harvesting complexes [30,31], known as chlorophyll binding proteins (CBP). The CBP are characterized by the loss of the extrinsic loop between the 5th and 6th transmembrane helices, where the ligands to the cluster are located (Fig. 1b). This large extrinsic loop is found in both CP43 and CP47 and interacts directly with the electron donor side of PSII, within D1 and D2 respectively. The unrooted tree of CP43 is consistent with CBP having a single origin likely occurring before the MRCA of Cyanobacteria ( Supplementary Fig. S1) but have undergone an extensive duplication-driven diversification process. It also mirrors the evolution of D1 in that duplications appear to have occurred before the MRCA of Cyanobacteria [2]. The earliest of these D1 duplications also led to variants that lack the capacity to bind the Mn 4 CaO 5 cluster [2,32], but are likely used in other supporting functions. Most notably, during chlorophyll f synthesis [33] in the far-red light acclimation response (FaRLiP) [34].
CP43 and CP47 are also distantly related to the antenna domain of cyanobacterial PSI and other type I RCs of phototrophic Chlorobia, Acidobacteriota and Heliobacteria ( Supplementary Fig. S2). We found that the level of sequence identity between any two type I RC proteins is always greater than between a type I RC protein and CP43/CP47 (Supplementary Table S1). Therefore, the distance between CP43/CP47 and other type I antenna domains is the largest distance in the molecular evolution of RC proteins after that between type I core and type II RC core domains. The phylogenetic relationships between type I and type II RCs have been reviewed in detail before [20,35,36].
The phylogenetic distribution of Alpha and Beta subunits of the Ftype ATP synthase showed that all Cyanobacteria have an F-type ATP synthase, and a fewer number of strains have an additional Na + -translocating ATPase (N-ATPase) of the bacterial F-type, as had been reported before [37]. We found that MSV have a standard F-type ATP synthase ( Supplementary Fig. S3), but some N-ATPase Alpha and Beta sequences were also found in Vampirovibrionia and Sericytochromatia datasets, but not in Margulisbacteria. In this study we focused on the standard Ftype ATP synthase of Cyanobacteria for further analysis.
We also constructed a phylogeny of bacterial RNA polymerase subunit β (RpoB) for divergence time estimation. This focused on Cyanobacteria and MSV, as well as phyla with known phototrophic representatives and included Thermotogae and Aquificae as potential outgroups ( Supplementary Fig. S4). The tree was largely consistent with previous observed relationships between the selected groups [38], within Cyanobacteria and MSV, and within other phototrophs and their non-phototrophic relatives [39,40]. The only exception was Aquificae, which branched as a sister clade to Acidobacteriota, a feature that had been reported before for RpoB [28], and likely represents an ancient horizontal gene transfer event.

Distances
To gather temporal information, we compared the phylogenetic distances between CP43 and CP47, Alpha and Beta subunits of the Fig. 1. Maximum Likelihood trees of PSII core antenna subunits and derived light harvesting proteins. a A tree of CP43 and chlorophyll binding proteins (CBP). The unrooted tree splits into CP43 and CBP, with CP43 displaying a phylogeny roughly consistent with the evolution of Cyanobacteria, although several potential duplications of CP43 are noticeable within heterocystous Cyanobacteria and closer relatives. FaRLiP variant denotes the isoform used in the far-red light acclimation response. The CBP subtree shows a complex phylogeny driven by gene duplication events and fast evolution. The tree of CP47 was rooted at the divergence of Gloeobacter spp. Scale bar represents number of substitutions per site. b Structural comparison of CP43, CP47, and IsiA. The latter lacks the characteristic extrinsic loop (orange) of CP43 and CP47 that links the antenna with the electron donor side of PSII. cyanobacterial F-type ATP synthase, and archaeal and bacterial RpoB (visualized in Fig. 2, but see also Supplementary Figs. S1b, S3 and S4b). We found that the distances between Alpha and Beta, and the divergence of archaeal and bacterial RpoB, are very large relative to the distance between the divergence of Vampirovibrionia and Cyanobacteria. In the case of RpoB, the distance between Vampirovibrionia and Cyanobacteria is about a fifth of the distance between Archaea and Bacteria. However, the distance between CP43 and CP47 (and also between D1 and D2 [19]) is of similar magnitude to that between Alpha and Beta, and to that between archaeal and bacterial RpoB, but substantially surpasses the distance between MSV and Cyanobacteria (Fig. 2). These observations suggest that ancestral proteins to CP43/CP47 and D1/D2 existed before the divergences of MSV.
We compared the within-group mean distances for Alpha, Beta, RpoB, and a concatenated dataset of ribosomal proteins compiled in a previous independent study [39] (see Supplementary Table S2). We consistently found that Vampirovibrionia and Margulisbacteria have larger within-group mean distances compared to Cyanobacteria, which suggests greater rates of evolution in the non-photosynthetic clades. These were larger for Margulisbacteria relative to the other two groups. Thus, RpoB in Vampirovibrionia and Margulisbacteria showed 1.6× and 4.0× larger corrected mean distances than Cyanobacteria, respectively (Supplementary Table S2). To put this into scale, RpoB from two strains of Margulisbacteria belonging to the genus Termititenax [8] or between two distant Gastranaerophilales (Vampirovibrionia) [41], show levels of divergence that surpass the largest distance between the two most distant extant Cyanobacteria. For example, the level of sequence identity for the two Termititenax is 78%, between two distant   [2,19]. ChlF marks the atypical D1 form involved in the synthesis of chlorophyll f during FaRLiP [33]. The duplication leading to D1 and D2 is denoted as dD0. e Schematic representation of distance and distribution of these enzymes in Cyanobacteria and relatives in relation to the MRCA of Bacteria, of Archaea, and the LUCA.
Gastranaerophilales is 70%, but for Gloeobacter spp. compared with any other cyanobacterium is about 84%. The latter likely being a much older diversification event. At the level of the concatenated ribosomal proteins dataset, Margulisbacteria showed an almost 2-fold larger within-group mean distance than Cyanobacteria. Faster rates of evolution are consistent with the trophic modes of the few characterized strains of Margulisbacteria and Vampirovibrionia, suggesting that these may be generalizable within their diversity, and resemble similar, but perhaps not as extreme, observations made for the Candidate Phyla Radiation (CPR) and DPANN Archaea [42,43]. In fact, the long branches of CPR and MSV artefactually attract each other in phylogenomic trees [39,44,45], potentially indicating convergent evolutionary processes.

Deep-time duplications
We then compared the rates of evolution of CP43 and CP47 with those of Alpha and Beta cyanobacterial F-type ATP synthase, using a Bayesian relaxed molecular clock approach with identical calibrations, molecular clock parameters, and a simplified, but highly constrained sequence dataset (see Materials and Methods and Supplementary Text S2 for an expanded rationale and technical details). The goal of these experiments is not to use the clock to time the LUCA, the origin of photosynthesis or Cyanobacteria, but to measure the rates of protein evolution that are required to model or simulate any chosen span of time between the ancestral duplications and the MRCA of Cyanobacteria. We used an autocorrelated log normal model of rate variation with a nonparametric CAT+Γ model of amino acid substitutions to extract rates of evolution. We refer to the span of time between the duplication points leading to Alpha and Beta (dAB), or to CP43 and CP47 (dCP), and the MRCA of Cyanobacteria as ΔT (schematized in Fig. 3).
In Fig. 4a-d we examine the changes in the rate of evolution under specific evolutionary scenarios. In the case of ATP synthase, we first assumed that the MRCA of Cyanobacteria occurred after the GOE to simulate scenarios similar to those presented in [11] or [46], at about 1.7 Ga; and that dAB occurred at 3.5 Ga (ΔT = 1.8 Ga). Under these conditions the average rate of evolution of Alpha and Beta is calculated to be 0.28 ± 0.06 substitutions per site per Ga (δ Ga − 1 ). We will refer to the average rate through the Proterozoic as ν min . In this scenario, the rate of evolution at the point of duplication, which we denote ν max , is 7.32 ± 1.00 δ Ga − 1 , making ν max /ν min 26. In other words, when the span of time between the ancestral pre-LUCA duplication and the MRCA of Cyanobacteria is 1.8 Ga, the rate of evolution at the point of duplication is about 26 times greater than any rate observed through the diversification of Cyanobacteria or photosynthetic eukaryotes. To place these rates in the larger context of protein evolution, we encourage the reader to refer to Supplementary Text S1. Now, if we consider a scenario in which dAB is 4.0 Ga and leaving all other constraints unchanged, ν max is 6.02 ± 0.9 δ Ga − 1 resulting in a ν max /ν min of 21. If instead we keep the duplication at 3.5 Ga but assume that the MRCA of Cyanobacteria occurred before the GOE to simulate a more conservative scenario, at 2.6 Ga (ΔT = 1.1 Ga), we obtain that ν min is consequently slower, 0.25 ± 0.06 δ Ga − 1 , when compared to a post-GOE ancestor. This older MRCA (smaller ΔT) thus leads to a rise in ν max , calculated to be 10.22 ± 1.37 δ Ga − 1 and leading to a ν max /ν min of 40. Given that the phylogenetic distance is a constant, the rate of evolution increases with a decrease in ΔT following a power law function. We had observed nearly identical evolutionary patterns for the core RC proteins D1 and D2 of PSII [19]. The change in ν max /ν min as a function of ΔT is shown in Fig. 4d. The core antennae of PSII, CP43 and CP47, showed patterns of divergence similar to those of Alpha and Beta, both in terms of ν max at the point of duplication (dCP) and ν min ( Fig. 4a and b, and Table 1).
However, CP43 and CP47 featured slightly slower rates of evolution than Alpha and Beta, which is consistent with the fact that CP43 and CP47 show less sequence divergence between orthologues at all taxonomic ranks of oxygenic phototrophs (see Supplementary Table S3). We then studied a relatively recent gene duplication event (Fig. 4c), which occurred long after the LUCA, but also after the MRCA of Cyanobacteria: that leading to Cyanobacteria-specific FtsH1 and FtsH2 [47]. This more recent duplication served as a point of comparison and control (see Fig. 3 for a scheme). In marked contrast to dAB, the rate at the point of duplication was 0.66 ± 0.21 δ Ga − 1 . We also found that FtsH1 is evolving nearly 6 times faster than FtsH2 under the assumption that MRCA of Cyanobacteria occurred at 1.7 Ga (Table 1). If the MRCA of Cyanobacteria is assumed to have occurred at 2.6 Ga, all rates slowdown proportionally. This late duplication is consistent with classical neofunctionalization, in which the gene copy that gains new function experiences an acceleration of the rate of evolution [48,49]. Like PSII and ATP synthase, the calculated rates of evolution match observed distances as estimated by the change in the level of sequence identity as a function of time, in which the fastest evolving FtsH1 accumulated Cyanobacteria inherited a standard F-type ATP synthase, with a heterohexameric catalytic head (F 1 ) made of alternating subunits Alpha and Beta; and a PSII with a heterodimeric core and antenna. ΔT denotes the span of time between the ancestral duplication events and the MRCA of Cyanobacteria. In the case of ATP synthase, this duplication is suspected to antedate the divergence of Bacteria and Archaea and their further diversification. FtsH contains an N-terminal membrane-spanning domain attached to a soluble domain consisting of an AAA (ATPase associated with diverse activities) module attached to a C-terminal protease domain. FtsH is universally conserved in Bacteria, has a hexameric structure like that of ATP synthase's catalytic head, and can be found usually as homohexamers, but also as heterohexamers. The MRCA of Cyanobacteria likely inherited three variant FtsH subunit forms, one of which appears to have duplicated after the divergence of the genus Gloeobacter and possibly other early-branching Cyanobacteria [47]. This late duplication led to FtsH1 and FtsH2, which form heterohexamers with FtsH3, following the nomenclature of Shao et al. [47]. FtsH1/FtsH3 is found in the cytoplasmic membrane of Cyanobacteria, while FtsH2/FtsH3 is involved in the degradation of PSII and other thylakoid membrane proteins.
greater sequence change than FtsH2 in the same period (Supplementary Table S3).
Given that the complex evolution of CP43 and CBP involved several major duplication events and potentially large variations in the rate of evolution ( Fig. 1 and Supplementary Fig. S1), we carried out a molecular clock analysis of a large dataset of 392 CP43 and 465 CBP proteins using cross-calibrations across paralogues as described in Supplementary Text S2. Clocks were executed with no constraint on the MRCA of Cyanobacteria. The estimated mean divergence time for the oldest node, the duplication at the origin of CBP, is 2.23 Ga (95% confidence interval, CI: 1.90-2.69 Ga) using an autocorrelated log normal model (see Fig. 4e, Supplementary Fig. S5 for a chronogram, and Supplementary Table S4 for a comparison of estimated ages under different models). The mean divergence time for the node representing the CP43 inherited by the MRCA of Cyanobacteria was 2.22 Ga (95% CI: 1.88-2.68 Ga). Thus, a span of time of only 15 Ma is measured between these two mean ages. The average rate of evolution of CP43, not including CBP sequences, was found to be 0.14 ± 0.05 δ Ga − 1 , which is in the same range as determined in the simplified, but highly constrained experiment above. We noted a 6-fold increase in the rate of evolution associated with the duplication leading to the FaRLiP-CP43 variant (Fig. 4e). This duplication led to an acceleration of the rate similar in magnitude to that of FtsH1/FtsH2 and is consistent with neofunctionalization as the photosystems evolved to use far-red light and bind chlorophyll f.

Fig. 4.
Comparison of the changes in the rates of evolution as a function of time. a Rates of CP43 and CP47 are simulated using two specific evolutionary scenarios (orange and grey traces). The orange trace was calculated under the assumption that the MRCA of Cyanobacteria occurred after the GOE, at ~1.7 Ga, while the duplication of CP43 and CP47 occurred at ~3.5 Ga. The fastest rate seen at the point of duplication is denoted as ν max , and stabilizes during the Proterozoic, ν min . The grey trace was calculated under the assumption that the MRCA of Cyanobacteria occurred before the GOE, at ~2.6 Ga. b Same calculations as a but performed on Alpha and Beta sequences of cyanobacterial and plastid ATP synthase. c Rates of evolution of cyanobacterial and plastid FtsH subunits assuming a MRCA of Cyanobacteria at ~1.7 and ~2.6 Ga. d Changes in the rates of evolution with varying ΔT for ATP synthase (orange) and PSII subunits (grey). CBP sequences, on average, display rates of evolution about three times faster than CP43 (Fig. 4e). However, the serial duplications that led to the evolution of CP43-derived light harvesting complexes resulted in accelerations in the rate of evolution of a similar magnitude as observed for dAB and dCP. The largest of these is associated with the origin of PcbC [31], a variant commonly found in heterocystous Cyanobacteria and Cyanobacteria that use alternative pigments, such as chlorophyll b, d and f, but that remains largely uncharacterized. The ancestral node of PcbC was timed at 2.07 Ga (95% CI: 1.76-2.50 Ga) with a rate of 11.7 ± 2.42 δ Ga − 1 , decelerating quickly, but stabilizing at about four times faster rates than the average rate of CP43. We find it noteworthy that the fast rates of evolution at the origin of CBP are not associated with very large spans of time between these and CP43, nor did it result in very old root node ages despite the use of very broad constraints.

Species divergence
To understand the evolution of MSV relative to Cyanobacteria we wished to apply a molecular clock to a system where the calculated rates could be compared to observed rates, as determined by distances between species of known divergence times or at similar taxonomic ranks. We found RpoB to be suitable for this because it has been inherited vertically with few instances of horizontal gene transfer and had enough signal to resolve known phylogenetic relationships between and within clades. We implemented a set of 12 calibrations across bacteria, including two calibrations on Margulisbacteria and two in Vampirovibrionia with the aim of covering both slower and faster evolving lineages. The following results are based on an autocorrelated log normal molecular clock using CAT+Γ, a root constrained with a broad interval ranging from between 4.52 and 3.41 Ga, and as described in Materials and Methods and Supplementary Text S3 (Fig. 5). We found this to perform well and provided results comparable to other independent studies that did not combine a full set of MSV sequences and other clades with phototrophs in a single tree (Table 2). Nonetheless, a pipeline of sensitivity experiments tested the dependency of these results on models and prior assumptions: these are shown and described in Supplementary Figs. S6 and S7.
The root of the tree (divergence of Thermotogae) was timed at 3.64 Ga (95% CI: 3.42-4.11 Ga) and the divergence of Cyanobacteria at 2.74 Ga (95% CI: 2.46-3.12 Ga). Thus, the span of time between the mean age of the root of the bacterial tree and the mean age of the MRCA of Cyanobacteria was calculated to be 0.89 Ga and that between Vampirovibrionia and the latter was found to be 0.44 Ga (Figs. 4f and 5). This result is consistent with previous studies using different rationales, datasets and calibrations [11,12].
We also noted an exponential decrease in the rates of evolution of RpoB through the Archean, which stabilized at current levels in the Proterozoic (Fig. 4f). The rate at the root node was calculated to be 2.37 ± 0.45 δ Ga − 1 and the average rate of evolution of RpoB during the Proterozoic was found to be 0.19 ± 0.06 δ Ga − 1 . The average rate of cyanobacterial RpoB was 0.14 ± 0.04 δ Ga − 1 ; for Margulisbacteria was 0.44 ± 0.17 δ Ga − 1 , and for Vampirovibrionia 0.19 ± 0.05 δ Ga − 1 : about 3.1× and 1.3× the mean cyanobacterial rate, respectively. These rates agree reasonably well with the observed distances (Supplementary   Table S2), further indicating that the calibrations used in these clades performed well. Nevertheless, we suspect that the values for MSV represent underestimations of the true rates of evolution (slower than they should be), as some of the clades that include symbionts still appear much older than anticipated from their hosts. This is probably due to the relaxed nature of molecular clocks, which tends to smooth rates out [50]. For example, Gastranaerophilales were timed to have originated long before the emergence of animals (95% CI: 1.91-2.74 Ga). This dichotomy in the rates was also detected for Rickettsidae (0.40 ± 0.11 δ Ga − 1 ) made up of symbionts and parasites, when compared to Caulobacteridae (0.17 ± 0.05 δ Ga − 1 ) of the Alphaproteobacteria, which included free-living and phototrophic strains. We also noted the same potential underestimation for many strains of Rickettsidae, which appeared to be much older than their symbiotic associations would suggest, despite the provided constraints.
A more complex, but commonly used model like CAT+GTR+Γ implementing a birth-death prior with 'soft bounds' on the calibrations, resulted in rates that were smoothed out, and which translated into spread-out divergence times with Margulisbacteria and Vampirovibrionia evolving at 1.9× and 0.7× times the cyanobacterial rates, respectively ( Supplementary Fig. S7, model n to p). These rate effects are thus translated into a very late Mesoproterozoic age for Cyanobacteria, and a relatively older divergence time for Vampirovibrionia: results that replicate those presented in ref. [46].
To investigate the effect of the age of the root on the divergence time of MSV and Cyanobacteria we also varied the root prior from 3.2 to 4.4 Ga ( Supplementary Fig. S6). We noted that regardless of the time of origin of Bacteria (approximated by the divergence of Thermotogae in our analysis), a substantially faster rate is required during the earliest diversification events, decreasing through the Archean and stabilizing in the Proterozoic. This matches well the patterns of evolution of ATP synthase and PSII core subunits as shown in the previous section.
We then compared the above RpoB molecular clock result with a different clock that included a set of 112 diverse sequences from Archaea, in addition to the sequences from Thermotogae, MSV, and Cyanobacteria, but removing all other bacterial phyla (Fig. 6a). We found that the calculated average rate of evolution of bacterial RpoB during the Proterozoic was slower (0.09 ± 0.03 δ Ga − 1 ) than in the absence of archaeal sequences (0.19 ± 0.06 δ Ga − 1 ), resulting in overall older mean ages (see Fig. 4g). However, the rate at the Bacteria/Archaea divergence point was 6.87 ± 1.17 δ Ga − 1 , similar to the rate for dAB and dCP, requiring therefore an exponential decrease in the rates similar to that observed for ATP synthase and the PSII core subunits. Notably, this rate and exponential decrease is associated with a large span of time between the mean estimated ages of the LUCA and the MRCA of Cyanobacteria, calculated in this tree to be 1.21 Ga. The span between Vampirovibrionia and Cyanobacteria was found to be 0.46 Ga (Fig. 6a). Fig. 5 also highlights that the none of the most recent common ancestors of the groups containing anoxygenic phototrophs nor their divergence from their closest non-phototrophic relatives appears to be older than one of the oldest and best accepted geochemical pieces of evidence for photosynthesis at 3.41 Ga [51] (Table 2). For example, the MRCA of Heliobacteria or of phototrophic Chlorobia, containing homodimeric type I RCs, are likely to have existed only after the GOE, even allowing for large uncertainties in the calculations. These clades Table 1 Rates of evolution.  Table 2 and Fig. 6), a similar exponential decrease in the rates was observed with a rate at the Bacteria/Archaea divergence point measured at 4.62 ± 0.71 δ Ga − 1 and an average rate of bacterial ribosomal protein evolution of 0.30 ± 0.07 δ Ga − 1 (Fig. 4h). These rates were associated with a span of time between the mean estimated ages of the LUCA and the MRCA of Cyanobacteria of 1.45 Ga; and between Vampirovibrionia and Cyanobacteria 0.54 Ga (Fig. 6b).

Structural analysis
A fundamental premise of our investigation is that water oxidation started before the duplication of D1 and D2, and CP43 and CP47. The rationale behind this premise has been laid out before [18,52], and more extensively recently [19]. If this rationale is correct, it raises the question of how the CP47/D2 side of the RC lost its capacity to carry out water-splitting catalysis.  Table 2.

Ancestral sequence reconstruction
To gain further insight on the nature of the structural site around the water-oxidizing complex in the ancestral photosystem, we used ancestral sequence reconstruction to predict the most probable ancestral states. We will refer to the ancestral protein to D1 and D2 as D0 (Supplementary Fig. S8). We generated 14 predicted D0 sequences using a combination of three ASR methods and amino acid substitution models. On average the 14 D0 sequences had 87.12 ± 0.55% sequence identity indicating that the different algorithms provided largely consistent results. While the regions that include all transmembrane helices are aligned unambiguously, the N-terminal and C-terminal ends were aligned less confidently due to greater sequence variability at both ends. Nonetheless, we found that the predicted D0 sequences retain more identity with D1 than D2 along the entire sequence. The average level of sequence identity of D0 sequences compared with the standard D1 (PsbA1) of Thermosynechococcus vulcanus was found to be 69.58 ± 0.55%, and for D2 it was 36.32 ± 0.15% from the same organism.
The ligands to the Mn 4 CaO 5 in PSII are provided from three different structural domains (Supplementary Fig. S9): 1) the D1 ligands D170 and E189, located near the redox Y Z . 2) The D1 ligands H332, E333, D342, and A344 located at the C-terminus. 3) The CP43 ligands, E354 and R357, located in the extrinsic loop between the 5th and 6th helices, with the latter residue less than 4 Å from the Ca atom. Remarkably, there is structural and sequence evidence supporting the loss of ligands in these three different regions of CP47/D2.
In all the D0 sequences, at position D1-170 and 189, located in the unambiguously aligned region, the calculated most likely ancestral states were E170 and E189, respectively. The mutation D170E results in a PSII phenotype with activity similar to that of the wild-type [53].  Table S5. In contrast, D2 has strictly conserved phenylalanine residues at these positions, but the PP of phenylalanine being found at either of these positions was less than about 5% for all predicted D0 sequences. As a comparison, the redox active tyrosine residues Y Z (D1-Y161) and Y D (D2-Y160), which are strictly conserved between D1 and D2, have a predicted average PPs of 68.8% (FastML), 98.8% (MEGA) and 98.6% (PAML). Therefore, the ligands to the catalytic site in the ancestral protein leading to D2 were likely lost by direct substitutions to phenylalanine residues, while retaining the redox active D2-Y160 (Y D ) and H189 pair ( Supplementary Fig. S9).

A structural rearrangement
Prompted by the finding of a Ca-binding site at the electron donor site of the homodimeric type I RC of Heliobacteria (Firmicutes) with several similarities to the modern Mn 4 CaO 5 cluster of PSII, including a link to the antenna domain and the C-terminus [20], we revisited the sequences and structural overlaps of CP43 and CP47. We found that a previously unnoticed structural rearrangement within the extrinsic loop occurred in one subunit relative to the other (marked EF 3 and EF 4 in Fig. 7, Supplementary Figs. S11 and S13). CP43 retains the simplest domain, being about 60 residues shorter than CP47. If CP43 retains the ancestral fold, the additional sequence in CP47's swapped domain (EF 4 in Fig. 7d) would have contributed to the loss a catalytic cluster as it inserted one phenylalanine residue (CP47-F360) into the electron donor site, less than 4 Å from Y D . An equivalent residue does not exist in CP43.
We then noted that in the swapped region (EF 3 in Fig. 7d), sequence identity is retained between CP43 and CP47 ( Supplementary Fig. S11). We found that CP43-E354 and R357 are equivalent to CP47-E435 and N438 in the T. vulcanus structure. An inspection of this structure showed that these two residues specifically bind a Ca atom of unknown function a Data from TimeTree compiling estimated divergence times from independent studies as described in ref. [140]. For the divergence of Thermotogae the estimated age was taken from eight different studies (n = 8). For the divergence of Melainabacteria (or divergence from Cyanobacteria's closest relatives), n = 10. For stem Heliobacteria, n = 2. For stem phototrophic Chloroflexota and Chlorobia, the values were reported from a single study, Marin et al. [65]. For earliest potential phototrophic Proteobacteria, n = 3. This latter was taken as the node made by the clades including Niveispirillum lacus and Rhodobacter sphaeroides to match our RpoB tree. The independent studies used by TimeTree to generate the estimated ages are listed in Supplementary Table S8. b Data taken from ref. [11] Model T68 for the cyanobacterial dates. That is, no GOE calibration with a calibration on Bangiomorpha. The dates on Chloroflexota were taken from ref. [141]. c Data taken from ref. [12] Model A. This used a 1.2 Ga calibration on heterocystous Cyanobacteria.
( Fig. 7c and d). The presence of an equivalent glutamate to CP43-E354 in CP47 is consistent with this being already present before duplication and therefore, at an early stage of photosystem evolution.

Gene overlap
Finally, a peculiar but well-known trait conserved across Cyanobacteria and photosynthetic eukaryotes is that the 5 ′ end of the psbC gene (CP43) overlaps with the 3 ′ end of the psbD gene (D2) usually over 16 bp (Supplementary Table S6). The psbD gene contains a well-defined Shine-Dalgarno ribosomal binding site downstream of the psbC gene and over the coded D2 C-terminal sequence [54][55][56]. The evolution of this unique gene overlap has no current explanation in the literature, but its origin could have disrupted the C-terminal ligands of the ancestral protein leading to the modern D2, an event that would have contributed to heterodimerization.

Origin of oxygenic photosynthesis and rates of evolution
This work was not explicitly aimed at providing an exact time of origin for the LUCA, the MRCA of Cyanobacteria, or the origin of photosynthesis. The use of different datasets, calibration strategies, chosen clock parameters, and researchers' preferences can result in widely varying calculated rates, which can translate into disparate estimated ages, see for example [46,57]. Instead, rather than arguing in favour or against a particular age or methodology, our objective was to contrast the evolution of oxygenic photosynthesis with the evolution of enzymes whose origin is less controversial. The duplication of ATP synthase's ancestral catalytic subunit, and the archaeal/bacterial divergence of RNA polymerase and the ribosome, are some of the oldest evolutionary events known in biology [21][22][23][24][25][26][27][28][29]. When taken on their own, their phylogenies are often interpreted as evidence of the earliest origin. We have shown here and in our previous work [19] that PSII has patterns of molecular evolution that closely parallel those of these most ancient systems. These patterns do not emerge from the application of any particular molecular clock approach or model of rate change, but from the inherently long phylogenetic distance that separates Alpha and Beta, archaeal and bacterial RNA polymerase and ribosomes, or the core subunits of PSII, in combination with the slow rates of evolution that these enzymes have featured through their multibillion-year diversification process. These results have profound implications because it cannot now be taken for granted that there was ever a long period of time between the origin of life and the origin of anoxygenic photosynthesis, followed by another long period of time between the origin of anoxygenic photosynthesis and the origin of oxygenic photosynthesis.
The exponential decrease in the rates of evolution observed in the studied systems, even when the span of time between the ancestral duplications (or the LUCA) and the MRCA of Cyanobacteria was considered to be well over a billion years, is consistent with our assessment that a large ΔT for the evolution of the core subunits of ATP synthase and PSII cannot be explained entirely by duplication-driven effects. Instead, we speculate that these effects are the result of the faster rates of evolution that are expected to have occurred during the earliest history of life [58][59][60][61][62][63]. It is worth highlighting that ATP synthase, RNA polymerase, the ribosome, and the photosystems, are all complex molecular machines, with crucial functions and under strict regulation. These features largely explain their slow rates of evolution and the high degree of sequence (structural and functional) conservation through geological time. A similar level of complexity, however, can be traced back to the last common ancestor of each one of these molecular systems regardless of how ancient they truly are. Now, given that the Only the first six transmembrane helices of the antenna are shown for clarity. A Ca at the electron donor side is bound from an extrinsic loop between the 5th (E) and 6th (F) helices. This extrinsic loop, EF 1 (blue), is made of two small alpha helices. The fourth molecular view furthest to the right shows the link between the electron donor site and EF 1 in closer detail. b The CP43 subunit of PSII with the extrinsic loop shown in colours. c The CP47 subunit of PSII. Immediately after the 5th helix (E), a long alpha helix protrudes outside the membrane in both CP43 and CP47 and showing structural and sequence identity (orange). We denote this helix EF 2 . After EF 2 structural differences are noticed between CP43 and CP47 as schematized in panel d. In CP43, after helix EF 2 a loop is found (shown in red ribbons), which we denote EF 3 . This contains the residues that bind the Mn 4 CaO 5 cluster and it is followed by a domain that resembles EF 1 in the HbRC at a structural level. In CP47, EF 3 and EF 1 retain sequence identity with the respective regions in CP43. CP47 has additional sequence that is not found in CP43 (EF 4 , purple). The green arrows mark the position at which the domain swap occurred in CP43 relative to CP47. We found that the CP43-E354 and R357 are found in the equivalent domain in CP47 as E436 and N438 coordinating a Ca atom. N438 (EF 3 ) links to EF 1 via K332. It is unclear if the EF 1 region in the HbRC is strictly homologous to that in CP43 and CP47 as very little sequence identity is found between the two: however, a couple of conserved residues between all EF 1 may suggest it emerged from structural domains present in the ancestral RC protein (see Supplementary Fig. S12).
rates of evolution of the core of PSII have remained slower than those of ATP synthase for billions of years, it could imply that the duplications of the core of PSII occurred before the duplication leading to Alpha and Beta. In consequence, to argue that oxygenic photosynthesis is a late evolutionary innovation relative to the origin of life, one must first demonstrate that the rates of protein evolution of PSII and ATP synthase catalytic core subunits, RNA polymerase, and the ribosome, have not remained proportional to each other during the Archean. That is to suggest that PSII-exclusively-experienced unprecedentedly faster rates of evolution during the heterodimerization process, and then, the rates stopped abruptly with the MRCA of Cyanobacteria. One could potentially argue that the origin of water oxidation itself could account for unprecedentedly faster rates. However, a decrease in ΔT of about 60% (from 1.1 to 0.46 Ga, for example) would result in a 30-fold additional increase in the rates at the earliest stages of diversification, resulting in photosystems that would be evolving at rates that surpass those of the fastest evolving peptide toxins (Fig. 4d and Supplementary Text S1). Furthermore, it should be noted that a ΔT of over a billion years already includes a period of extremely rapid evolution at the origin of all these ancient systems.  Firmicutes). The RC protein is made up of 11 transmembrane helices. The first six N-terminal helices are traditionally considered as the antenna domain (orange ribbons), while the last five helices are the RC core domain (grey and light-blue ribbons). Below the structure, the organization of the 11 helices is laid down linearly for guidance: 1 N denotes the first N-terminal helix and 11 C the last C-terminal helix. P denotes the "special pair" pigment; M, the "monomeric" bacteriochlorophyll electron donor; and A, the primary electron acceptor. F X is the Fe 4 S 4 cluster characteristic of type I RCs. Antenna pigments bound by the antenna domain are shown in green lines. Antenna pigments bound by the 7th and 8th helices are shown as purple sticks, except for the bacteriochlorophyll molecule denoted as Z, in olive green and bound by the 8th helix. The antenna domain connects the electron donor side of the RC via N263 and a Ca-binding site. b A monomeric unit of PSI (PsaB, Cyanobacteria) following a similar structural organization and nomenclature as in panel a. Unlike the HbRC, PSI binds quinones (Q) as intermediary electron acceptors between A and F X . c A monomeric unit of PSII (CP43/D1 and CP47/D2 in the inset). It has a structural organization similar to that of type I RCs. However, the monomeric unit is split into two proteins after the 6th transmembrane helix. The Mn 4 CaO 5 cluster is coordinated by D1 but is directly connected to the antenna domain via E354 and R357 in manner that resembles the Ca-binding site found at the electron donor site of the HbRC. d A monomeric unit of an anoxygenic type II reaction (PbRC, Proteobacteria). Unlike type I RCs, type II RCs lack F X . Instead, a non-heme Fe 2+ is found linking the RC proteins. The PbRC lacks antenna domain and the antenna pigment Z bound at the equivalent 8th helix.

Diversification of bacteria
It has been postulated before that the diversification of the major phyla of Bacteria occurred very rapidly, starting at about 3.4 Ga and peaking at about 3.2 Ga, in what has been dubbed the Archean Genetic Expansion [64], but see also [46,65]. A recent study suggested that the major groups of Bacteria and Archaea diversified rapidly after the LUCA and hypothesized that the long distance between archaeal and bacterial ribosomal proteins could be attributed to fast rates of ribosome evolution during their divergence [66], but see also [67,68]. Our analyses suggest that the more explosive the diversification of prokaryotes, the greater the chance that photosynthetic water-splitting is an ancestral trait of life. Yet, if phototrophic communities-whether oxygenic or not-already existed 3.2 to 3.4 Ga ago, as it is supported by the geochemical record [51,69], then the earliest bacteria were likely photosynthetic. This is consistent with the evolution of RC proteins, which suggests that the structural and functional specialization that led to the two photosystem types, antedated the diversification processes leading to the known phyla containing photosynthetic bacteria [35]. It also explains why none of the groups of extant photosynthetic bacteria appear to be older than the earliest geochemical evidence for photosynthesis or hardly older than the GOE. The presented data is also consistent with recent studies of the oxygenation of the planet, which suggests that even if photosynthetic O 2 evolution started as early as the oldest rocks, the properties of early Earth biogeochemistry would have maintained very low concentrations of O 2 over the Archean [70][71][72].

Photosystem structural constraints
We have shown that the phylogenetic distance between CP43/CP47 and other type I RC proteins is the largest distance after that between type I and type II RC proteins (Supplementary Fig. S2). The antenna domain extends to the 8th helix, both in type I RCs and in PSII, with the latter retaining one antenna chlorophyll in the equivalent 8th helix (marked Z in Fig. 8) and substantial sequence identity around this chlorophyll's binding site, as shown before [35,73]. These conserved features indicate that at no time during the evolution of PSII was it devoid of core antenna proteins. In other words, CP43/CP47 are a descendant of the ancestral core antenna of type II photosystems. Both structural and phylogenetic evidence is in agreement and indicates that the ancestral type II RC, at the dawn of photosynthesis, was architecturally like water-splitting PSII.
Sequence reconstruction of the ancestral subunit to D1 and D2 is consistent with the existence of a highly oxidizing homodimeric photosystem, though transient and short-lived [19], that could split water in either side of the RC [52]. The structural comparisons between CP43/D1 and CP47/D2 suggest a mechanism for heterodimerization and loss of catalysis on one side that accounts for all ligands. These include direct amino acid substitutions, the loop swap within the extrinsic region of the ancestral protein to CP47, and the gene overlap between psbC (CP43) and psbD (D2). It would be difficult to reconcile these unique structural and genetic features with a scenario in which PSII evolved water oxidation at a late stage, or starting as a purple anoxygenic type II RC, once the heterodimerization process was well underway or completed, or in the absence of core antenna domains, given that these interact directly with the electron donor side of PSII, in a manner strikingly similar to that of the Ca-site in the homodimeric type I RC of the Firmicutes [20]. A conserved Ca-binding site has also been recently confirmed for the RC of the Chlorobia [74], as predicted by Cardona and Rutherford [20]. What is more, these structural and functional features, together with the ASR analysis ( Supplementary  Fig. S8 and S9) suggest that the atypical forms of D1, lacking ligands to the Mn 4 CaO 5 cluster, such as the chlorophyll f synthase [33,75], or the "rogue" or "sentinel" D1 [32,76], diverged from forms of D1 that were able to support water oxidation, but that could antedate the MRCA of Cyanobacteria [2].
We have calculated that (oxygenic) PSII has experienced the slowest rates of evolution of all photosystems, with the core of the anoxygenic type II RC evolving on average about five times faster than the core of PSII [19]. That faster rate has led to conspicuous structural changes of the anoxygenic RC relative to PSII and type I photosystems. The outcome of these differences in the rates are not only visually apparent (Fig. 8), but also the anoxygenic type II RC among all photosystems has experienced the greatest erosion of sequence and structural symmetry and even complete loss of structural domains [20,77]. These differences in the rates and associated structural differences appear inconsistent with a scenario in which PSII experienced unprecedentedly fast rates of evolution. Counterintuitively, because PSII is the slowest evolving of all photosystems and among the slowest evolving enzymes known; and because that has been the case for a couple of billions of years, if not much longer, it follows then that PSII is the most likely RC to have changed the least since the origin of photosynthesis.

Origin of homodimeric water-splitting photosystems
Compelling scenarios that discuss the origin of water oxidation propose Mn-oxidizing photosynthesis as a key transitional stage [18,[78][79][80][81][82]. Few of these explicitly state whether this transition should have occurred before or after the duplications that led to CP43/D1 and CP47/D2. Cardona et al. [2] correlated the phylogenetic branching pattern of the D1 tree to transitional stages from a Mn II -oxidizing photosystem towards the emergence of the full Mn 4 CaO 5 cluster. This correlation was later revisited based on the extended analysis presented in ref. [19], in line with previous suggestions [52], and it is indeed not now supported by ASR analysis as further indicated in this work.
In another scenario, Fischer et al. [78] hypothesized that a Mn IIoxidizing photosystem occurred at the homodimeric stage and considered that the D1 and D2 duplication enabled water oxidation. Fischer et al. [78] also speculated using geochemical evidence from Mn deposits in the ~2.41 Ga Koegas Subgroup of the Kaapvaal Craton of South Africa [83], that the duplication occurred just before the GOE [78]. The data we present here can be consistent with a Mn-oxidizing transitional stage, but does not support an early Paleoproterozoic ancestral photosystem core duplication event. An alternative speculative scenario that could reconcile the Koegas Subgroup deposits with phylogenetics and other geochemical data [84], is that Mn oxidation in this locality was catalysed by a PSII using an atypical D1 form similar or ancestral to the rogue D1 or the ChlF proteins as recently suggested by Chernev et al. [82], but derived from water-splitting variants. It is also plausible that this occurred as a transitional stage towards the emergence of heterotrophy in cyanobacterial ancestors-even perhaps in any of the direct ancestors of MSV-under the selection pressure of other competing lineages of photosynthetic Cyanobacteria, linked to ecological changes occurring around the GOE [85,86].
The data presented here and in [19] suggest that water oxidation antedated the core duplications. We have taken this evidence before at face value to envision a water-splitting homodimeric photosystem that could perform catalysis on both branches of the RC. Such a system, however short-lived or inefficient, raises questions about the actual mechanism of catalysis [52]. For example, Fischer et al. considered the idea of a homodimeric water-splitting photosystem "highly unlikely" [87], because it would "require the coordination of eight electron transfers". This thinking implicitly assumes that water oxidation can only occur through a mechanism that is identical to that of PSII, extrapolated to both sides of the RC, at the same time. This is but one of several scenarios that are consistent with water oxidation prior to core duplication. For example, partial oxidation of water to hydrogen peroxide is a plausible evolutionary transition before the full catalytic cycle of PSII was completed. Hydrogen peroxide release from partially oxidized water can occur in the lower oxidized states of the catalytic cycle in centres that have been perturbed by depletion of Ca 2+ or Cl − [88,89], removal of the extrinsic polypeptides [90], or by heating a sample [91]. Alternatively, a homodimeric system does not necessarily imply entirely homodimeric function. Like PSII, oxidation of Mn on one side of the RC could create asymmetric electrostatic effects on the photochemical pigments, P [92], and the quinone acceptor [93] (Fig. 8).
In addition, the rate of Y Z oxidation by P + accelerates by two orders of magnitude in the presence of an assembled cluster [94]. This acquired asymmetry in a homodimeric photosystem upon Mn oxidation could shift redox equilibria increasing the probability of further oxidation events on one side relative to the other. It is conceivable that a catalytic site could turn over before the other is fully assembled, with the unassembled site assuming a supporting role similar to Y D on D2 [95,96].

Duplications
Before natural selection or drift can drive the heterodimerization of the photosystem, a gene duplication must occur first. The attention can thus be shifted away from asking if the duplication occurred before or after the origin of water oxidation, to asking why the RC gene duplicated to begin with. This is crucial because within the described diversity of photosynthetic organisms, all known RC gene duplications have occurred exclusively in the context of oxygenic photosynthesis. In particular, but not exclusively, duplications of the subunits that allow catalysis, D1 and CP43. In other words, and to the best of our knowledge, there are no known gene duplication events of PufL, PufM, PshA or any version of PscA described for any known anoxygenic phototroph (Fig. 9). Therefore, based on available observations, the only driving force known to lead to RC gene duplications is water oxidation to oxygen, and the subsequent need for sustained repair and protection against ROS-mediated damage. Duplicated subunits can then drift away from their main functions and become the canvas of novel adaptations in a process that has never stopped [2]. This is spectacularly exemplified in the evolution of FaRLiP [97], as chlorophyll f synthesis was likely enabled by a drifting D1 variant accumulating just two amino acid substitutions at the interface with CP43 [98].
Another conspicuous duplication is that leading to the heterodimeric core of PSI, which has been discussed to have occurred as an adaptation to oxygenic photosynthesis [99][100][101][102]. In fact, Orf et al. [103] recently suggested that PSI started to accumulate adaptations to avoid singlet oxygen formation even before the duplication of the core. Furthermore, we have recently shown that the duplication leading to L and M, likely post-dated the evolution of oxygenic photosynthesis [19]. We have also discussed how the heterodimerization of the anoxygenic type II RC core could have occurred under pressure to minimize ROS-mediated damage [19].
The only other discussed, yet more ambiguous duplication is that which gave rise to type I and type II RC proteins, unanimously and implicitly assumed to have occurred before the origin of water oxidation [18,[104][105][106][107]. While we find compelling some of the rationales supporting a duplication at the origin of type I and II RCs, Cardona [77] raised issues with the implicit assumption because, like Mn-oxidizing photosynthesis without water oxidation, it invokes a transitional form of photosynthesis for which no evidence has been found within the known diversity of anoxygenic phototrophs. In this case, the emergence of anoxygenic photosynthesis using both RC types in series. Both phylogenetic and structural evidence converge towards a scenario in which photosynthetic water-splitting started at an early stage during the evolution of life and long before the rise of what we understand today as Cyanobacteria. We have shown evidence suggesting that the ancestral type II RC was similar to PSII, which is consistent with the hypothesis that the type I and type II split, likely enabled by a duplication event, indeed occurred in the context of the establishment of oxygenic photosynthesis [73,77], as illustrated in Fig. 9. We argue that this perspective offers more explanatory power than conventional scenarios. Not only it explains the structural and functional characteristics of PSII, but it also explains the unique coordination sphere of the Mn 4 CaO 5 cluster in a manner consistent with phylogenetic relationships. Distinctly, this perspective does not require the involvement of speculative forms of photosynthesis. More importantly yet, it also provides a mechanism for the generation of genetic diversity via serial gene duplications from where photosystem innovation can occur, and, for which ample evidence exists within the genomes of most Cyanobacteria today.
We think it is plausible that there was never a discrete origin of photosynthesis, but that the process may trace back to abiogenic photochemical reactions, some of which may have resulted in the oxidation of water, and at the interface of nascent membranes, membrane proteins, photoactive tetrapyrroles and other inorganic cofactors: much in the same way that ribosomes may have originated at the interface of nascent genetics and protein synthesis [108]. A photosynthetic origin of life is not a new idea [109][110][111] and abiotic photosynthesis-like chemistry has been recently proposed to have occurred at Gale Crater on Mars [112], and to occur even on Earth [113].

Sequence alignments and phylogenetic analysis
The first dataset was retrieved on the 31st of October 2017 to initiate this project. It included a total of 1389 type I RC, CP43 and CP47 protein sequences from the NCBI refseq database using BLAST. This dataset did not contain CBP proteins. A second dataset was retrieved on the 10th of October 2018 that included 392 CP43, 465 CBP proteins, and 375 CP47 subunits. 40 CP43 and 40 CP47 sequences from a diverse set of photosynthetic eukaryotes were selected manually and included. In addition, a selection of cyanobacterial and plastid CP43, CP47, AtpA (Alpha), AtpB (Beta), and FtsH were manually selected for an analysis of the rates of evolution as described below and in Supplementary Text S2.
A dataset of Alpha and Beta subunits belonging to cyanobacterial Ftype ATP synthase were retrieved from the NCBI refseq on the 31st of August 2019. 507 and 529 cyanobacterial Alpha and Beta sequences were obtained respectively. We retrieved Alpha and Beta homologous for Margulisbacteria (19 and 18 sequences respectively), Sericytochromatia (4 and 2), and Vampirovibrionia (66 and 49) using AnnoTree [114] and searching with the KEGG codes K02111 (F-type Alpha) and K02112 (F-type Beta). A total of 111 AtpA (subunit A) and 176 AtpB (subunit B) from archaeal V-type ATP synthase were retrieved using the sequences from Methanocaldococcus jannaschii as BLAST queries.
A second dataset of RpoB subunits containing only sequences from Cyanobacteria and MSV, in addition to sequences from Archaea was also used. To retrieve the archaeal sequences the subunit B ′ (RpoB1) of Methanocaldococcus jannaschii was used as a BLAST query. A total 213 sequences across the entire diversity of Archaea were obtained. Two version of these were observed, those with split B (RpoB1 and RpoB2) and those with full-length B subunits: from the latter, 112 full-length B subunits were selected for molecular clock analysis. Finally, we also Fig. 9. Schematic representation of the evolution of reaction centre proteins as a function of time. The section of the tree that focuses on PSII and PSI shows the RC gene content of the cyanobacterium Chroococcidiopsis thermalis PCC 7203. These are the product of several ancient and more recent gene duplications starting before the MRCA of Cyanobacteria (blue dotes at node positions). It has been suggested that the type I and II divergence was driven by a gene duplication event. The tree highlights how it is possible that oxygenic photosynthesis started deeper within the evolution of prokaryotes than currently understood (ΔT). We argue that the origin of water oxidation started prior to the CP43/D1 and CP47/D2 duplication and that the earliest type II photosystems were structurally, and potentially also functionally, like water-splitting PSII. Divergence times are meant to point out the relative occurrence of events, rather than exact ages, as described in the text, in ref. [19] for type II RC proteins and in [102] for type I RC proteins. The vertical transparent bars across nodes of interests highlight the uncertainty on the estimation of divergence times. prepared a dataset of concatenated ribosomal proteins containing 2596 aligned sites. This dataset was obtained from an independent study [39] and we selected a total of 157 sequences: 56 cyanobacterial, 14 sequences from Vampirovibrionia, 9 from Margulisbacteria, 9 from Thermotogae and 78 from Archaea.
Sequences were aligned with Clustal Omega using 5 combined guided trees and Hidden Markov Model Iterations [116]. Given that RpoB sequences are known to contain many clade specific indels [28], we further processed this particular dataset using Gblocks [117] to remove these indels, allowing smaller final blocks, gap positions within the final blocks, and less strict flanking positions. This procedure left a total of 903 well-aligned sites for the dataset with only bacterial sequences and 788 for the dataset with archaeal sequences. The dataset of concatenated ribosomal proteins was made available by the authors aligned: after curating for unused taxa, remaining gap-only sites were removed without further processing.
Maximum Likelihood phylogenetic analysis was performed with the PhyML online service using the Smart Model Selection mode implementing the Bayesian Information Criterion for parameter selection [118,119]. Tree searching operations were calculated with the Nearest Neighbour Interchange model. Support was computed using the average likelihood ratio test [120]. Trees were visualized with Dendroscope V3.5.9 [121].

Distance estimation
Distance estimation was performed as a straightforward and intuitive approach to detect possible variations in the rate of sequence change. Distance trees were plotted using BioNJ [122] as implemented in Seaview V4.7 [123] using observed distances and 100 bootstrap replicates. Within-group mean distances were calculated for RpoB, Alpha, Beta, and the ribosomal dataset using three different substitution models as implemented in the package MEGA-X [124]: no. of differences, a Poisson model, and a JTT model. A gamma distribution to model rates among sites was used with a parameter of 1.00 and 500 bootstrap replicates.
To compare the level of sequence identity between RC proteins, two datasets of 10 random amino acid sequences were generated using the Sequence Manipulation Suit [125]. The datasets contained sequences of 350 and 750 residues. These were independently aligned as described above, resulting in 45 pairwise sequence identity comparisons for each dataset. These random sequence datasets were used as a rough minimum threshold of identity. Alignments of RC proteins were generated using three representative sequences spanning known diversity. Cyanobacterial CP43, CP47, standard D1 and D2 sequences were from Gloeobacter violaceus, Stanieria cyanosphaera, and Nostoc sp. PCC 7120; Heliobacterial PshA from Heliobacterium modesticaldum, Heliobacillus mobilis, and Heliorestis convoluta; PscA from Chlorobium tepidum, Prosthecochloris aestuarii, Chlorobium sp. GBchlB, Chloracidobacterium thermophilum and Chloracidobacterium sp. CP2-5A; Proteobacterial L and M from Methylobacterium sp. 88A, Roseivivax halotolerans, and Blastochloris sulfoviridis; and L and M from Chloroflexus sp. Y-400-fl, Roseiflexus castenholzii, and Oscillochloris trichoides.

Rates and molecular clocks
To measure the rates of amino acid substitution of PSII, ATP synthase and FtsH subunits under varying evolutionary scenarios we followed a methodology modified from Cardona et al. [19] and as detailed in Supplementary Text S2 Quantification of rates of evolution, Rationale and Method. For this experiment, all used rates of amino acid substitutions and divergence times were calculated with Phylobayes 3.3. We applied a log normal autocorrelated model under the CAT+Γ non-parametric model of amino acid substitution and a uniform distribution of equilibrium frequencies. We preferred a CAT model instead of a CAT+GTR as the latter only outperforms the former on sequence alignments of over 1000 sites and it is also much less computationally expensive [126].
Four discrete categories for the gamma distribution were used and four chains were executed in parallel through all experiments carried out in this study. The rates of evolution were retrieved from the output files of Phylobayes and expressed as amino acid substitutions per site per Ga. These rates are calculated by the software as described by the developers elsewhere [127,128].
To have a closer understanding of the spans of time between Cyanobacteria and MSV and their associated rates of evolution under different evolutionary scenarios, we applied a molecular clock to the phylogeny of RpoB sequences described above. We implemented 12 calibrations and several root priors, which are described and illustrated in Supplementary Text S3 Calibrating the RpoB and ribosomal proteins phylogenies. Molecular clocks were calculated as described above. We also compared a range of models and parameters including an uncorrelated gamma model of rates of amino acid substitution, +GTR, and the application of birth-death priors combined with soft bounds. When soft bounds were used a 2.5% tail probability was allowed to fall outside the minimum and maximum boundary, or 5% in the case of a single boundary. The full set of comparisons is presented in Supplementary  Figs. S6 and S7.

Ancestral sequence reconstruction
Ancestral sequence reconstruction (ASR) of D1 and D2 amino acid sequences was carried out with a dataset collected on the 17th of November 2017. Duplicates and partial sequences were removed, leaving 755 D1 and 248 D2 sequences. CD-HIT [129] was used to remove sequences with greater than 92% sequence identity to create a representative sample. The L and M RC subunits from 5 strains of Proteobacteria were used as outgroup. The final alignment did not include the atypical variant D1 sequence from Gloeobacter kilaueensis (NCBI accession AGY58976.1) as this showed an unstable phylogenetic position in this dataset. Maximum likelihood trees used as input for ASR were computed with PhyML using Smart Model Selection [119]. The LG substitution model with observed amino acid frequencies and four gamma rate categories exhibited the best log likelihood (LG+Γ+F) (Supplementary Table S7). We used the top four models for tree reconstruction (LG+Γ+F, LG+Γ+I+F, LG+Γ and LG+Γ+I) in addition to another tree computed using the WAG substitution model with observed amino acid frequencies and four gamma rate categories (WAG+Γ+F). These trees were used as input trees to calculate maximum likelihood ancestral states at each site for the node corresponding to the homodimeric reaction protein, D0. Three ASR programs were used for the reconstructions: FastML [130], Lazarus [131] (a set of Python scripts which wraps PAML [132]) and MEGA-7 [133]. The substitution model used by all three programs corresponded to the substitution model used by PhyML to generate the specific input tree. Branch lengths of the trees were fixed as these were already estimated for the input trees. In FastML, a maximum likelihood method of indel reconstruction using a probability cut-off of 0.7 was used. In MEGA-7, all sites were used for analysis with no branch swap filter. In Lazarus, the '-gapcorrect' option was used to parsimoniously place indels.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.