Evolutionary Radiation Pattern of Novel Protein Phosphatases Revealed by Analysis of Protein Data from the Completely Sequenced Genomes of Humans, Green Algae, and Higher Plants

In addition to the major serine/threonine-specific phosphoprotein phosphatase, Mg²⁺-dependent phosphoprotein phosphatase, and protein tyrosine phosphatase families, there are novel protein phosphatases, including enzymes with aspartic acid-based catalysis and subfamilies of protein tyrosine phosphatases, whose evolutionary history and representation in plants is poorly characterized. We have searched the protein data sets encoded by the well-finished nuclear genomes of the higher plants Arabidopsis (Arabidopsis thaliana) and Oryza sativa, and the latest draft data sets from the tree Populus trichocarpa and the green algae Chlamydomonas reinhardtii and Ostreococcus tauri, for homologs to several classes of novel protein phosphatases. The Arabidopsis proteins, in combination with previously published data, provide a complete inventory of known types of protein phosphatases in this organism. Phylogenetic analysis of these proteins reveals a pattern of evolution where a diverse set of protein phosphatases was present early in the history of eukaryotes, and the division of plant and animal evolution resulted in two distinct sets of protein phosphatases. The green algae occupy an intermediate position, and show similarity to both plants and animals, depending on the protein. Of specific interest are the lack of cell division cycle (CDC) phosphatases CDC25 and CDC14, and the seeming adaptation of CDC14 as a protein interaction domain in higher plants. In addition, there is a dramatic increase in proteins containing RNA polymerase C-terminal domain phosphatase-like catalytic domains in the higher plants. Expression analysis of Arabidopsis phosphatase genes differentially amplified in plants (specifically the C-terminal domain phosphatase-like phosphatases) shows patterns of tissue-specific expression with a statistically significant number of correlated genes encoding putative signal transduction proteins.

The phosphorylation and dephosphorylation of proteins have been found to modify protein function in a multitude of ways (Cohen, 2002). The protein kinase content (kinome) of many eukaryotes and their evolutionary relationships have been studied in depth, revealing both the importance and diversity of these proteins (Manning et al., 2002a, 2002b; Caenepeel et al., 2004; Champion et al., 2004). With the exception of the few phosphatidyl inositol 3-kinase-like kinases, the protein kinases share a highly conserved catalytic domain. In contrast, the protein phosphatases are more diverse, displaying three different catalytic signatures, and thus can be divided into three broad groups (Moorhead et al., 2007). Although many of the phosphatases have been cataloged in the genomes of several organisms (Koh et al., 1997; Kerk et al., 2002; Alonso et al., 2004), this list continues to grow and, in organisms like higher plants, classification schemes are often quite incomplete.
Protein phosphatases were originally identified as enzymes responsible for dephosphorylating Ser and Thr residues on enzymes involved in mammalian glycogen metabolism. Purification of these enzymes, cloning, and genomics have revealed that this group is composed of two families (Ser/Thr-specific phosphoprotein phosphatase [PPP] and Mg²⁺-dependent phosphoprotein phosphatase [PPM]) that represent the major group of Ser and Thr phosphatases in eukaryotes (Table I; Rayapureddi et al., 2003; Alonso et al., 2004; Gohla et al., 2005; Moorhead et al., 2007). Ten years after the cloning of the first Tyr kinase, the first Tyr phosphatase was purified and then cloned. Its catalytic signature (C[X]5R) defined the large protein Tyr phosphatase (PTP) superfamily (Table I), which now, in addition to the Tyr-specific enzymes, includes enzymes that specifically dephosphorylate Ser or Thr as well as Tyr (the dual specificity phosphatases [DSPs]), mRNA, and phosphoinositides. Based on this catalytic signature, the group has expanded to 107 transcribed genes in humans (Homo sapiens). Several of these are catalytically inactive but their gene products function in the cell, and several have been linked to human diseases (Robinson and Dixon, 2006). The third major group of phosphatases was identified most recently and is characterized by a catalytic signature DXDXT/V and is referred to here as the Asp-based enzymes (Table I). The phosphatase responsible for dephosphorylation of the C-terminal domain (CTD) of RNA polymerase II (Pol II; FCP1) was the first enzyme of this group to be identified as a protein phosphatase and the related small CTD phosphatases (SCPs) have been recognized as part of the group. Several haloacid dehalogenase (HAD) superfamily members, such as EYES ABSENT (EYA), which acts as a transcription factor, and chronophin, which controls cofilin phosphorylation, have now been demonstrated to function as protein phosphatases.
Like FCP1 and SCP enzymes, HAD superfamily members have a DXDXT/V catalytic signature, utilizing a unique Asp-based catalytic mechanism. The HAD superfamily is potentially very large, but to date only a few members have been demonstrated to display Ser or Tyr phosphatase activity (Moorhead et al., 2007).
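The two short catalytic signatures named above, C[X]5R for the PTP superfamily and DXDXT/V for the Asp-based enzymes, can be written as regular expressions. The following is a minimal sketch only: the function name and the toy sequences are hypothetical, and it is not the search procedure used in this study, which relied on BLAST and HMMs. Motifs this short also match unrelated proteins by chance, so a hit is a flag for inspection rather than evidence of homology.

```python
import re

# Catalytic signatures as written in the text; "X" (any residue) becomes "."
SIGNATURES = {
    "PTP (C[X]5R)": re.compile(r"C.{5}R"),
    "Asp-based (DXDXT/V)": re.compile(r"D.D.[TV]"),
}

def find_signatures(sequence):
    """Return {signature name: [(1-based start, matched text), ...]}.

    Hypothetical helper for illustration; reports non-overlapping matches only.
    """
    seq = sequence.upper()
    return {
        name: [(m.start() + 1, m.group()) for m in pattern.finditer(seq)]
        for name, pattern in SIGNATURES.items()
    }

# Toy sequences invented for this example; they are not real proteins.
toy_ptp = "MKVHCSAGVGRSGT"
toy_asp = "MLLDLDETLVHS"

print(find_signatures(toy_ptp))
print(find_signatures(toy_asp))
```

In practice a hit would still need to be checked against the full catalytic domain alignment, since the signature alone does not distinguish, for example, active PTPs from the catalytically inactive family members mentioned above.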
A catalog of enzymes that comprises the PPP, PPM, and some of the PTP family members of Arabidopsis (Arabidopsis thaliana) was presented several years ago (Kerk et al., 2002). Since then both the number of known protein phosphatases and the set of completely sequenced reference genomes have expanded considerably. In this situation a systematic revisitation of the protein phosphatase repertoire is warranted. We have focused on these new or novel phosphatases by using the 11 phosphatase classes from humans to identify homologs from the genomes of Arabidopsis, green algae (Chlamydomonas reinhardtii and Ostreococcus tauri), Oryza sativa, and Populus trichocarpa. These proteins, along with their counterparts in humans and other animals, were analyzed to determine their interrelationships. When combined with previous studies from our laboratory, this work defines a complete set of all known varieties of protein phosphatases in Arabidopsis.

RESULTS AND DISCUSSION
Homologs were identified using BLAST searches as well as hidden Markov models (HMMs) of the catalytic domains. The overall results of this study are summarized in Table II. This lists the number of homologs for each protein phosphatase type that was found in each of the subject organism-specific databases. The structural classes are derived from table I in the recently published study of Moorhead et al. (2007). Results will be discussed in the order of their appearance in this table, followed by an analysis of expression and promoters of a subset of the identified genes. Evidence of expression of all proteins is summarized in Supplemental Table S1 (see ''Materials and Methods'' for details).

SSU72
Using the sequence of human SSU72 (gi:7661832) to search the target protein databases, we found one candidate homolog each in C. reinhardtii (Cre130182), O. tauri (Ot16g02480), Populus (Pop775960), Arabidopsis (At1g73820.1), and O. sativa (Os12g07050.1 and .2). We constructed a multiple sequence alignment, which is presented as Supplemental Figure S1. Phylogenetic trees show that both of the algal sequences cluster with the higher plant sequences in two of the three inference methods with high bootstrap support (82.2% maximum parsimony [Pars]; 80.0% maximum likelihood [ML]).
Table I. Summary of protein phosphatase gene types in Arabidopsis and human. Protein phosphatase genes are summarized from this study, plus the following references: Kerk et al. (2002); Kim et al. (2002); Schweighofer et al. (2004); Kerk (2007); and Moorhead et al. (2007). Multiple protein isoforms are often transcribed from the same gene, but are ignored in this table, for simplicity. A number of additional sequences have been identified that contain similarity to the Arabidopsis genes, but have been rejected because they lack critical class-specific catalytic residues (see Kerk, 2007). Since the report in Kerk et al. (2002), excluding the data from this study, the following changes have occurred to the Arabidopsis protein phosphatase gene set: three PPPs have been added (At1g18480 [Bacterial-like]; At3g19980 [PP6]; At5g43380 [PP1]); six DSPs have been added (At3g09100, At3g19420 [PTEN homolog], At3g62010, At4g03960, At5g01290, and At5g28210); eight PP2Cs have been added (At2g30170, At2g46920, At4g03415, At4g11040, At4g16580, At1g17550, At4g33500, and At5g66720); one DSP has been deleted (At3g55270, reannotated with a greatly shortened DSP domain); and one PP2C has been deleted (At1g75010, determined to be a false positive). N/A, Not applicable. Phosphatases of known family but with no direct homologs to characterized mammalian proteins (includes At1g03445, At1g07010, At1g08420, At1g18480, At2g27210, and At4g03080).

Slingshot
Slingshots, along with chronophin (included in ''Gene Expression''), dephosphorylate cofilin and, as a result, stimulate the depolymerization of F-actin (Huang et al., 2006). We used the human sequences SSH1 (gi:40254884), SSH2 (gi:37674210), and SSH3 (gi:24586675) to search the target protein databases. We found no candidate slingshot homologs in the algae, nor in any of the plant species, indicating that this method of regulation is animal specific.

CDC14
CDC14 belongs to the family of PTPs and is responsible for control of mitotic exit in organisms of the fungi/metazoan group (Trinkle-Mulcahy and Lamond, 2006). We used the human proteins CDC14A (gi:55976620) and CDC14B (gi:55976216) to search the target protein databases. A single candidate CDC14 homolog was found in C. reinhardtii (Cre112184). Our work shows clearly through several different methods that this sequence has all the structural features expected of a true CDC14. First, analysis with the FFAS03 technique (Rychlewski et al., 2000) shows that there is sequence similarity between Cre112184 and the solved structure of human CDC14B (Protein Data Bank entry: 1ohc; Gray et al., 2003) extending for over 300 amino acids, encompassing both the upstream A (unique to CDC14) and the downstream B (PTP/DSP catalytic) domains. The score for this comparison is very high (Z ~ 90 versus Z ~ 110 for HuCDC14B versus itself; for this technique a Z score of 9.5 or greater is considered significant). Second, the FFAS03 alignment shows high conservation (10/11) of a set of critical residues described in the solved structure of human CDC14B [these include the canonical PTP/DSP catalytic residues (HC[X]5R) in the B domain, plus a number of others unique to CDC14s]. Third, a multiple sequence alignment was constructed encompassing the B domain of animal CDC14s and, as an outgroup, the protein phosphatase domains of a set of DSPs previously characterized from Arabidopsis (Kerk et al., 2002, 2006; presented as Supplemental Fig. S2). The corresponding phylogenetic tree is presented as Figure 1. It is clear that in this tree Cre112184 is part of a clade with human and Xenopus laevis CDC14s, sharing a common [...] is applied to each of them, there is similarity to the solved structure of human CDC14B (Protein Data Bank entry: 1ohc) along a 300-amino-acid region that encompasses both the A and B domains. The scores for these comparisons are strong, between Z = 20 and Z = 50.
(Note, however, that these scores are much weaker than those obtained with the C. reinhardtii sequence Cre112184, presented above.) The O. tauri sequence (Z ~ 50) retains the canonical PTP/DSP catalytic residues (HC[X]5R), and therefore could be enzymatically active. It retains some of the hydrophobic pocket residues described in the solved structure of human CDC14B, but not all of them, and has only two of six acidic residues in the acidic groove region. In the solved structure these acidic residues are thought to be critical to binding basic residues in the target cyclin-dependent kinase, whereas the hydrophobic pocket residues are responsible for maintaining substrate specificity of CDC14 for Pro in the pSer +1 position (Gray et al., 2003). Therefore, although structural resemblance is readily apparent, it is doubtful that this domain could function specifically as a CDC14. The higher plant sequences obtain FFAS03 scores of between Z = 22 and Z = 36. They lack nearly all of the set of specific CDC14 residues, and furthermore have a C to S substitution in the PTP/DSP catalytic loop sequence. They therefore could not function as catalytic domains. A multiple sequence alignment encompassing the length of the full A and B domains of the CDC14-like sequences is presented as Supplemental Figure S3.
When considered in the context of the shorter multiple sequence alignment (Supplemental Fig. S2) and the subsequent phylogenetic tree (Fig. 1), it is apparent that sequence Ot15g00870 and the higher plant sequences form a second, distinct clade, sharing a common node with high bootstrap support in all tree inference methods (99.1% NJ; 98.4% Pars; 82.7% ML). Finally, the C. reinhardtii/animal cluster and the O. tauri/higher plant cluster are clearly related, compared with the generic Arabidopsis DSPs, sharing a common node in all three tree inference methods, with high bootstrap support in two of them (88.6% NJ; 95% Pars; 40.2% ML). It is thus very clear that all these sequences are CDC14 like and evolved from a common ancestor, if not from CDC14 itself. All other clusters and nodes in this DSP tree are as previously published (Kerk et al., 2006; data not shown).
As an additional note of interest, the higher plant CDC14-like sequences are found as a domain on NAD kinases (NADKs). NADK2, the Arabidopsis protein containing the domain, is a chloroplast-localized protein that has been shown to be a calcium-dependent calmodulin-regulated protein (Turner et al., 2004; Chai et al., 2005). Calmodulin binding takes place on the N terminus of the protein, which is the location of the CDC14-like domain. The calmodulin-binding site was mapped to a 45-amino-acid region of the domain, presented in figure 7 of Turner et al. (2004). When this region is compared with our multiple sequence alignment, the heart of the binding motif is the mutated PTP consensus site (H[C/S]X5R). This is reminiscent of substrate trapping mutants observed at the altered consensus catalytic motifs of other PTPs (Bliska et al., 1992; Milarski et al., 1993; Sun et al., 1993).

CDC25
The CDC25 proteins also have a role in control of progression through the cell cycle in fungi and metazoans. Although CDC14 controls mitotic exit, CDC25 is involved in the transition from G2 to M phase (Trinkle-Mulcahy and Lamond, 2006 gi:125625350]). Using these sequences to search the target databases, we found a number of similar sequences, whose protein phosphatase catalytic domains are collected in the multiple sequence alignment presented as Supplemental Figure S4. The resulting phylogenetic tree is presented as Figure 2. O. tauri has a candidate CDC25 homolog (Ot02g05470) that is a member of a CDC25 clade encompassing human, animal, and yeast (Saccharomyces cerevisiae and Schizosaccharomyces pombe) sequences, sharing a common node with high bootstrap support in all three tree inference methods (100% NJ; 97.8% Pars; 75.9% ML). Previous work supports this, as this gene has previously been cloned, and the expressed protein acted as a CDC25 in both yeast complementation and starfish oocyte cell division assays (Khadaroo et al., 2004). They noted the divergence of the N-terminal domain of this sequence (see Supplemental Fig. S5 for an alignment of the O. tauri and human sequences), but asserted that there was a conserved 14-3-3-binding site and several potential phosphorylation sites. Presumably the 14-3-3-binding site reported was 252-RPLASPP-258, the closest match to either of the consensus binding sites (Rxxx[S/T]xP in this case), which, while matching the consensus, has a Pro in both the S−3 and S+1 positions. This has been shown to be unfavorable (Yaffe et al., 1997), making it unlikely that the O. tauri protein has any 14-3-3-binding capability. Thus, although this protein clearly can act like a true CDC25 in functional assays and we support its classification as a CDC25 based on our sequence analysis, we anticipate that its regulation in vivo might well differ from that previously described for the fungal/animal proteins.
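The reasoning about the putative 14-3-3-binding site can be made concrete with a small check. This is a sketch under stated assumptions: the function name is hypothetical, the consensus Rxxx[S/T]xP and the peptide RPLASPP are taken from the text above, and the code only reports which positions hold Pro. Whether those positions are unfavorable for binding rests on the cited work (Yaffe et al., 1997), not on this check.

```python
import re

# Consensus quoted in the text: Rxxx[S/T]xP, where "x" is any residue.
CONSENSUS = re.compile(r"R...[ST].P")
SER_INDEX = 4  # 0-based index of the Ser/Thr within the 7-residue motif

def describe_site(peptide):
    """Return None if the 7-mer does not match the consensus; otherwise
    report whether Pro sits at the -3 and +1 positions relative to Ser/Thr.

    Hypothetical helper for illustration only.
    """
    pep = peptide.upper()
    if CONSENSUS.fullmatch(pep) is None:
        return None
    return {
        "pro_at_minus3": pep[SER_INDEX - 3] == "P",
        "pro_at_plus1": pep[SER_INDEX + 1] == "P",
    }

# Residues 252-258 of the O. tauri protein, as given in the text.
site = describe_site("RPLASPP")
print(site)
```

A peptide that matches the consensus without Pro at either flanking position would return both flags as false, which is the case the text treats as the favorable one.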
Our searches also revealed sequences sharing some similarity in C. reinhardtii (Cre153947, Cre171654, Cre183511, Cre167673), P. trichocarpa (Pop282198), Arabidopsis (A5g03455.1), and O. sativa (O10g39860, O03g01770). The Arabidopsis sequence was originally published as a CDC25 (Landrieu et al., 2004), but more recent work indicates that it may functionally be an arsenate reductase, indicated by a lack of arsenate reductase activity in a T-DNA plant line, and in vitro arsenate (V) reductase activity (Bleeker et al., 2006; Dhankher et al., 2006). The sequences of several known arsenate reductases from fern (Pteris vittata; PvACR2), yeast (ScACR2, SpACR2), and the protist Leishmania major (LmACR2) form a clade with the above candidate algal and higher plant sequences, with strong bootstrap support in two phylogenetic tree inference methods (98.8% Pars; 82.7% ML), supporting the categorization of these proteins as arsenate reductases. Our findings confirm and extend, with larger sequence sets, the phylogenetic analyses previously reported (Dhankher et al., 2006; Ellis et al., 2006). The higher plant and C. reinhardtii sequences also lack any significant N terminus, which is known to contain regulatory sites in animal CDC25s, such as phosphorylation and 14-3-3 protein-binding sites. It is apparent that the algal/plant proteins lack the conserved regulatory sites, as well as the catalytic activity, of the fungal/animal CDC25s. This viewpoint is supported by a recent article questioning the existence of CDC25 in higher plants, and suggesting that cell cycle control has been reorganized along lines distinctly different than the fungal/metazoan (presumably ancestral) model (Boudolf et al., 2006). It seems that with our data, an expansion of this concept is in order. It appears as if this reorganization of cell cycle control has occurred in higher plants, and that this process began during the radiation of the green algae. When combined with the above information on CDC14 in plants, it appears that different algal species are frozen at different points in this reorganization, some retaining one mitotic phosphatase or the other, but both lost by the transition to higher plants. Study of these algal species, as well as the higher plants, may give a unique insight into the evolution of these processes.
Figure 2. Phylogenetic tree of CDC25-like sequence relationships. A rectangular cladogram was generated by comparing catalytic domains of CDC25-like proteins (red) with the closest relatives in plants and fungi (blue). Proteins included are from the following organisms, with the source of the sequences in parentheses: Arabidopsis (MIPS code without ''t''); C. reinhardtii (Crexxxxxx, where xxxxxx is the protein identification from http://plantsp.genomics.purdue.edu/plantsp/data/proteins.Chlre3.fasta); Danio rerio (Drxxxxxxxx, where xxxxxxxx is the gi); humans (CDC25A_Hu:NP_001780, CDC25B_Hu:NP_068659, CDC25C_Hu:NP_001781); L. major (LmACR2: GenBank AAS73185); O. sativa (MIPS code without ''s''); O. tauri (MIPS codes given from https://bioinformatics.psb.ugent.be/gdb/ostreococcus/); P. trichocarpa (Popxxxxxx, where xxxxxx is the protein identification from DOE JGI); fern (PvACR2: GenBank ABC26900); S. cerevisiae (ScCDC25, NP_013750; ScACR2, NP_015526); S. pombe (SpCDC25, NP_592947; SpACR2, NP_595247); X. laevis (Xlexxxxxx, where xxxxxxxx is the gi, CDC25A_Xle:NP_001081257). Multiple sequence alignment construction and phylogenetic tree inference was performed as detailed in ''Materials and Methods''. The tree topology shown is that from ML, where 10,000 replicates were performed. The known CDC25 proteins (red) form a clade with the sequence from O. tauri (node A: 100% NJ; 97.8% Pars; 75.9% ML), whereas the most closely related plant proteins cluster with the arsenate reductases (blue; see text for details; node B).
Kerk et al.

Low-Molecular-Weight PTPs
We used the sequence for the low-molecular-weight PTPs (LMWPTPs) from humans (ACP1 [gi:1709543]) to search the target protein databases. We found one candidate homolog in C. reinhardtii (Cre117512), none in O. tauri, two in P. trichocarpa (Pop821042, Pop594818), one in Arabidopsis (At3g44620.1), and one in O. sativa (Os08g44320.1). The multiple sequence alignment constructed from these sequences is presented as Supplemental Figure S6. As seen in the alignment, the proteins are remarkably conserved, indicating an essential, conserved function for the protein in eukaryotes. The lack of a homolog in O. tauri is puzzling, however, and is possibly the result of secondary loss of the protein because the O. tauri genome is remarkably streamlined (Derelle et al., 2006).

Asp-Based Catalysis: FCP Like
The TFII-interacting RNA Pol II CTD protein phosphatase FCP1 is an essential yeast protein that acts to dephosphorylate the CTD of the largest subunit of RNA Pol II (Archambault et al., 1997). This subunit contains an array of repeats of a heptad unit containing Ser residues at positions 2 and 5. The transcription initiation and elongation process consists of a variety of mRNA modifying proteins being recruited to the Pol II complex. This appears to be modified by the state of Pol II phosphorylation: it is recruited to the complex in a hypophosphorylated state, phosphorylated during the transcription process, then dephosphorylated to allow termination and recruitment to a new complex (Meinhart et al., 2005; Moorhead et al., 2007). Complex modifications of the phospho-array are thus possible and with it modulation of the transcription process. FCP1 is a metal-binding protein, possessing a DXDXT/V motif that is essential to catalytic activity (Kobor et al., 1999). The isolated phosphatase domain is sufficient for catalytic activity. In plants and algae, we found a large set of proteins sharing a degree of similarity to this prototype sequence. A large multiple sequence alignment of the protein phosphatase catalytic domain of 99 sequences was constructed (presented as Supplemental Fig. S7) and the corresponding phylogenetic trees inferred (Fig. 3). The Arabidopsis proteins CPL1 (CTD phosphatase-like protein phosphatase) and CPL2 contain an FCP-like catalytic domain; however, they and their homologs in other plants are further characterized by the presence of one or two double-stranded RNA (dsRNA)-binding domains, and are discussed separately from the other members of this family. The trees were composed of several distinct subclusters, which are each presented in turn, based upon the topology of the NJ tree. The amino acid residues required for protein phosphatase catalytic activity have been well studied in FCP1 (Hausmann and Shuman, 2003; Hausmann et al., 2004).
A set of 11 critical sequence positions have been identified through biochemical analysis and are indicated in Supplemental Figure S7. The majority of sequences in this FCP1-like data set retain conservation at all these residue positions. However, 44 sequences deviate from the yeast residue pattern in at least one position (see Supplemental Fig. S7 legend).
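The comparison described above, checking each aligned sequence against the yeast residue pattern at a set of critical positions, can be sketched as follows. Everything in this example is an assumption made for illustration: the function name, the toy alignment, and the chosen columns are hypothetical, and strict identity is used as the test, whereas the study's own criteria (see the Supplemental Fig. S7 legend) may tolerate some substitutions.

```python
def deviating_sequences(alignment, reference_id, critical_columns):
    """Return {sequence id: [critical columns differing from the reference]}.

    Only sequences with at least one difference are listed. Columns are
    0-based indices into equal-length aligned sequences. Hypothetical helper.
    """
    reference = alignment[reference_id]
    deviations = {}
    for seq_id, seq in alignment.items():
        if seq_id == reference_id:
            continue
        diffs = [c for c in critical_columns if seq[c] != reference[c]]
        if diffs:
            deviations[seq_id] = diffs
    return deviations

# Toy alignment invented for this example; not data from the study.
toy_alignment = {
    "yeast_ref": "DLDETLV",
    "seq_a": "DLDETIV",  # differs only at a non-critical column
    "seq_b": "DLNETLV",  # differs at critical column 2
}
critical = [0, 2, 4]  # hypothetical critical columns

print(deviating_sequences(toy_alignment, "yeast_ref", critical))
```

Counting the keys of the returned dictionary gives the number of sequences that deviate in at least one position, which is the kind of tally reported above.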

SCP (Subcluster A)
SCP proteins are small RNA Pol II CTD protein phosphatases. In humans there are three proteins: SCP1 (gi:15278033), SCP2 (gi:31074179), and SCP3 (gi:34392247). We found one sequence in Chlamydomonas (Cre149388) and one in Ostreococcus (Ot12g02910) that share similarity with the animal proteins in the phosphatase catalytic domain, and none in the higher plants. Upon multiple sequence alignment and phylogenetic tree inference, these algal sequences form part of a clade with the human and other animal proteins, sharing a common node with varying bootstrap support in the three tree inference methods (99.7% NJ; 44.8% Pars; 34.7% ML). This support does not meet our minimum threshold of majority support in two of three methods. To clarify the situation, sequences were added (duplicates with different accession numbers) and removed (more divergent sequences) from the alignment used to generate the trees. In both these situations, the modified alignments, with more or fewer sequences, met our requirement of support in two of three methods. However, N-terminal (noncatalytic domain) motif analysis shows that the O. tauri sequence does not share motifs found in the animal sequences, and the C. reinhardtii sequence lacks this N-terminal region. Thus, while not unequivocal, these data support the assignment of these two algal sequences to the SCP cluster.
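The acceptance rule applied to this node, majority bootstrap support in at least two of the three tree inference methods, can be expressed as a short check. This is a sketch under an assumption: "majority" is interpreted here as strictly greater than 50%, which the text does not state explicitly, and the function and variable names are hypothetical. The percentages are those quoted in the text.

```python
def meets_support_threshold(support, majority=50.0, methods_required=2):
    """Return True when enough inference methods give majority support.

    `support` maps a method name to its bootstrap percentage; a method in
    which the node does not appear is simply left out. The 50% cutoff is an
    assumed reading of "majority support".
    """
    passing = [method for method, pct in support.items() if pct > majority]
    return len(passing) >= methods_required

# Bootstrap values quoted in the text.
scp_node = {"NJ": 99.7, "Pars": 44.8, "ML": 34.7}
ssu72_node = {"Pars": 82.2, "ML": 80.0}

print(meets_support_threshold(scp_node))    # only NJ exceeds 50%
print(meets_support_threshold(ssu72_node))  # both reported methods exceed 50%
```

Under this reading the original SCP node fails the rule, consistent with the statement above, while nodes reported with two qualifying values pass.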
Subcluster C comprises human Dullard and its animal homologs (100% NJ; 100% Pars; 98.4% ML). Dullard is a fairly recent discovery in this gene family, and has been implicated in neural tube development, in the BMP (bone morphogenetic protein) pathway, and in nuclear membrane biogenesis (Satow et al., 2002, 2006; Kim et al., 2007). N-terminal sequence analysis shows that these sequences share common motifs. The presence of a Dullard homolog in yeast, involved in a conserved pathway (Kim et al., 2007), and the lack of homologs in algae/plants suggest it arose after the plant/animal evolutionary split.
Subcluster D is a group of six higher plant sequences (Pop545127, A1g29780.1, A1g29770.1, A5g45700.1, O05g11570, and Pop235176). This group has high bootstrap support in all three tree inference methods (99.4% NJ; 91.8% Pars; 80.6% ML). N-terminal sequence analysis shows that these sequences share common motifs. Subcluster E contains both algal and higher plant sequences (Cre111940, Pop560916, Pop659512, A3g55960.1, Os01g61640, and O05g39070). This group receives high to moderate bootstrap support in two tree inference methods (99.3% NJ; 60.46% ML) and these sequences share common N-terminal motifs. Subcluster F contains two C. reinhardtii sequences that are not splice variants (Cre187551 and Cre142839) and have high to moderate bootstrap support in two of three inference methods (99.7% NJ; 79.4% ML). The proteins of subclusters D to F have yet to be characterized.

Subcluster G
This is a set of eight sequences from animals, algae, and higher plants (TIM50Hu, Cre169672, Ot03g04550, Pop642296, Pop568582, A1g55900.1, O05g43770, and O1g55700.1). This group receives moderate bootstrap support in all three tree inference methods (82.0% NJ; 88.7% Pars; 74.7% ML). N-terminal motif analysis shows that most sequences in this cluster share a common signature. TIM50 only weakly shares some elements of this signature, and is clearly the most distantly related sequence in this group. TIM50 (sometimes referred to as TIMM50) is the homolog of the yeast protein of the same name, and is named as translocase of inner mitochondrial membrane 50 kD. As indicated by its name, TIM50 is involved with the translocation of proteins through the inner membrane into the matrix, as part of the TIM23 complex (mitochondrial protein import; for review, see Neupert and Herrmann, 2007), although a nuclear-localized isoform has also been identified in humans (Xu et al., 2005). The Arabidopsis protein was identified in a previous proteomic characterization of mitochondrial import proteins, demonstrating the conservation of the localization of this protein, at the very least (Lister et al., 2004). The human isoform has also been shown to possess phosphatase activity, intriguingly active on phosphorylated Ser, Thr, and Tyr (Guo et al., 2004). This subcluster is the only one containing proteins that we can be all but assured are not involved in the dephosphorylation of the CTD (due to their mitochondrial localization), raising questions not only about the substrate specificity of other FCP-like proteins but also about the specific target of dephosphorylation by these TIM50 homologs.

''FCP Assemblage'' (Subcluster H)
This is a large group of sequences (24) from animals, algae, and higher plants. The assemblage as a whole receives moderate to low bootstrap support from all three tree inference methods (64.1% NJ; 67.6% Pars; 43.7% ML; note that several sequences are excluded from the subcluster in the Pars tree; see the Fig. 3 legend for details). Within it are distinct subclusters formed by the yeast/animal FCP1 group (FCP1_Spomb, FCP1_Hu, Dr49618915, Dr94734487, FCP1_Xle, and Xt62858037), the higher plant CPL3 (Pop708815, CPL3_Ath, and OsCPL3), and CPL4 (Pop262722, CPL4_Ath, and OsCPL4) groups, and associated algal sequences (Cre187332, Cre141879, Ot04g02710, and Ot03g04040). Of particular note is a subcluster made up exclusively of Arabidopsis sequences (A5g23470.1, A2g02290.1, A1g20320.1, A1g43600.1, and A1g43610.1). There is also a subcluster of closely related sequences from the algae as well as P. trichocarpa (Ot03g00770, Cre149314, and Pop560900). Each of these subclusters receives high bootstrap support, but their relative topological interrelationships within the assemblage vary slightly among the different tree inference methods.
[Fig. 3 legend, partial] [...] O. tauri (MIPS codes given from https://bioinformatics.psb.ugent.be/gdb/ostreococcus/); P. trichocarpa (Popxxxxxx, where xxxxxx is the protein identification from DOE JGI); S. pombe (FCP1_Spombe:NP_594768); X. laevis (FCP1_Xle:NP_001081726); Xenopus tropicalis (Xtxxxxxxxx, where xxxxxxxx is the gi, with the following exceptions from Ensembl: 39992_Xtr:ENSXETP00000039992, 32705_Xtr:ENSXETP00000032705). Multiple sequence alignment construction and phylogenetic tree inference was performed as detailed in ''Materials and Methods''. The tree topology shown is that from NJ, where 1,000 replicates were performed. The proteins segregate into 10 subclusters, which are labeled, color coded, and discussed in the text. The support for each of the labeled nodes is as follows: [...]
N-terminal motif analysis of these sequences indicates that some of these sequences share more than simply the same catalytic domain. The higher plant CPL3s have a distinct motif signature (data not shown). Elements of this signature are weakly shared by the two algal sequences: Ot03g00770 and Cre187332. This indicates that these sequences are most closely related to the CPL3s. The higher plant CPL4s also have a distinct motif signature (data not shown). The algal sequence Ot03g04040 shares this motif signature, as does the algal sequence Cre141879 (though with reduced similarity). Finally, the CPL4 upstream motif signature is shared by the Arabidopsis sequences A1g20320, A2g02290, and A5g23470. It is likely that all these CPL4-like sequences are related. The animal FCP1s share a common motif signature, which is shared to some extent by yeast FCP1. The CPL4-like sequences share elements of this motif signature with the FCP1s, whereas the CPL3s do not, suggesting a closer relationship of CPL4-like and FCP1 groups. The algal sequence Ot04g02710 shares elements of the FCP1 motif signature, suggesting it is more closely related to the animal and yeast FCP1s. Sequences A1g43600 and A1g43610 have nonexistent or short N termini, respectively, and are likely to be regulated differently from other proteins in this cluster.
Fungal and animal FCP1 proteins have, in addition to the protein phosphatase catalytic domain, a downstream phosphoprotein-binding BRCA-related C-terminal (BRCT) domain. The only sequences in our data set containing this domain are in the ''FCP Assemblage'', confirming the relationship of these algal and plant sequences to the FCP1s. However, although most sequences contain the BRCT domain, some do not. Fifteen of the 24 sequences in the cluster contain it, the exceptions being the small Arabidopsis-only sequence cluster (five sequences), the three algal sequences Ot03g00770, Cre141879, and Cre149314, and the P. trichocarpa sequence Pop560900, which is likely a CPL1/2 relative but was included here because of its ambiguity. Multiple sequence alignment of the BRCT domain sequences shows them to be well conserved, and we would therefore expect them to be functional (data not shown). Although the Arabidopsis sequences without BRCT domains are likely a result of secondary loss of the domain, the algal sequences could also be an indicator of the original state of the FCP-like proteins, before gaining the BRCT domain.
Limited study of the Arabidopsis CPL3 and CPL4 proteins sheds some light on the comparative function of these proteins in plants. Both of these proteins contain a functional BRCT domain, which binds to AtRAP74, a homolog of animal/yeast TFIIF (Bang et al., 2006). Knockout plants for CPL3 display hyperactivation of abscisic acid (ABA)-mediated transcription, as well as a general alteration of plant growth and maturation, which can be duplicated with mutations to either the BRCT or catalytic domains (Koiwa et al., 2002). RNAi knockdown of CPL4 also leads to plant growth and maturation defects (Bang et al., 2006).

Subcluster I
This cluster is composed of 10 plant and algal sequences (O07g01850, A4g261190.1, Pop290290, A3g29760.1, Pop296466, Pop195150, O3g54870.1, O3g54850.1, Ot05g00010, and Ot05g00160; Cre166215 is also included depending on tree). This group receives a range of support from the three tree inference methods (99.5% NJ; 60.4% Pars; 68.1% ML; Cre166215 is included in this subcluster in the Pars and ML trees). The sequences do not seem to have any significant relation outside of the catalytic domain.

Subcluster J
This cluster is composed of eight plant, animal, and algal sequences (Cre115803, Ot01g02920, and MGC10067Hu [also known as UBLCP1, ubiquitin-like domain-containing C-terminal phosphatase 1; Zheng et al., 2005], O01g6450, O1g65450.2, A4g06599.1, Pop274109, and Pop830623). This group receives high to moderate support from the three tree inference methods (100% NJ; 99.0% Pars; 58.2% ML). This group is defined by the presence of a ubiquitin-like domain on the N terminus of the proteins. This domain is listed in the National Center for Biotechnology Information conserved domain database as cd01813, as shared with ubiquitin-specific proteases; however, all entries appear to be CTD phosphatase homologs. Despite this, these proteins do have what appears to be a proteasome-interacting motif based on the work of Upadhya and Hegde (2003). The human protein in this group, UBLCP1 (listed on the tree and alignment as MGC10067), has been studied and determined to be a functional CTD phosphatase with a possible preference for Ser-5 (Zheng et al., 2005). The combination of the ubiquitin-directed proteolysis and RNA Pol II phosphatase activity, in addition to the apparent conservation, makes this group of proteins intriguing, and further study of their role in the cell is awaited.

CPL1 and CPL2
CPL1 and CPL2 are CTD phosphatase-like protein phosphatases initially described in Arabidopsis (At4g21670.1 and At5g01270.1, respectively; Koiwa et al., 2002, 2004). As mentioned above, these FCP-like phosphatases are characterized by the presence of dsRNA-binding domain(s) on the C terminus (two in CPL1 and one in CPL2). We used the Arabidopsis sequences to search the target protein databases. We found four candidate homologs in P. trichocarpa (Pop555554, Pop743771, Pop90064, and Pop560900), and six candidate homologs in O. sativa (O02g42600, O01g63820, O1g63820.2, O4g44710.1, O4g44710.2, and Os38346621; the last being a possible isoform of the previous two). The multiple sequence alignment encompassing these full-length sequences and that of the Arabidopsis proteins is presented in Supplemental Figure S8. From an inspection of the C-terminal region of this alignment, it is evident that six of these newly identified sequences have two full predicted RNA-binding domains, with a high degree of similarity to CPL1_Ath, and are therefore CPL1s (O02g42600, Os38346621, O4g44710.1, O4g44710.2, Pop555554, and Pop743771). In contrast, two of the new sequences (Pop90064 and Pop560900) have a greatly truncated second RNA-binding region (very similar to CPL2_Ath) and are therefore CPL2 proteins. The situation with the remaining new sequence is more complex.
The sequence O01g63820 (both isoforms) occupies an intermediate position between the well-defined CPL1 and CPL2 clusters in the phylogenetic trees. There is disagreement between tree inference methods as to whether it is included within the CPL1 cluster (NJ) or the CPL2 cluster (Pars). Although the protein contains a well-conserved second RNA-binding domain, the first domain contains a 12-residue deletion within a normally conserved region, requiring experimental confirmation of function. The sequence has other peculiarities, which might preclude it being a functional (protein) phosphatase. There are several prominent deletions (approximately 405-440, approximately 550-625, approximately 635-680, approximately 700-725, and approximately 735-785, as on the scale in Supplemental Fig. S8). In addition, two residues that are known to be critical to the activity of yeast FCP1 (the ''DD'' at about position 405 of the alignment) are not conserved, although they are also not conserved in several other proteins, including the members of the TIM50 subcluster (subcluster G), despite the demonstration of phosphatase catalytic activity of human TIM50 (Guo et al., 2004). However, on balance, the sequence features are most consistent with classification of this sequence as a CPL1, provided it is shown to have activity. This sequence has a particularly convoluted history because a highly similar sequence was published as ''OsCPL2'' when originally identified (Koiwa et al., 2004). Through more recent revisions of both genomic and protein databases, this protein (and the apparent isoform, or duplicate Os01g0857000, whose database entry is still provisional) has come to appear more like a divergent CPL1. N-terminal motif analysis shows that the CPL1s and CPL2s are very uniform, and have a common motif signature, which they do not share with the other FCP1-related sequences.
The Arabidopsis proteins CPL1 and CPL2 have been experimentally characterized to some degree, and their isolated catalytic domains are capable of dephosphorylating Ser-5 of the Pol II heptad repeat (Koiwa et al., 2004). Deletion of the C terminus of the CPL1 protein, containing the dsRNA-binding domains, creates a cpl1 phenotype, although the function of the domains is not known (Koiwa et al., 2004).

Overall Observations of Sequences with an FCP-Like Domain
Proteins containing an FCP-like catalytic domain can be seen as a microcosm of the evolutionary differences between algae, higher plants, and animals on a protein level. Every possible combination of conservation is present, with the important exception of plant and animal similarity with algal differences. As mentioned above, this places modern algae directly between plants and animals, making them ideal candidates to study the earliest differences between plants and animals.

EYA
These protein phosphatases are part of the HAD family. They have been shown to mediate complex morphogenetic events in animal development (Rebay et al., 2005). We used the human sequences EYA1 (gi:26667222), EYA2 (gi:26667240), EYA3 (gi:26667243), and EYA4 (gi:98991760) to search the target protein databases. We found one candidate homolog in P. trichocarpa (Pop356606), one in Arabidopsis (At2g35320.1), and one in O. sativa (Os06g02028.1). The Arabidopsis protein possesses Asp-based catalytic activity (Rayapureddi et al., 2003); however, the function of the plant proteins is currently unknown. We found no candidate homolog in C. reinhardtii or O. tauri. The multiple sequence alignment we constructed of catalytic domains is presented as Supplemental Figure S9. In Drosophila melanogaster EYA, the prototype of this group, binding occurs between the protein phosphatase domain and the homeobox transcription factor sine oculis. A large N-terminal EYA domain then supplies transactivation functions essential for normal eye development (Pignoni et al., 1997). However, the plant homologs we have identified, including the Arabidopsis protein, lack the N-terminal domain of the animal proteins, and thus are unlikely to be directly involved in transcriptional activation. The absence of homologs in algae may indicate that, whatever the mechanism of action, higher plant EYAs may mediate functions similar to their animal counterparts, and thus have been lost in the modern green algae. Importantly, this is the sole example of animals and plants having homologs of a protein that is absent in algae.

Chronophin
Chronophin is a member of the HAD superfamily, involved in the activation of the actin filament regulator cofilin (Gohla et al., 2005). We used the sequence of human chronophin (gi:10092677) to search the target protein databases. We found three potential homologs in C. reinhardtii (Cre77681, Cre127857, and Cre142105), two in O. tauri (Ot08g02300 and Ot15g02680), three in P. trichocarpa (Pop55442, Pop696747, and Pop671977), three in Arabidopsis (At5g36790.1, At5g36700.1, and At5g44760.1), and two in O. sativa (Os09g08660 and Os04g41340). The multiple sequence alignment constructed from the catalytic domain region is presented as Supplemental Figure S10. In the phylogenetic trees, sequences Cre127857, Cre142105, and Ot15g02680 cluster together with the animal chronophin sequences with high to moderate bootstrap support (100% NJ; 85.4% Pars; 53.0% ML). The sequences Cre77681 and Ot08g02300 cluster with neither the plant nor the animal sequences in two of the three tree inference methods (Pars and ML). To summarize, higher plants seem to have at least one extra chronophin-like protein, and this trend includes the algae studied. However, the plant and animal chronophins cluster separately in the phylogenetic analysis, and the algae seem to be more closely related to the animals in this regard.

Gene Expression
Because of the well-studied ability of some FCP1-like protein phosphatases to modify the phosphorylation state of RNA Pol II and thus to alter the dynamics of mRNA transcription, Bang et al. (2006) suggested that they might be able to act as regulators of gene expression. To investigate this possibility further, we analyzed the Affymetrix microarray expression data available for probes from this gene set.
The results are summarized in Table III. For eight of the 14 gene probes examined, there proved to be highly correlated gene sets. To further dissect the data, we defined three arbitrary categories of correlated genes: protein kinases/phosphatases, components of the ubiquitination/proteolysis system, and putative transcription factors. Our rationale was that these proteins are capable of posttranslational effects that would amplify the significance of potential gene regulatory networks.
The data for correlated gene expression for the FCP1-like CTD protein phosphatases present an interesting and varied pattern. The number of highly correlated probes varied from zero to several hundred (Table III). The sets of ''top 100'' correlated probes for all the FCP1-like driver gene probes contain substantial numbers of potential regulatory protein gene probes. There are between 15 (At2g33540 [255843_at]) and 37 (At1g43600/At1g43610 [262720_s_at]) found in each FCP1-like driver gene correlated probe set. Furthermore, the balance of gene probes in the three categories is quite varied. The statistical significance of the number of probes identified in each category was also determined, as detailed in ''Materials and Methods''. Two drivers had a ''very highly significant'' number of correlated probes (P < 1E-07), three had a ''highly significant'' number (P < 1E-04), five had a ''statistically significant'' number (P < 0.01), with the remaining not statistically significant (P > 0.01). Finally, it should be pointed out that for each of the genes in the FCP1-like set, there is a single tissue, or a small set of tissues, which display a greatly enhanced level of expression (with the exception of At3g29760, which shows relatively ubiquitous expression). The limited protein data available (for CPL3 and CPL4) lends some support to this, with expression mostly in the roots, when compared to shoots (Bang et al., 2006). This is in contrast to the other more highly conserved genes in this study (e.g. EYAs and SSU72s), where there is more uniformly ubiquitous gene expression (data not shown).
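The significance tiers quoted above can be written down as a small lookup. This is only a sketch of the labeling rule: the function name `significance_tier` is hypothetical, and the treatment of a value exactly equal to 0.01 (classed here as not significant) is an assumption, since the text specifies only P < 0.01 and P > 0.01.

```python
def significance_tier(p_value: float) -> str:
    """Map a P value to the tier labels used in the text.

    Thresholds (1E-07, 1E-04, 0.01) are taken from the text; the handling
    of P exactly equal to 0.01 is an assumption made for this sketch.
    """
    if p_value < 1e-7:
        return "very highly significant"
    if p_value < 1e-4:
        return "highly significant"
    if p_value < 0.01:
        return "statistically significant"
    return "not statistically significant"
```

The checks run from the strictest threshold to the loosest, so each P value receives the strongest label it qualifies for.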

Promoter Analysis
The analysis of a group of genes with highly correlated expression may serve to elucidate possible functions and common regulatory mechanisms for expression. One of the first demonstrations of this concept was for yeast and human gene sets, where the statistical measure of coexpression was hierarchical sequence clustering (Eisen et al., 1998). The results clearly established that groups of genes that share common expression patterns also share common functions. This allows inferences to be made based on previous knowledge of gene function within the set. A similar type of analysis allowed the identification of clusters of circadian-regulated genes in Arabidopsis, and, with analysis of upstream sequences, the identification of the responsible promoter ''Evening Element'' (Harmer et al., 2000). More recently, another statistical measure of gene coexpression, the Pearson correlation coefficient, has been used to document gene sets enriched in cell wall synthetic enzymes (Jen et al., 2006) and genes responsive to illumination with red light (Manfield et al., 2006). Common promoter motifs were shown to be shared by cold-responsive genes, and other genes in a highly correlated expression set (Jen et al., 2006).
We examined sets of genes whose expression was positively correlated with that of driver genes in the FCP1-like gene tree for enrichment of characterized promoter elements (P < 10⁻³). The results are summarized in Supplemental Table S2. In general terms these might be said to fall into a few major categories (stress response, development/proliferation, and defense). In broad outlines, there are apparent similarities between the promoter elements enriched in the correlated gene sets for CPL3 (At2g33540), CPL2 (At5g01270), and At5g11860 (in FCP-like subcluster 3). Elements associated with ABA predominate. Indeed, CPL3 is one of the best studied of the Arabidopsis FCP-like gene set, and based on functional data it has been proposed to be primarily an ABA response gene (Koiwa et al., 2002). It would be logical for other genes in its regulatory network to have similar characteristics. In contrast, CPL1 (At4g21670) has been shown to be a negative modulator of various stress responses, and mutations produce growth and maturation defects distinct from CPL3 (Koiwa et al., 2002). We find that a distinct set of promoter elements is enriched in the correlated gene set for CPL1. This is consistent with a regulatory gene network responding to different conditions than that for CPL3. Gene sets whose expression is correlated with driver probe 257378_s_at (At2g02290 and At5g23470) and driver probe 262720_s_at (At1g43600 and At1g43610; FCP1-like subcluster 6) contain promoters enriched for the ''telo-box'' element. This is a motif with similarity to telomeric chromosomal sequences, which is found in promoters of genes up-regulated during the cell cycle (Tremousaygue et al., 2003). The correlated gene set for At3g55960 contains promoters enriched for the ''W-box'' motif. This has been characterized as being essential to the activities of the NPR1 plant defense response induction gene (Yu et al., 2001).
Finally, the At3g29760 gene (driver probe 257285) includes the ''Evening Element'', which is a promoter element found in circadian regulated genes (Harmer et al., 2000). In no case was a single element found to be common to all members of a gene set. This could be explained by the presence of multiple gene subsets correlated with each driver, or that the uniting promoter element has yet to be discovered.
Table III footnotes:
a A ''highly correlated probe'' is arbitrarily defined as one with P = 0 and E = 0, where P represents the probability of such a correlation coefficient arising in the entire microarray data set by chance alone and E represents the number of times a correlation coefficient of the stated value would arise from the entire microarray data set by chance alone (see Arabidopsis Coexpression Data Mining Tools [http://www.arabidopsis.leeds.ac.uk/act/index.php] for details).
b Correlation values are Pearson correlation coefficients, rounded to three decimal places to save space.
c The number of observed probes in each column was analyzed for statistical significance by calculating the probability of a random sampling result using the hypergeometric distribution (this procedure corresponds to the Fisher exact test) as detailed in ''Materials and Methods''. Significant probabilities are indicated; other entries in these columns have nonsignificant probabilities.

The Complete Set of Arabidopsis Protein Phosphatases
In combination with previous work on Arabidopsis (Kerk et al., 2002, 2006; Schweighofer et al., 2004), the results of this study allow a compilation of the complete inventory of the various known types of protein phosphatase present in this organism. Table I , 2002; DeLong, 2006; Kerk, 2007).

CONCLUSION
As key regulatory enzymes, the presence or absence of any particular protein phosphatase can indicate similarities and differences between species. This analysis for the novel phosphatases has indicated several key differences and similarities in the function of algae, higher plants, and animals. The essential (for animals) cell cycle control enzymes CDC14 and CDC25 seem to have been lost or co-opted for different use in higher plants, whereas higher plants have increased their numbers of FCP-like proteins. Other classes, such as the LMWPTPs, SSU72s, and the ubiquitin-like domain-containing CTD phosphatases, seem remarkably conserved. These data allow insight into the differences and similarities in the function of plants and animals, and how they originated.

Identification of Candidate Protein Phosphatase Homolog Sequences
Representative animal sequences from each structural class were obtained from the published research literature and used as queries in BLASTP searches (Altschul et al., 1997). Sequences returned from the database with the highest scores and the lowest E values (closest to zero) were examined further. Due to some ambiguity in the CDC14 data, sequence structural similarities were assessed by the ''fold compatibility'' method of comparison to sequences of solved proteins, at the FFAS03 Web site (Rychlewski et al., 2000; http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl). This method returns standardized variable Z scores; a score of >9 is cited by the authors as being statistically significant. To ensure that no distantly related algal or plant homologs were missed by the initial single query sequence-based BLAST search strategy, the same databases were searched again in a recursive fashion using HMMs constructed from the validated sequence sets from each structural class (see below for details).

Characterization by Multiple Sequence Alignment
The putative protein phosphatase domains of all the candidate homolog sequences for a particular structural subclass, identified in the database search strategy, were placed together in a multiple sequence alignment. The program Muscle (Edgar, 2004) was used, with default parameters. In the case of the DSP CDC14, a reference set of catalytic domains from Arabidopsis DSP proteins was included in the alignment to test whether potential homologs are more closely related to the specific CDC14s or to the general DSP set. A multiple sequence alignment representing the phosphatase domain of each structural subclass was then further examined for characteristic sequence features cited in the research literature, including patterns of conserved critical residues. In some instances additional multiple sequence alignments were also performed with more extensive regions of the protein sequences (i.e. including the nonphosphatase domains) to examine similarity outside the catalytic domain. In the case of the large, heterogeneous FCP1-like sequence set, the final multiple sequence alignment was constructed from a set of smaller subalignments. Each of these was constructed using Muscle, and edited in the sequence display program GeneDoc (Nicholas et al., 1997) to remove poorly aligned regions. This process was guided by evaluation at the T-Coffee Web server (Poirot et al., 2004; http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi). Subalignments were combined, or sequences combined to alignments, using the Profile-Profile or Sequence-Profile alignment features of ClustalX (Thompson et al., 1997; default parameter settings). The various multiple sequence alignments were used to generate HMMs of the proteins of each structural class using the HMMER package (Eddy, 1998; program commands ''hmmbuild'', ''hmmcalibrate'', and ''hmmsearch'').
These models were then used to search (threshold E = 1) through the plant and algal protein databases, and new hits were added to the alignments and scrutinized in the same manner as the original BLAST hits. Sequences lacking known critical active site residues were removed, with the exception of the FCP1-like proteins and the potential CDC14 plant homologs (sequences removed because of a lack of active site conservation are listed in the individual supplemental figure alignments, and sequences included despite lack of active site residues are listed in the legend of Supplemental Fig. S7).

Construction of Phylogenetic Trees
Phylogenetic trees were inferred by the NJ functionality of ClustalX (Thompson et al., 1997; default scoring matrix, ''exclude positions with gaps'' off, ''correct for multiple substitutions'' off); ML, as implemented in TreePuzzle (Schmidt et al., 2002; ''unique topologies'', outgroup specified from the data set, scoring matrix BLOSUM62, 10,000 puzzling trees); and MP, as implemented in PHYLIP (Felsenstein, 1996; randomize sequenced input order and shuffle, multiple data sets [500], other parameters default). NJ topologies were generated as the consensus of 1,000 bootstrap alignment replicates; ML topologies represent the consensus of 10,000 puzzling trees; and Pars topologies represent the consensus of 500 bootstrap alignment replicates. Nodes are presented that exceed 50% support in at least two of the three tree inference procedures.
(Bailey and Elkan, 1995; Bailey and Gribskov, 1998; Bailey et al., 2006). MEME was run with the ''zoops'' model, default motif length, and number of motifs set to 10. Motifs identified in MAST were included if they met the default scoring threshold of P < 0.0001.
below). In addition, Arabidopsis microarray data were examined (see next section), as well as MPSS data (Meyers et al., 2004; http://mpss.udel.edu/at/; http://mpss.udel.edu/rice/) from Arabidopsis and O. sativa. Sequences were included from Arabidopsis and O. sativa only if there was a strong hit with a database EST from that species. Because EST representation is so much poorer for the other organisms in this data set, sequences were included lacking a species-specific EST hit if a strong hit was obtained by the query sequence to an EST sequence in another species within the same genus (for example, a P. trichocarpa query sequence returning a strong EST hit in another species of Populus). Because of the dearth of EST data, sequences were also included with no expression data. These candidate homolog sequences are marked as provisional (gray) in Supplemental Table S1.
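The node-reporting rule stated for the phylogenetic trees in this section (support exceeding 50% in at least two of the three inference procedures) can be sketched as a simple predicate. The function name `node_is_reported` and its parameters are hypothetical and are not part of any of the tree-building programs cited.

```python
def node_is_reported(nj: float, pars: float, ml: float,
                     cutoff: float = 50.0, min_methods: int = 2) -> bool:
    """Return True if a node's support (in percent) strictly exceeds
    `cutoff` in at least `min_methods` of the three procedures
    (NJ bootstrap, parsimony bootstrap, ML puzzling support)."""
    supports = (nj, pars, ml)
    return sum(s > cutoff for s in supports) >= min_methods
```

For instance, the support values quoted in the text for subcluster I (99.5% NJ; 60.4% Pars; 68.1% ML) satisfy the rule, whereas a node supported above 50% by only one method would not be shown.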

Mining of Microarray Gene Expression Data
Affymetrix microarray data within the NASC data set (Craigon et al., 2004) were analyzed. Probe identities were obtained from input Arabidopsis Genome Initiative gene numbers at the Arabidopsis Coexpression Data Mining Tools Web site (Jen et al., 2006; Manfield et al., 2006; http://www.arabidopsis.leeds.ac.uk/act/index.php). Analysis of correlated probes was performed using the ''Coexpression Analysis over Available Array Experiments'' option. Tabulated correlation values (Pearson correlation coefficients [r]) were rounded to three decimal places to save space. Also provided by the Web site for each correlated probe is an accompanying ''P value'' (the probability of obtaining an r value of the stated magnitude from the microarray database by chance alone) and an ''E value'' (the number of times an r value of the stated magnitude would be obtained from a random sampling of the microarray database). A probe whose expression is ''highly correlated'' with the given driver probe was arbitrarily defined in a very conservative fashion (to minimize false positives) as one where P = 0 and E = 0. The annotations for the top 100 correlated probes were examined for each ''driver'' (i.e. input) probe, and classified into three groups, ''Protein kinases/protein phosphatases'', ''Ubiquitination/Proteolysis System'', and ''Transcription Factors'', based upon sequence annotation (criteria for each group are presented in the next section). Correlated gene sets for each ''driver'' gene probe are presented as Supplemental Table S3. Spatial patterns of gene expression were examined using tools at the Genevestigator Web site (Zimmermann et al., 2005; https://www.genevestigator.ethz.ch/). The ''Meta-Profile'' option was used to determine sites of maximal gene expression. Table III presents results showing the number of gene probes in each of the three functional groups described above that are highly correlated with driver genes in our data set.
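As an illustration of the coexpression measure described above, the following minimal sketch computes a Pearson correlation coefficient between a driver probe and other probes, rounds it to three decimal places, and keeps the top-ranked probes. It is not the implementation used by the ACT Web site; the function names, the probe identifiers in the usage example, and the in-memory dictionary of expression profiles are all assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

def top_correlated(driver, profiles, n=100):
    """Rank probes by r against the driver profile (highest first),
    rounding r to three decimal places as in the tabulated values.
    `profiles` maps a probe identifier to its expression profile."""
    scored = [(probe, round(pearson_r(driver, values), 3))
              for probe, values in profiles.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]

# Usage with made-up expression values for three hypothetical probes.
example = {"probe_a": [2, 4, 6], "probe_b": [3, 2, 1], "probe_c": [1, 3, 2]}
ranking = top_correlated([1, 2, 3], example, n=2)
```

Sorting in descending order of r selects positively correlated probes first, which matches the use of the ''top 100'' correlated probes per driver.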

Statistical Determination of ''Overrepresented'' Gene Probes
To assess the significance of these observations, we used the method described in Jen et al. (2006). The probability of obtaining the stated number (k; given in Table III) of gene probes by chance from a data set containing N total gene probes, with R gene probes of the same functional type as the sample k is given by the hypergeometric distribution. This is given by the density function:

$$P(x; N, R, k) = \frac{C(R, x)\, C(N - R, k - x)}{C(N, k)},$$

where C(n,m) is the binomial coefficient representing the number of combinations of m objects that can be drawn from a population of n objects. Obtaining this probability is the equivalent of the Fisher exact test. We performed the calculation using the ''HYPGEOMDIST'' function of MS Excel. A value of P < 0.01 was deemed to be statistically significant. Values for the parameters N and R were obtained as detailed below.
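The density function above can be evaluated directly with exact binomial coefficients. The sketch below is an illustrative reimplementation rather than the spreadsheet calculation used in the study; the function names are hypothetical, and the upper-tail sum is included as an assumption about how a one-sided enrichment probability could be obtained, since the text itself specifies only the density.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, R: int, k: int) -> float:
    """P(x; N, R, k) = C(R, x) * C(N - R, k - x) / C(N, k).

    x: probes of the functional class observed in the sample
    N: total probes in the data set
    R: probes of the functional class in the data set
    k: sample size (e.g. the top 100 correlated probes)
    """
    return comb(R, x) * comb(N - R, k - x) / comb(N, k)

def hypergeom_upper_tail(x: int, N: int, R: int, k: int) -> float:
    """Probability of observing x or more class members in the sample
    (an assumed one-sided extension of the density in the text)."""
    upper = min(R, k)
    return sum(hypergeom_pmf(i, N, R, k) for i in range(x, upper + 1))

# Illustrative call: N and R are the values quoted in the text for the
# large probe set and the kinase/phosphatase class; x = 15 is a made-up
# count, not a value taken from Table III.
p_example = hypergeom_upper_tail(15, 21890, 1084, 100)
```

Because `math.comb` returns exact integers, the ratio is computed without intermediate overflow even for the large ''22K'' probe set.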

Generation of Probe Lists for Functional Protein Classes
Affymetrix gene probe sets were downloaded from the ACT Web site and purged of duplicates arising from cross-hybridization. This resulted in a large ''22K'' probe set containing N = 21,890 probes, and a small ''8K'' probe set containing N = 6,134 probes. These files were then searched for annotation text features corresponding to three functional protein classes (see below), resulting in probe lists. Each probe set was then purged of duplicates, with the result that nonredundant probe lists were generated for ''protein kinases/phosphatases'' (R = 1,084 for the large probe set, R = 338 for the small probe set), ''ubiquitin/proteolysis proteins'' (R = 790 for the large probe set), and ''transcription factors'' (R = 1,040 for the large probe set; the small probe set was only utilized for the first functional class). Text search terms for each functional gene probe class were as follows: ''protein kinases/phosphatases'' (''protein kinase'', ''protein phosphatase''); ''ubiquitin/proteolysis proteins'' (''ubiquitin-specific

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. Full-length alignment of SSU72-like proteins.
Supplemental Figure S2. Alignment of CDC14-like catalytic domains.
Supplemental Figure S3. Full-length alignment of CDC14-like proteins.
Supplemental Figure S4. Alignment of CDC25-like catalytic domains.
Supplemental Figure S5. Full-length alignment of CDC25-like proteins.
Supplemental Figure S6. Alignment of LMWPTP phosphatase domains.
Supplemental Figure S7. Alignment of FCP1-like phosphatase domains.
Supplemental Figure S8. Full-length alignment of CPL-like proteins.
Supplemental Figure S9. Alignment of EYA-like phosphatase domains.
Supplemental Figure S10. Alignment of Chronophin-like phosphatase domains.
Supplemental Table S1. Gene expression evidence summary.
Supplemental Table S2. Promoter analysis summary table.