CiliaCarta: An integrated and validated compendium of ciliary genes

The cilium is an essential organelle at the surface of mammalian cells whose dysfunction causes a wide range of genetic diseases collectively called ciliopathies. The current rate at which new ciliopathy genes are identified suggests that many ciliary components remain undiscovered. We generated and rigorously analyzed genomic, proteomic, transcriptomic and evolutionary data and systematically integrated these using Bayesian statistics into a predictive score for ciliary function. This resulted in 285 candidate ciliary genes. We generated independent experimental evidence of ciliary associations for 24 out of 36 analyzed candidate proteins using multiple cell and animal model systems (mouse, zebrafish and nematode) and techniques. For example, we show that OSCP1, which has previously been implicated in two distinct non-ciliary processes, causes ciliogenic and ciliopathy-associated tissue phenotypes when depleted in zebrafish. The candidate list forms the basis of CiliaCarta, a comprehensive ciliary compendium covering 956 genes. The resource can be used to objectively prioritize candidate genes in whole exome or genome sequencing of ciliopathy patients and can be accessed at http://bioinformatics.bio.uu.nl/john/syscilia/ciliacarta/.

Cilia are microtubule--based organelles extending from the surface of most eukaryotic cells, serving 1 critical functions in cell and fluid motility, as well as the transduction of a plethora of sensory and 2 biochemical signals associated with developmental processes 1 . Cilium disruption leads to a wide 3 range of human disorders, known as ciliopathies, characterized by defects in many different tissues 4 and organs leading to symptoms such as cystic kidneys, blindness, bone malformation, nervous 5 system defects and obesity. The cilium is a complex and highly organized structure, typically 6 comprised of a ring of nine microtubule doublets extending from a centriole--derived basal body, 7 and enveloped by an extension of the plasma membrane. Importantly, cilia are compartmentalized 8 structures, with a membrane and internal composition that differs significantly from that of the 9 plasma membrane and cytoplasm 2,3 . Several hundred proteins are thought to be involved in the 10 formation and function of ciliary structures and associated signaling and transport pathways. 11 Between Gene Ontology (GO) 4 and the SYSCILIA Gold Standard (SCGS) 5 about 600 genes have 12 already been associated with ciliary function. However, it is likely that many more ciliary proteins 13 remain to be functionally characterized, as new ciliary genes are uncovered on a regular basis, often 14 in relation to cilia--related genetic disorders. 15 Although omics data sets provide a rich source of information to determine the proteins that 16 constitute an organelle, they are inherently imperfect. They tend to be biased towards proteins with 17 specific properties, or lack coverage and sensitivity. It is well established that large scale 18 approaches often miss key players, for example proteomics fares better at finding intracellular 19 versus extracellular or membrane proteins 6 . Combining data sets into a single resource that 20 exploits their complementary nature is a logical step. The power of such an approach has been 21 demonstrated for various cellular systems such as small RNA pathways 7 , the innate antiviral 22 response 8 , and, most notably, by MitoCarta 9 . The latter is a compendium of mitochondrial proteins 23 based on the integration of various types of genomics data that has been extensively used by the 24 biomedical community 10 . 25 Like the mitochondrion, the cilium has been subject to approaches that exploit signals in genomics 26 data to predict new ciliary genes, e.g. the specific occurrence of genes in species with a cilium 11,12 , 27 the sharing of specific transcription factor binding sites like the X--box motif 13 and the 28 spatiotemporal co--expression of ciliary genes 14 . Furthermore, an existing database, CilDB, contains 29 data from individual genomics, transcriptomics, and proteomics experiments related to the 30 cilium 15 . Although such databases are invaluable to the researcher, it is not obvious how to handle 31 the sometimes--conflicting information presented by independent data sources: i.e. how to weigh 32 the data relative to each other. A powerful solution lies in statistical and probabilistic integration of 33 data sets. Multiple methods exist to combine genomics data, ranging from simply taking the genes 34 that are predicted by most data sources, to machine learning methods that take into account non--35 linear combinations of the data (reviewed in 16 ). In this spectrum, naive Bayesian classifiers take a 36 middle ground. They exploit the relative strengths of the various data sets while maintaining 37 transparency of the integration. For each gene, the contribution of each data set to the prediction 1 can be determined, and new, independent data sources can then simply be added to those already 2 used to improve predictive value and coverage. 3 Here, we present CiliaCarta, an experimentally benchmarked compendium of ciliary genes based on 4 literature, annotation, and genome--wide Bayesian integration of a wide range of experimental data, 5 including a recent large--scale protein--protein interaction data set specifically focused on ciliary 6 proteins 17 . We extend the currently known set of cilia--related genes with 228 putative cilia--related 7 genes to a total of 836 genes. Based on the outcome of the probabilistic integration and the results 8 from the experimental validations, we estimate the total size of the human ciliome to be 9 approximately 1200 genes. Furthermore, we show that objective data integration using Bayesian 10 probabilities is capable of overcoming biases based on other sources or literature. As an exemplar 11 of our approach, we show that OSCP1, which several publications have described as a solute carrier 12 or tumor suppressor, is also a ciliary component, shedding new light on previous observations. 13

14
Data collection and curation 15 We collected and constructed a total of six new data sets from proteomics, genomics, expression 16 and evolutionary data and complemented these with two public data sets ( Table 1). The data sets 17 include three new protein--protein interaction (PPI) data sets based on three methods: tandem-- 18 affinity purification coupled to mass spectrometry (TAP--MS) 17 , stable isotope labeling of amino 19 acids in cell culture (SILAC), and yeast two--hybrid (Y2H) screens. The latter two data sets are 20 published as part of this work and include 1666 proteins (1301 and 365, resp.) describing 4659 21 interactions (4160 and 499, resp.) and an estimated positive predictive value of X. Because the TAP--22 MS and SILAC data sets are based on similar methodology and have a large bait overlap (14 out of 23 16 SILAC baits were used in the TAP--MS data set), we merged them into a single data set (Mass--24 spec based PPI) to avoid bias (see methods). In addition, we created three bioinformatic data sets: 25 (i) a data set describing the presence or absence of conserved cilia--specific RFX and FOXJ1 26 transcription factor binding sites (TFBS) in the promoter regions of human genes based on the 29 27 mammals project 18 , (ii) a gene expression screen for genes that co--express with a set of known 28 ciliary genes, and (iii) a comprehensive co--evolution data set from the presence--absence 29 correlation patterns of genes with the presence of the cilium over a representative data set of 30 eukaryotic species. Supplementing these six data sets we included two published large--scale data 31 sets, covering primary cilia and motile cilia: (i) the Liu et al. data set is a proteomics data set 32 describing the protein content of sensory cilia derived from isolated murine photoreceptor cells 19 ; 33 (ii) the Ross et al. data set is a dynamic gene expression data set describing the up--regulation of 34 genes during ciliogenesis in a time series after shearing off cilia in human lung epithelial cells 20 . 35 Although both data sets are not among the strongest predictors for ciliary function, they are of good 1 quality and, importantly, have a much higher coverage of the human proteome than other 2 published datasets (Supplementary fig. 1). In addition, the eight data sets were selected to ensure 3 independence of the types of evidence and comprehensive coverage of molecular signatures for 4 ciliary genes (for more details see methods). Each data set contains a highly significant signal for 5 the discovery of ciliary genes (Table 1, p--values ranging from 2.1E--14 to 2.4E--69). A full description 6 on each data set is given in the methods section. 7 Bayesian integration of omics data sets provides gene--specific probabilities for 8 cilia involvement 9 By combining the complementary sources of evidence for cilium function, we can in principle, 10 obtain a data--driven, objective, high--confidence compendium of ciliary genes. To integrate the data 11 sets in a probabilistic manner we assessed their efficacy at predicting ciliary genes using the SCGS, a 12 manually curated set of known ciliary genes (the 'positive' set) 5 , and a set of genes whose proteins 13 show non--ciliary subcellular localization and are most likely not involved in ciliary function (the 14 'negative' set, see methods). We divided each data set into appropriate sub--categories that reflect 15 increasing propensities to report ciliary genes (Fig. 1a). Then, for each data category we calculated 16 the likelihood ratios of predicting ciliary genes versus predicting non--ciliary genes using Bayes ' 17 theorem. The final log--summed likelihood ratio, which includes the prior expected probability to 18 observe a ciliary gene in the genome, we here call the CiliaCarta Score. This score therefore 19 represents the likelihood for a gene to be ciliary, based on all the data sets considered (see 20 methods; CiliaCarta and individual data set scores can be found in Supplementary table 1). The  21 integrated score readily distinguishes between the positive and negative set (Fig. 1b, p--value: 2.8e--22 85 Mann--Whitney U test). A ten fold cross--validation demonstrates that the results are highly 23 robust, with an area under the curve of 0.86 ( Supplementary Fig. 2). The top ranking genes are 24 highly enriched for known ciliary genes (Fig. 1c). The Bayesian classifier outperforms any 25 individual data set, while achieving genome--wide coverage (Fig. 1d). 26 Experiments validate ciliary function for 67% of selected candidate genes 27 In order to validate the quality of our approach we performed a series of experimental tests for 28 ciliary function or localization of newly predicted candidate genes and their proteins. We set our 29 inclusion threshold at a False Discovery Rate of 25% (cFDR, corrected for the prior expectation to 30 observe a cilium gene, Methods). 404 genes fall within this threshold, 285 of which were not in the 31 training sets and thus constitute novel candidate ciliary genes. Eight genes from the negative set 32 (HSPH1, CALM3, COL21A1, CALM1, PTGES3, HSPD1, COL28A1, and PPOX) occur in the top 404 33 genes. Literature searches revealed no previous connection to the cilium. Some high scoring non--34 ciliary genes are expected due to stochasticity in the underlying experimental data, but it is possible 35 that some of these genes will turn out to have a ciliary function in the future. The FDR for the 285 36 candidate genes is 33% when excluding genes from the training sets (Methods). From the 1 candidates we selected a total of 36 genes, spread evenly across the top predictions, for 2 experimental validation that were not known to be ciliary at the time ( Table 2). The gene selection 3 was performed as unbiased as possible and was restricted by orthology (zebrafish, C. elegans) and 4 the resources available to the participating labs (see methods). We performed validations in 5 human, mouse, zebrafish and C. elegans by applying six distinct approaches to determine ciliary 6 localization and function. 7 Validation by phenotype 8 In C. elegans we investigated candidate gene associations with cilium formation and function in vivo. 9 In the C. elegans hermaphrodite (959 cells), cilia are found on 60 sensory neurons, most of which 10 are located in the head. These cilia are immotile, eight of which extend from the dendritic endings, 11 forming sensory pores in the nematode cuticle 21 . 68 of the 285 candidate genes have one--to--one 12 orthologs in C. elegans. The apparent low number of orthologs in C. elegans could have a number of 13 reasons, one of which is the absence of motile cilia, which resulted in the loss of ciliary genes 14 related to motility. For 21 of these genes, viable alleles were available from the Caenorhabditis 15 Genetics Center (University of Minnesota, USA) (Supplementary table 2). The available alleles were  16 all nonsense or deletion mutations and predicted to severely disrupt gene function. For each 17 available mutant, we employed dye--filling assays --to indirectly assess the integrity of a subset of 18 cilia (six amphid pairs and one phasmid pair) --foraging (roaming) and osmotic avoidance assays to 19 investigate sensory behaviors 22, 23 . Notably, none of the mutants showed an abnormal osmotic 20 avoidance response, or a highly penetrant dye uptake phenotype (Supplementary table 2), implying 21 that the associated genes are not involved in global regulation of cilium formation or function. In 22 addition all worm mutants had normal physical appearance. Roaming defects were observed for 23 nine mutants, three of which also displayed a mild defect in dye uptake ( Fig. 2a  in the Kupffer's vesicle was examined (Fig. 3a). rab36 and ttc18 morphants show significantly 28 reduced cilia length and number in the Kupffer's vesicle (p<0.001 and p<0.05 resp.), while the 29 srgap3 morphant shows an increased cilia length (p<0.001), but no effect on cilia number. 30 Pronephric ducts are several orders of magnitude larger in all three morphants during the 31 pharyngula period (24 hpf) compared to wild type (Fig. 3b, p--value < 0.001). Furthermore, all three 32 morphants exhibit body--axis defects associated with cilium dysfunction 24,25 (Fig. 3c). Although, 33 zebrafish morpholino phenotypes, and C. elegans phenotypes, do not provide definitive proof that 34 these genes are all ciliary, they provide quantitative support for a substantial enrichment of ciliary 35 genes in the CiliaCarta. 36 Validation by subcellular localization 1 Subcellular localization studies were performed for a total of nine proteins in ciliated hTERT--RPE1 2 cells using eCFP--tagged overexpression constructs ( Table 2, Supplementary Fig.4). Eight proteins 3 localized to the basal body and/or the axoneme. To account for possible localization artifacts, two 4 representative photos are taken per eCFP fusion protein, containing at least one ciliated cell from 5 the same slide. Cells transfected for c15orf22::eCFP and c16orf80 ciliated only when expression of 6 the eCFP fusion protein was low. Fig. 4a shows representative examples, with C20orf26 and 7 CCDC147 localizing to the basal body, whilst IQCA1 was not enriched at the cilium. 8 Four proteins (RIBC2, ARMC3, CCDC113 and C6orf165) were investigated for ciliary localization of 9 the endogenous protein by immunofluorescence microscopy. RIBC2, ARMC3 and CCDC113 were 10 observed to co--localize with acetylated alpha tubulin, a marker for the ciliary axoneme in human 11 respiratory cells, while C6orf165 specifically localized to the base of the cilium (Fig. 4b) . We also 12 tested these four proteins in other model systems to investigate if their ciliary localization is model--13 specific. Location was validated in murine retinal sections by immunofluorescence and at the 14 ultrastructural level via immunogold electron microscopy ( Fig. 4c--f). All four proteins associated 15 with the sensory cilia of photoreceptor cells. However, in hTERT--RPE1 cells only RIBC2 showed 16 localization at the basal body suggesting that ciliary localization can be cell type specific (Fig. 4g). 17 Concurrent to our efforts other labs recently identified ciliary involvement for RIBC2 26 and 18 CCDC113 27 . 19 Overall validation performance 20 Combining the results of all the validation experiments, we observed a ciliary localization or 21 putative cilium--associated phenotype for 24 of the 36 candidates tested ( Table 2). There appears to 22 be a notable performance difference between the localization--and phenotype--based validation 23 assays (PPV of 0.92 and 0.56 respectively, Fig. 4h). This difference is likely attributable to the fact 24 that knockdown of a ciliary gene does not necessarily lead to an observable phenotype. Although 25 we should keep in mind that the nematode and zebrafish ciliary phenotypes require confirmation 26 via rescue experimentation, the overall experimentally determined positive predictive value (PPV) 27 of the Bayesian classifier will be within 40 to 70%, approaching 67% depending on conservative or 28 optimistic interpretations of the presence of false positives or false negatives in our validations (Fig.  29 4h). 30 The observed experimental PPV corresponds with the theoretical PPV of 67%, which corresponds 31 to the FDR of 33% calculated for the candidate genes from the integrated CiliaCarta score (expected 32 validation rate: 24.1 out of 36, p(x=24 given 33% FDR)=0.15, hypergeometric test, Fig. 4h). Even if 33 we assume the lowest PPV of 40%, the experimental validation rate is significantly higher than 34 expected by chance, based on the estimated prior distribution of ciliary versus non--ciliary genes in 35 the human genome (expected 1.8 out of 36, p=3.34e--23 for PPV=67% and p=7.64e--10 for 1 PPV=40%; hypergeometric test). The experimental verification of our candidate genes therefore 2 validates our Bayesian classifier. 3 OSCP1, a novel ciliary protein 4 Unbiased integration of large--scale genomics data can give rise to apparent inconsistencies with 5 previous literature reports. For instance, organic solute carrier partner 1 (OSCP1, or oxidored--nitro 6 domain--containing protein 1, NOR1), scores high on our CiliaCarta list (ranked 402), despite 7 reports of varied functions not obviously consistent with a ciliary role, such as regulation of 8 inflammation, apoptosis, proliferation and tumor suppression [28][29][30] . OSCP1 was first implicated as a 9 tumor suppressor in nasopharyngeal cancer 28 and was later found to modulate transport rates of 10 organic solutes over the plasma membrane in rodents 31,32 . These two facets of OSCP1 function have 11 never been connected to each other in literature. Subsequent independent research indicated a role 12 for OSCP1 in regulation of inflammation and apoptosis 29 , and that it is specifically expressed in 13 mouse testes 33 . The availability of a substantial set of literature on OSCP1 without any indication of 14 ciliary involvement as predicted by our method, would have suggested that OSCP1 could have been 15 a false positive in our Bayesian method. However, our C. elegans phenotype screen suggests that C. 16 elegans oscp--1 (R10F2.5) mutant alleles may possess modest sensory cilia defects ( Fig. 2a & b). 17 Therefore, we targeted OSCP1 for more detailed investigation of OSCP1 in C. elegans, zebrafish and 18 mammalian cells. ciliated cells, indicating a basal body/centriole role for OSCP1 in this epithelial cell type (Fig. 5b). In 27 some cells a clear punctate localization in the cytoplasm can also be observed. 28 To assess the localization of the endogenous OSCP1 protein in various mammalian cells, we used 29 commercially available antibodies from three suppliers (Proteintech, Atlas and Biorbyt). In human 30 multi--ciliated respiratory cells, OSCP1 shows an increased concentration at the base of the cilia (Fig.  31 5c). In murine photoreceptors OSCP1 is localized at the inner segments (Fig. 5d), where it is 32 particularly abundant at the base of the connecting cilium (the equivalent of the transition zone in 33 primary cilia), and at low concentration at the adjacent centriole. Immunoelectron microscopy 34 analysis also shows OSCP1 at the basal body, occasionally at the connecting cilium and sporadically 35 within the inner segment (Fig. 5d). Serum--starved IMCD3 cells (murine collecting duct cells) 1 showed ubiquitous punctate cytoplasmic staining with increased signal at the basal body and along 2 the ciliary axoneme ( Fig. 5e; Supplementary Fig. 6). In ATDC5 cells (murine pre--chondrocyte cells) 3 OSCP1 is mainly localized to the ciliary axonemes, although the ciliary base signal is less apparent 4 ( Supplementary Fig. 6). 5 Together, these exhaustive subcellular localization analyses place OSCP1 at ciliary structures, 6 including the ciliary base and axoneme, although there are some subtle differences between cell 7 types. We also frequently observed extensive non--ciliary signals for OSCP1, consistent with 8 published reports for OSCP1 localizations at other organelles (ER, Golgi, Mitochondria) and within 9 the cytosol 34,35 . 10 OSCP1 is required for cilium formation in multiple zebrafish tissues, but dispensable in C. elegans 11 sensory neurons 12 To investigate possible ciliogenesis roles for OSCP1, we first examined cilia in zebrafish morphants 13 depleted for oscp1 in fertilized eggs. In embryos at 2 days post fertilization (dpf) developmental 14 phenotypes were observed that are often associated with ciliary defects: a curved body axis, small 15 eyes and melanocyte migration defects (Fig. 5f). Likewise, the 4 dpf morphants presented with 16 pronephric cysts, small eyes, heart edema, small heads and short bodies ( Fig. 5f & g). The cilia in the 17 medial portion of the pronephric ducts of oscp1 morphants were shortened and disorganized ( Fig.  18 5h&i), and in the Kupffer's vesicle cilium length and number was decreased compared to control 19 injected larvae (Fig. 5j). Co--injecting human OSCP1 mRNA together with the morpholino partially 20 rescued the observed phenotype, indicating that the observed phenotype is specific for loss of oscp1 21 function ( Fig. 5k and Supplementary Fig. 7). Therefore, in zebrafish, oscp1 is required for cilium 22 formation and associated functions in many tissue types and organs. 23 In contrast, analysis of the oscp--1(gk699) null allele in C. elegans revealed that OSCP--1 is not 24 required for cilium formation. Using fluorescence reporters and transmission electron microscopy, 25 the amphid and phasmid channel cilia appeared to be intact, and full length (Supplementary in the nematode, but not in zebrafish, or differences in cell type requirements (nematode sensory  33 neurons versus zebrafish epithelial cells). Clearly, based on its restricted expression in ciliated cells 34 and localization with the ciliary axoneme, C. elegans OSCP--1 is serving a ciliary function, although 35 the specifics of this role remain to be elucidated. 36 With the current interest in cilia biology it is certain that many new genes will be implicated in 2 ciliary function for several years to come. With the advent of systems biology and the need to 3 understand the cilium as a whole the research community requires an inventory of genes and 4 proteins involved in ciliary structure and function. The obvious sources for such an inventory, GO 4 5 and the SCGS 5 , currently only cover 608 human genes (510 in GO as of December 3rd 2015, 302 for 6 the SCGS). By applying a naive Bayesian integration of heterogeneous large--scale ciliary data sets 7 we have expanded this set by 38% to 836 human genes, adding 228 putative genes to the known 8 cilium gene repertoire ( Supplementary Fig. 8). GO term enrichment analysis indicates that these 9 putative ciliary genes are enriched for genes that are "unclassified" (86 genes, 1.69 fold enrichment, 10 p--value < e--100) suggesting possible new ciliary biology to be discovered. We put forward these 11 putative ciliary genes, together with the SCGS and the GO annotated genes, as the "CiliaCarta", a 12 compendium of ciliary components with an estimated FDR of 10% (Supplementary table 3). This 13 community resource can be used to facilitate the discovery of new cilium biology and to identify the 14 genetic causes of cilia related genetic disease. CiliaCarta therewith has a very different purpose than 15 the SCGS, it serves as a tool for discovery of new genes, rather than as a reference of known cilium 16 genes. 17 CiliaCarta is still likely to be incomplete. Estimates of the ciliary proteome range from one to two 18 thousand proteins depending on the techniques used and on the types of cilia and the species 19 studied 19,37 . In addition to obtaining the CiliaCarta list of proteins, our Bayesian analysis allows us 20 to obtain an objective estimate of the total number of ciliary proteins. Using the posterior 21 probabilities and the outcome from the validation experiments we estimate the size of the ciliome 22 to be approximately 1200 genes (Methods). 23 Predicting the genes responsible for an organelle structure or function poses the question of where 24 we draw the boundary between that organelle and the rest of the cell. For example, does the basal 25 body in its entirety belong to the cilium or are some components to be considered exclusive to the 26 centrioles? Furthermore, one can argue that in the case of the cilium, proteins that are not part of 27 the cilium, but play a role in the transport of proteins to the ciliary base or regulation of cilium gene 28 expression, can be regarded as components of the 'ciliary system'. In practice, what we regard here 29 as a ciliary component depends on the definition within the SCGS, which is used to weigh the data 30 sets, and on our experimental validations. In both we have taken a rather inclusive approach by 31 regarding genes whose disruption cause ciliary phenotypes as ciliary genes. This approach makes 32 our predictions relevant to human disease. Indeed KIAA0753, which falls within our 25% cFDR list 33 of cilium genes, was recently shown to interact with OFD1 and FOR20 at pericentriolar satellites 34 and centrosomes, and gives rise to oral--facial--digital syndrome 38 , a phenotype associated with 35 disruption in the ciliary transition zone and the basal body. 36 Data integration through objective quantification of the predictive value of individual large--scale 1 data sets allows one to find new functions and associations, without bias from previous studies. As 2 a case in point we report here that OSCP1, not previously implicated in ciliary functions based on 3 the existing published literature, is validated as a ciliary protein in four species as determined by 4 multiple independent experimental methods. Re--evaluation of previous experimental evidence on 5 OSCP1 function does not exclude ciliary involvement. Nevertheless, the cilium contains several 6 specific ion--channels in its membrane 39 and the organelle has been implicated to play a role in the 7 development of cancer 40-42 . Therefore connecting OSCP1 to the cilium might provide the missing 8 link to connect previously observed effects of OSCP1 on organic solute in--/efflux and its role in 9 nasopharyngeal cancer. Our results therefore provide a cellular target and biomolecular framework 10 to further unravel OSCP1 function. 11 Systematic integration of heterogeneous large--scale cilia data sets by employing Bayesian statistics 12 combined with medium throughput experimental validation is a powerful approach to identify 13 many new ciliary genes and provides a molecular definition of the cilium. The experimental 14 observations of a potential ciliary role for selected high--confidence candidates, together with the 15 results from the cross validation, indicates that the top tier of the entire ranked human genome is 16 highly enriched for ciliary genes. The genome--wide CiliaCarta Score and ranking as provided here, 17 should therefore make it possible to efficiently and objectively prioritize candidate genes in order 18 to discover new ciliary genes and ciliary functions. 19 Resources using other identifiers (i.e. Entrez, Uniprot) were mapped to ENSEMBL gene IDs using 23 ENSEMBL BioMART (version 71). Orthologs from non--human data sets were mapped using the 24 ENSEMBL Compara ortholog catalogue from ENSEMBL 71, with exception of Sanger sequences 25 from Y2H screens based on the Bovine cDNA library that were mapped to Bovine genome 26 sequences by the BLAT tool 44 in the UCSC genome browser 45 and subsequently mapped to human 27 orthologs using the 'non--cow RefSeq genes track' from this browser. 28

29
Bait protein selection was based on the association of proteins with ciliopathies (including mutant 30 vertebrates showing ciliopathy features), involvement in IFT or part of our candidate list of ciliary 31 proteins. Gateway--adapted cDNA constructs were obtained from the Ultimate TM ORF clone 32 collection (Thermo Fisher Scientific) or generated by PCR from IMAGE clones (Source BioScience) 33 or human marathon--ready cDNA (Clontech) as template and cloning using the Gateway cloning 34 system (Thermo Fisher Scientific) according to the manufacturer's procedures followed by 1 sequence verification. 2 Yeast two--hybrid system 3 A GAL4--based yeast two--hybrid system was used to screen for binary protein--protein interactions 4 with proteins expressed from several different cDNA libraries (see below) as described 5 previously 46 . Yeast two--hybrid constructs were generated according to the manufacturer's 6 instructions using the Gateway cloning technology (Thermo Fisher Scientific) by LR recombination 7 of GAL4--BD Gateway destination vectors with sequence verified Gateway entry vectors containing 8 the cDNA's of selected bait proteins. 9 Constructs encoding full--length or fragments of bait proteins fused to a DNA--binding domain 10 (GAL4--BD) were used as baits to screen human oligo--dT primed retinal, brain (Human Foetal Brain 11 Poly A+ RNA, Clontech), kidney (Human Adult Kidney Poly A+ RNA, Clontech) or testis cDNA 12 libraries, or a bovine random primed retinal cDNA library, fused to a GAL4 activation domain 13 (GAL4--AD). The retina and testis two--hybrid libraries were constructed using HybriZAP--2.1 14 (Stratagene), the brain and kidney two--hybrid libraries were constructed using the "Make Your 15 Own Mate & Plate™ Library System" (Clontech). 16 The yeast strains PJ96--4A and PJ96--4α (opposing mating types), which carry the HIS3 (histidine), 17 ADE2 (adenine), MEL1 (α--galactosidase), and LacZ (β--galactosidase) reporter genes, were used as 18 hosts. Interactions were identified by reporter gene activation based on growth on selective media 19 (HIS3 and ADE2 reporter genes), α--galactosidase colorimetric plate assays (MEL1 reporter gene), 20 and β--galactosidase colorimetric filter lift assays (LacZ reporter gene). 21

DNA constructs and cell culture 23
Experiments were essentially performed as described before 47 . In short, N--terminally SF--TAP--24 tagged bait proteins that were obtained by LR recombination of TAP--destination vector with 25 sequence verified Gateway entry vectors containing the cDNA's of selected bait proteins using 26 Gateway cloning technology (Thermo Fisher Scientific). HEK293T cells were seeded, grown 27 overnight, and then transfected with SF--TAP--tagged bait protein constructs using Effectene 28 (Qiagen) according to the manufacturer's instructions. HEK293T cells were grown in SILAC cell 29 culture medium as described 47 . 30

Affinity purification of protein complexes 31
For one--step Strep purifications, SF--TAP-tagged proteins and associated protein complexes were 32 purified essentially as described previously 47 . In short, SILAC labeled HEK293T cells, transiently 33 expressing the SF--TAP tagged constructs were lysed in lysis buffer containing 0.5% Nonidet--P40, 1 protease inhibitor cocktail (Roche), and phosphatase inhibitor cocktails II and III (Sigma--Aldrich) in 2 TBS (30 mM Tris--HCl, pH 7.4, and 150 mM NaCl) for 20 minutes at 4°C. After sedimentation of 3 nuclei at 10,000 g for 10 minutes, the protein concentration was determined using a standard 4 Bradford assay. Equal protein amounts were used as input for the experiments to be compared. The 5 lysates were then transferred to Strep--Tactin--Superflow beads (IBA) and incubated for 1 hour 6 before the resin was washed 3 times with wash buffer (TBS containing 0.1% NP--40 and 7 phosphatase inhibitor cocktails II and III). The protein complexes were eluted by incubation for 10 8 minutes in Strep--elution buffer (IBA). The eluted samples were combined and concentrated using 9 10--kDa cutoff VivaSpin 500 centrifugal devices (Sartorius Stedim Biotech) and prefractionated 10 using SDS--PAGE and in--gel tryptic cleavage as described elsewhere 48 . 11 Quantitative mass spectrometry 12 After precipitation of the proteins by methanol--chloroform, a tryptic in--solution digestion was 13 performed as described previously 49 . LC--MS/MS analysis was performed on a NanoRSLC3000 Remaining peptides were eluted by a short gradient from 35% to 95% buffer B in 5 minutes. The 24 eluted peptides were analyzed by using a LTQ Orbitrap XL, or a LTQ OrbitrapVelos mass 25 spectrometer. From the high--resolution mass spectrometry pre--scan with a mass range of 300-26 1,500, the 10 most intense peptide ions were selected for fragment analysis in the linear ion trap if 27 they exceeded an intensity of at least 200 counts and if they were at least doubly charged. The 28 normalized collision energy for collision--induced dissociation was set to a value of 35, and the 29 resulting fragments were detected with normal resolution in the linear ion trap. The lock mass 30 option was activated and set to a background signal with a mass of 445.12002 50 . Every ion selected 31 for fragmentation was excluded for 20 seconds by dynamic exclusion. 32 For quantitative analysis, MS raw data were processed using the MaxQuant software 51 (version 33 1.5.0.3). Trypsin/P was set as cleaving enzyme. Cysteine carbamidomethylation was selected as 34 fixed modification and both methionine oxidation and protein acetylation were allowed as variable 35 modifications. Two missed cleavages per peptide were allowed. The peptide and protein false 36 discovery rates were set to 1%. The initial mass tolerance for precursor ions was set to 6 ppm and 1 the first search option was enabled with 10 ppm precursor mass tolerance. The fragment ion mass 2 tolerance was set to 0.5 Da. The human subset of the human proteome reference set provided by 3 SwissProt (Release 2012_01 534,242 entries) was used for peptide and protein identification. 4 Contaminants like keratins were automatically detected by enabling the MaxQuant contaminant 5 database search. A minimum number of 2 unique peptides with a minimum length of 7 amino acids 6 needed to be detected to perform protein quantification. Only unique peptides were selected for 7 quantification. 8 Protein--protein interaction data processing 9 Based on three ciliary protein--protein interaction (PPI) data sets, we inferred proteins to "interact 10 with ciliary components" as a proxy for being part of the cilium. First, we obtained protein complex 11 purification data from a large--scale study on the identification of ciliary protein complexes by 12 tandem--affinity purification coupled to mass spectrometry (TAP--MS) for 181 proteins known or 13 predicted to be involved in ciliary functions 17 . Since integration of this data set into the CiliaCarta 14 many pull--downs were repeated and reverse experiments included. As a result, our data set 15 includes 539 found interactors that are not part of the now published final data set (4702 proteins, 16 The Y2H, SILAC and TAP--MS data were transformed to genome wide data sets by defining genes as 1 1 ("found") when the gene product was found to interact with the baits, and as 0 ("not found") 2 when the gene product was not found to interact. Due to the large overlap in baits and the largely 3 similar methods used in the TAP--MS and SILAC data sets we decided to combine the data sets in 4 order to avoid counting the interacting proteins multiple times and thereby artificially 5 overestimating their CiliaCarta Scores. The mass--spectrometry data sets and Y2H data sets were 6 found to be sufficiently different to include them as separate data sets (positive set correlation is 7 0.13, Supplementary Fig. 10). 8 Expression screen data set 9 In expression screening 63 separate gene--expression data sets are weighted for their potential to 10 predict new genes for a system by measuring, per data set, the level of co--expression of the known 11 genes. We have already successfully applied this method to predict TMEM107 as part of the ciliary 12 transition zone 64 and now extend the approach to the complete cilium. An integrated cilium co--13 expression data set was constructed by applying the weighted co--expression method WeGet 65 to 14 ciliary genes in 465 human expression data sets available in the NCBI Gene Expression Omnibus 66 . 15 For individual genes, correlations of their expression profiles were determined with expression 16 profiles of the set of ciliary components. The contribution of each data set to a final co--expression 17 score per gene was weighed by how consistently the set of cilia components were expressed 18 together, i.e. how well the data set in question is able to detect ciliary components 65 . To avoid 19 circularity with the training of the Bayesian classifier the expression screen was performed using a 20 gene set of ciliary components from GO (GO:Cilium) and removed any overlap from the positive set 21 used to evaluate this data set for the Bayesian classifier. The data set from Ross et al. 20 , which has 22 been included in the Bayesian classifier, has been excluded from the microarray data sets used in 23 the expression screen. 24 Ciliary co--evolution data set 25 Given the large number of independent losses of the cilium in eukaryotic evolution (we counted 26 eight independent loss events throughout the eukaryotic kingdom) 67-69 , presence/absence profiles 27 have a high value for predicting new cilium genes 12,64 . We constructed a comprehensive co--28 evolution data set from a comparative genomics analysis of presence--absence correlation patterns 29 over a representative data set of eukaryotic species. We correlated the occurrence of orthologs of 30 22,000 human genes in 52 eukaryotic genomes to that of cilia or flagella using differential Dollo 31 parsimony (DDP) 70 . A perfectly matched profile pair would obtain a DDP of 0 (all events match, no 32 differences), while mismatching profile pairs would receive a DDP equal to the number of 33 evolutionary events that did not occur at the same time in evolution (e.g. the gene was lost in a 34 lineage still maintaining a cilium, or the gene was maintained in a lineage in which the cilium has 35 been lost). Thus we obtained an objective measure for each human gene that describes how well its 36 evolutionary trajectory (i.e. point of origin and independent loss events) matches that of the ciliary 1 system. A number of mismatches are expected for some ciliary genes; Plasmodium falciparum for 2 instance has maintained a cilium, but has lost all genes of the IFT machinery. Due to the topology of 3 the species tree we observed a complicated distribution of genes from the positive and negative 4 sets ( Supplementary Fig. 11a): for low DDP scores (0--6) we did not observe a single negative gene, 5 which would result in unrealistic log odd scores (i.e. infinity). To avoid these unrealistic log odds, 6 we decided to combine the DDP scores into two categories, namely genes with a DDP ≤ 9 and genes 7 with a DDP ≥ 10. Genes with a score between 0 and 9 were generally overrepresented among ciliary 8 genes ( Supplementary Fig. 11b). 9 Transcription factor binding sites data set 10 The RFX and FOXJ1 transcription factors play an important role in the regulation of ciliogenesis 71, 72 . 11 X--box (RFX) or a FOXJ1 transcription factor binding site (TFBS) have been used to predict novel 12 ciliary genes in Caenorhabditis elegans (nematode) and Drosophila melanogaster (fruit fly) 73-75 . We 13 processed the publicly available data sets from the 29 mammals project 18 to obtain human genes 14 with a conserved X--box or FOXJ1 TFBS in their promoters, which were defined as 4 kilobase (kb) 15 windows centered (i.e. 2kb upstream and 2kb downstream) at all annotated transcription start 16 sites of the gene. The restriction that the TFBS motifs are conserved among mammalian species 17 infers a higher level of confidence that these motifs are indeed relevant and not spurious hits. The 18 final data set was constructed by defining two categories, namely: "Gene has a X--box and/or FOXJ1 19 TFBS", represented as 1, and "Gene does not have a X--box and/or FOXJ1 TFBS", represented as 0. 20 We found relatively limited overlap with the previous invertebrate X--box TFBS data sets: Published data sets 27 There are a number of high--throughput cilia data sets available from the CilDB 15 that can 28 complement the data sets mentioned above. We only included data sets from mammalian species to 29 avoid significant issues with orthology (i.e. avoid mapping to paralogous genes) and which had a 30 broad coverage of the entire genome/proteome (Supplementary fig. 1). We excluded data sets that 31 focused specifically on the centriole/basal body, since this structure is also affected by the cell 32 cycle 77,78 and therefore could potentially skew the Bayesian classifier towards this process. We 33 avoided redundancy in the data sets by selecting only one data set per experiment type (e.g. 34 proteomics, expression). We only considered proteomics datasets specifically generated for the 35 cilium as opposed to whole cell proteomics data sets to minimize false positives. The data from 36 Ross et al. 20 (expression) and Liu et al. 19 (proteomics) were selected based on these criteria. These 1 data sets were extracted from CilDB 15 and implemented using the predefined confidence categories 2 from CilDB (Low confidence, medium confidence, high confidence). Genes not covered by these data 3 sets were assigned to the "not found" category. 4 Training sets 5 We used the SYSCILIA Gold Standard (SCGS) 5 as our positive training set. We did not use GO 6 annotations as we regard the SCGS, that has been annotated by experts in the cilium field, of higher 7 quality Furthermore, having an independent cilium genes dataset allowed us to prevent circularity 8 in the expression screening analysis (see below). We are currently in the process of improving cilia--9 related GO terminology and transferring our SCGS annotations 79 . The SCGS, or positive set, contains 10 a total of 302 manually curated human ciliary genes. We constructed a negative set by selecting 11 genes annotated in GO to function in processes and cellular compartments we deemed least likely 12 to be (in)directly involved in ciliary processes. We selected genes annotated with at least one of the 13 following GO Cellular Component terms: extracellular, lysosome, endosome, peroxisome, ribosome, 14 and nucleolus. We ensured that the positive and negative training sets do not overlap by removing 15 genes found in both from the negative set. Since the majority of human genes are expected to be 16 non--ciliary the similarity between the score distribution of the negative set and the remaining 17 "other genes" indicates that the negative set overall gives an excellent representation of what we 18 reasonably can expect to be non--ciliary genes. The final negative set contains 1275 genes. 19 We adapted the positive training set for the PPI data sets as well as the expression screen data set 20 to avoid overtraining. Since many of the baits used in the PPI data are known ciliary components 21 and therefore part of the positive training set, this could lead to a potential overestimation of the 22 predictive value of the data sets. Therefore we excluded the bait proteins from the positive set for 23 evaluating the Y2H, TAP--MS and SILAC data. The training of the expression screen was performed 24 using a gene set of ciliary components from GO (GO:Cilium) and to avoid overtraining we 25 subtracted the overlap from the positive set used for the Bayesian classifier. In this way we avoided 26 inflation of the predictive values for these data sets. 27 Bayesian classifier 28 Performance and false discovery rate calculations 29 The performance of each data set for predicting ciliary genes as well as the integrated Bayesian 30 classifier was evaluated using the positive and negative training sets. We determined the fraction, 31 or recall, of the positive training set, which is also known as the sensitivity or true positive rate 32 (TPR). 33 1 Where True Positives (TP) are true ciliary genes that are correctly discovered by the data set, and 2 False Negatives (FN) are true ciliary genes that were not discovered by the data set. We also 3 determined the fraction of the negative set retrieved by the data set, also known as the false 4 positive rate (FPR). 5 6 Where True Negatives (TN) are non--ciliary genes that are correctly excluded from the data set, and 7 False Positives (FP) are non--ciliary genes incorrectly discovered in the data set. The FPR together 8 with the TPR are used to calculate the predictive ability for each data set (see below). 9 The FPR is related to the specificity (SP), another well--known metric for the quality of a data set, as 10 follows: 11 12 The false discovery rate (FDR) denotes the chance of encountering a false positive among a set of 13 predictions and thus reflects the trustworthiness of a prediction. The FDR is given by: 14 15 16 We use the FDR to determine the overall performance of the classifier given a specified threshold of 17 the CiliaCarta Score. However, the FDR depends on both training sets and is sensitive to deviations 18 in set size compared to the actual populations of positives and negatives for the whole genome, and 19 thus we need to correct the canonical FDR equation for the differences in the population and set 20 size: 21 22 Throughout the text we refer to this adjusted FDR as the corrected FDR (cFDR). The positive 23 predictive value (PPV) is directly related to the FDR and reflects the probability that a gene within 24 the set threshold will indeed be ciliary: 25 The PPV and FDR therefore reflect the chance of success or failure within a set of genes rather than 1 for individual genes; i.e. they do not reflect the per gene probability for it being ciliary or not. 2 Instead, the per--gene probabilities are reflected by the posterior odds, the CiliaCarta Score, see 3 below. 4 5 At a cFDR cut--off of 25% our rank--ordered list of predictions covers 404 genes. However, this set 6 contains genes from both the negative and the positive training sets, and after excluding those we 7 obtain 285 predictions. This remaining set will however have a different FDR since we removed the 8 training set genes. We obtain the FDR for the remaining 285 genes by: (i) calculating the number of 9 TP and FP based on the cFDR threshold, (ii) subtracting the number of genes from the training sets 10 (TP --genes in positive set, FP --genes in negative set) and (iii) recalculating the FDR using these 11 adjusted TP and FP values. This results in a FDR of 33% for the 285 candidate predictions. 12 13 In our experimental validations 24 out of 36 genes are positive for ciliary phenotype and/or 14 localization. We can derive an observed FDR directly from these numbers, i.e. 24 TP and 12 FP. The 15 observed FDR is thus 12/36 = 33%, the same as the estimated FDR for our candidate genes. 16

Calculation of the CiliaCarta Score using naive Bayesian integration 17
Naive Bayesian integration allows a direct comparison and weighing of many and diverse data sets 18 describing the properties of ciliary genes and integrates these into a single probabilistic score for 19 each gene accommodating for missing data 7,8,80,81 . This approach has been successfully used to 20 predict for instance mitochondrial 81 and innate immunity 8 genes. 21 For a given gene in the human genome we calculate the conditional probability that the gene is 22 involved in ciliary processes given the observed evidence in the data sets. For our purposes, since 23 we have only two possible outcomes (i.e. ciliary vs. non--ciliary) it is more convenient and 24 appropriate to use odds instead. We can write the probability that a gene is ciliary given the 25 outcome of the experiment in data set i ( ) for all data sets j: 26 27 These odds cannot be calculated directly, but we can obtain them using Bayes' theorem by 28 approximating the reverse likelihood ratio L that a gene is observed in the data sets given it is 29 either ciliary, or non--ciliary: 30 31 We can calculate this likelihood directly from the distribution of the training sets. This equates to 1 the ratio between the data sets' true positive rate for ciliary genes (i.e. the proportion of known 2 ciliary genes retrieved), and the false positive rate for non--ciliary genes (i.e. the proportion of 3 known non--ciliary genes retrieved). Using Bayes' theorem we can now obtain the final (posterior) 4 odds from the likelihood ratio L as follows: 5 6 Where is the prior odd, i.e. the odds for a gene to be ciliary if one would randomly sample 7 from the genome: 8 9 And thus we obtain the posterior probability for data sets D1…j: 10 11 Finally we obtain the CiliaCarta Score by log2 transformation of the individual terms to get an 12 additive score: 13 14 The additive nature of the log transformed posterior odds equation makes the contribution of each 15 data set to the final score more insightful. It also makes the score robust against rounding errors 16 otherwise encountered during multiplication of the untransformed odds. The log--odds for each 17 individual data set per sub--category are listed in Supplementary table 7. 18

Determining the prior 19
One of the elements in Bayesian calculations is estimating the prior: the a priori expectation of how 20 many ciliary genes there are in the human genome. Although the ranking of genes does not depend 21 on the prior, it is required to obtain a corrected false discovery rate (see above). Furthermore it 22 gives a meaning to the posterior log odd scores (the CiliaCarta score), i.e. a positive log odd means 23 that the gene is more likely to be ciliary than non--ciliary and a negative log odd means that it is 24 more likely to be non--ciliary. 25 To our knowledge there is no substantiated estimate for the number of genes involved in the cilium. 26 Currently 608 human proteins have been annotated as being part of the cilium in a combination of 27 GO (GO:0005929 Cell Component Cilium & GO:0042384 Biological Process Cilium Assembly, 28 Ensembl biomart as of December 3rd 2015) and the SCGS, but we can reasonably assume the total 1 number of ciliary genes to be much higher. For instance, the ciliary proteome as identified by Liu et 2 al. in the mouse photoreceptor sensory cilium entails 1185 to 1968 proteins, depending on the 3 stringency of the filters applied 19 . We have chosen a prior that we deem to be both reasonable and 4 conservative: 5% of the human genome (i.e. 1135 ciliary genes). The then becomes: 5 6

Conditional independence 7
An assumption of our naïve Bayesian approach is that the outcome of one data set is independent of 8 the outcome of another. This assumption of independence is not always attainable, since for 9 instance gene expression is to some extent biologically correlated to the presence of proteins in the 10 proteomics data sets. Violations of the independence assumption can bias the predictions and can 11 lead to an overestimation of the likelihood scores. However previous work has shown that, 12 regardless of biological correlations between data sets, naive Bayesian integration of genomics data 13 is highly effective to predict novel genes involved in a molecular system 7,81 . Analysis of the 14 correlations suggests that the data sets used to predict ciliary genes are largely complementary 15 ( Supplementary Fig. 10). Several data sets have high correlations, such as ciliary co--evolution and 16 co--expression. However these data sets are methodologically and experimentally completely 17 unrelated and thus the high correlation is purely based on the ability of the methods to predict 18 ciliary genes. 19 Ciliome size estimation 20 The Bayesian framework can be used to obtain a systematic estimate for the total number of ciliary 21 genes in two ways. The first approach involves fitting a new based on the observed validation 22 rate of our experiments; that is, we make use of the discrepancy between the expected and 23 experimentally determined number of hits. We  (24 and 12 resp.). The and for the cFDR can be obtained from the training sets. We then try 29 to find an for which the following equation holds: 30 31 Atypically, in our study the observed FDR and the cFDR are equal (i.e. 33%), which indicates that 1 essentially equals our initially chosen (1135 ciliary genes). 2 The second approach to estimate the ciliome size is based on the Bayesian posterior probabilities 3 obtained for each gene (the CiliaCarta Scores). To determine the expected value (i.e. the number of 4 true positives) among a set of ciliary candidates, we considered each gene to be a random variable 5 with a binary outcome (success, a true ciliary gene, or failure, not a ciliary gene) whose probability 6 of success is defined by . Effectively this corresponds to a Bernoulli process 7 with different probabilities for success or failure for each successive trial. Assuming independence, 8 the expected value E for a set of n binary random variables (i.e. ciliary candidates) is equal to the 9 sum of the expected values of the individual variables. The expected value for an individual binary 10 random variable, in turn, equals its probability of success. From the posterior odds: 11 12 We can obtain the probability that a gene is ciliary by rewriting the above as: 13 14 Since the CiliaCarta Score (CCS) is the log2 of the posterior odds we finally get: 15 16 And thus to obtain the expected value E becomes: 17 18 The expected value for the number of ciliary genes based on the Bayesian posterior probabilities 19 was found to be 1273. It should be noted that the posterior CiliaCarta probabilities, and hence the 20 expected value for a set of genes, depend on the prior estimation of the number of ciliary genes. We 21 have above already established that our chosen prior was accurate based on our validation 22 outcome. Indeed the posterior expected number of ciliary genes is close to the prior expected 23 number of ciliary genes (1135, difference of 138). Averaging these two estimates we arrive at a 24 total number of expected ciliary genes of approximately 1200 genes. 25

Candidate selection 27
The number of genes tested per method, the model organisms and cell--lines were determined by 28 time and resources available to the participating labs. Orthologs in C. elegans were identified for the 29 candidate list. For 21 orthologs null alleles were available from the Caenorhabditis Genetics Center 1 (University of Minnesota, USA). All 21 null mutants were obtained and tested. Candidates for the 2 eCFP localization studies in hTERT--RPE1 cells were randomly selected. The candidates for the 3 immunofluorescence in human lung epithelial and murine retina were selected based on the 4 availability of a suitable antibody in the collaborating labs. Candidates tested in zebrafish were 5 randomly selected based on the presence of unambiguous 1--1 orthologs. An equal spread of the 6 selected candidates throughout the ranked list was taken into account as much as possible as is 7 shown in Fig. 4h. Vectashield containing DAPI (Vector Laboratories). The cellular localization of eCFP--fused proteins 26 was analyzed with a Zeiss Axio Imager Z1 fluorescence microscope equipped with a 63x objective 27 lens. Optical sections were generated through structured illumination by the insertion of an 28 ApoTome slider into the illumination path and subsequent processing with AxioVision (Zeiss) and 29 Photoshop CS6 (Adobe Systems) software. 30

Immunofluorescence microscopy of cells 31
IMCD3 and ATDC5 cells were growth to 80% confluence in DMEM--Glutamax medium with 10% 32 Foetal Bovine Serum. Then cells were Serum--starved for 24 hours and fixed in cold methanol for 5 33 minutes, PBS washed and blocked with 1% Bovine Serum Albumin for 1 hour before incubating 34 with primary antibodies overnight at room temperature. Antibodies and concentrations were anti--  (Cologne,  18 Germany). For pre--embedding electron microscopy, we used biotinylated secondary antibodies 19 (1:150; Vector Laboratories, Burlingame, CA, USA). For immunofluorescence microscopy, the eyes 20 of adult C57Bl/6J mice were cryofixed sectioned, and immunostained as described previously 86 }. 21 We C6orf165 were counterstained with FITC--labeled cone photoreceptor marker peanut agglutinin 25 (PNA 87-89 ). After one--hour incubation at room temperature with the according secondary 26 antibodies and the nuclear marker DAPI, sections were mounted in Mowiol 4.88 (Hoechst,  27 Frankfurt, Germany). Images were obtained and deconvoluted with a Leica LEITZ DM6000B 28 microscope (Leica, Wetzlar, Germany) and processed with Adobe Photoshop CS with respect to 29 contrast and color correction as well as bicubic pixel interpolation. We applied a pre--embedding 30 labeling protocol as previously introduced for immunoelectron microscopy of mouse photoreceptor 31 cells 90-92 . Ultrathin sections were analyzed with a transmission electron microscope (TEM) (Tecnai 32 12 BioTwin; FEI, Eindhoven, The Netherlands). Images were obtained with a charge--coupled device 33 camera (SIS MegaView3, Olympus, Shinjuka, Japan) and processed with Adobe Photoshop CS 34 (brightness and contrast). 35

Zebrafish handling and experiments 1
Wild--type (AB × Tup LF) zebrafish were maintained and staged as described previously in 93 . 2 Antisense MO oligonucleotides (Gene Tools) were designed against the start codons and against 3 splice sites, as described in Supplementary table 6. MOs were injected (4-6 ng) into embryos at the 4 1--to 2--cell stage and reared at 28.5°C until the desired stage. For cilia immunostaining, 6 somite--5 stage or 24 hpf (hours post fertilization) embryos were dechorionated and fixed in 4% PFA 6 overnight (O/N) at 4°C, dehydrated through 25%, 50% and 75%, methanol/PBT (1% Triton X--100 7 in PBS) washes and stored in 100% methanol −20°C. The embryos were rehydrated again through 8 75%, 50% and 25% methanol/PBT washes. Embryos of 24 hpf were permeabilized with Proteinase 9 K (10ug/ml in PBT) for 10 minutes at 37°C, and subsequently refixated in 4% PFA. Prior to 10 immunostaining, embryos were incubated in block buffer (5% goat serum in PBT) blocked with 5% 11 goat serum (in PBT)  Nuclei were stained with Hoechst and embryos were mounted in Citofluor. Z--stack images were 16 captured using a Zeiss 710 Confocal Microscope. For rescue experiments, the aforementioned 17 human OSCP1 cDNA was cloned into pCS2+ using gateway technology. OSCP1 plasmids were 18 linearized using NotI and mRNA was synthesized using Ambion mMessage mMachine kit for the 19 sense strand. 100 pg of mRNA was injected into the cell of one cell-stage embryos. These embryos 20 were subsequently injected with 4 ng oscp1 MO at the two--cell stage, and embryos were allowed to 21 develop at 28.5°C. 22

C. elegans handling and experiments 23
C. elegans were maintained and cultured at 20°C using standard techniques. Mutant strains were 24 obtained from the Caenorhabditis Genetics Center (University of Minnesota, USA); the alleles used 25 are shown in Supplementary table2. Assays for dye uptake (DiI), roaming and osmotic avoidance 26 were performed as previously described 23 . Briefly, for the dye--filling assay, worms were placed into 27 a DiI solution (diluted 1:200 with M9 buffer) for 1 hour, allowed to recover on NGM plates, and then 28 imaged (40x objective, Texas Red filter set) on a compound epi--fluorescence microscope (Leica 29 DM5000b), fitted with an Andor EMCCD camera. For the osmotic avoidance assay, young adult 30 worms were placed within a ring--shaped barrier of 8M glycerol and scored during 10 minutes for 31 worms that crossed the barrier. For the roaming assay, single young adult worms were placed for 32 16 hours onto seeded plates and track coverage assessed using a grid reference. For transmission 33 electron microscopy, oscp--1(gk699) worms were first backcrossed 2 times with wild type worms 34 (to remove unlinked mutations), using primers that flank the gk699 deletion. Day 1 adults were 35 fixed, sectioned and imaged as described previously 23 . Translational reporters were introduced into 36

25
The PPV of the combined validation converges to 0.67, which equals the predicted PPV (0.67, given 0.33 FDR).

26
The asterisks (*) above the x--axis denote the ranks of the candidate genes and proteins tested for ciliary 27 function or localization.  weak phenotype is 51% and embryos with a strong phenotype is 28%. Those percentages change when 6 ng 30 of morpholino are used with 31% of embryos showing with a weak phenotype and 27% with a strong 31 phenotype, (strong and weak phenotypes described in Supplementary Fig. 7). The number of dead embryos 32 increases when 6ng of oscp1 morpholino are used (38%) compared with 4 ng of oscp1 morpholino (10%). 33 We injected zebrafish embryos at one cell stage with 4 ng of oscp1 morpholino and 100 pg of human OSCP1 34 mRNA. The rescue increased the normal phenotype percentage from 8% to 43% and decreased the weak 35 phenotype from 51% to 27% and the strong phenotype from 28% to 9.5%. Significance was determined by 36 χ2 test, p<0.0001.   4 photoreceptor inner segment. Roaming: "++" normal, "--" defective (p<0.0001). Dye uptake: "++" normal, 5 "+/++" slightly reduced uptake, "+" mild reduced dye uptake, "--" defective dye uptake.