ge-CRISPR - An integrated pipeline for the prediction and analysis of sgRNAs genome editing efficiency for CRISPR/Cas system

Genome editing by sgRNA a component of CRISPR/Cas system emerged as a preferred technology for genome editing in recent years. However, activity and stability of sgRNA in genome targeting is greatly influenced by its sequence features. In this endeavor, a few prediction tools have been developed to design effective sgRNAs but these methods have their own limitations. Therefore, we have developed “ge-CRISPR” using high throughput data for the prediction and analysis of sgRNAs genome editing efficiency. Predictive models were employed using SVM for developing pipeline-1 (classification) and pipeline-2 (regression) using 2090 and 4139 experimentally verified sgRNAs respectively from Homo sapiens, Mus musculus, Danio rerio and Xenopus tropicalis. During 10-fold cross validation we have achieved accuracy and Matthew’s correlation coefficient of 87.70% and 0.75 for pipeline-1 on training dataset (T1840) while it performed equally well on independent dataset (V250). In pipeline-2 we attained Pearson correlation coefficient of 0.68 and 0.69 using best models on training (T3169) and independent dataset (V520) correspondingly. ge-CRISPR (http://bioinfo.imtech.res.in/manojk/gecrispr/) for a given genomic region will identify potent sgRNAs, their qualitative as well as quantitative efficiencies along with potential off-targets. It will be useful to scientific community engaged in CRISPR research and therapeutics development.

Scientific RepoRts | 6:30870 | DOI: 10.1038/srep30870 Due to the efficiency and simplicity of sgRNA and Cas9, CRISPR/Cas9 has been successfully utilized to edit genomes of different life forms like humans, mice, nematodes, flies, plants etc. [14][15][16][17] . George Church et al. has demonstrated editing in human genome using CRISPR/Cas9 for the very first time 18 for targeting AAVSI locus fragments. Subsequently, this technique was employed to engineer mouse embryos, and also to grow stem cells in culture 19 . Multiple sgRNAs has also been employed to prompt large inversions or deletions between the double stranded breaks in order to generate simultaneous mutations at more than three genes in Zebrafish 20 . Other studies have also highlighted the use of this approach in food crops like rice (Oryza sativa) and wheat (Triticum aestivum) targeting phytoene desaturase gene efficiently 21 .
This new technology also helped in culminating various genetic disorders and has a potential to create scar less modifications that was not possible earlier 22 . Apart from this, CRISPR has been effectively applied to alter genomes of viruses like Human immunodeficiency virus (HIV) 23 , Human papillomavirus (HPV) 24 , Epstein bar virus (EBV) 25 etc. In addition to target DNA sequences CRISPR/Cas9 from Francisella novicida has been reprogrammed to alter viral RNA i.e + ssRNA virus (hepatitis C virus) in eukaryotic cells leading to inhibition of viral protein production 26 .
CRISPR/Cas based genome editing studies have been flourished in no time and hence lead to development of a few bioinformatics repositories. We have developed "CrisprGE" a central hub to provide comprehensive information of genome editing by this technique. This database harbors genes/genomes edited, organisms, experimental procedures and modifications along with its efficiency 27 . There is another depository "CRISPRz" just having details of genome editing in zebrafish 28 .
Computational approaches have helped the researchers in CRISPR/Cas based genome editing applications. First step in genome editing is the identification of sgRNA 29 , for which few tools have been developed namely GT-Scan 30 , CRISPRdirect 31 , SSFinder 32 and E-CRISPR 33 etc. These tools extract 20 nucleotides sequences adjacent to "NGG" PAM motif in the genomes as sgRNAs. In aditition to identification of sgRNAs more significant approaches are needed to access the activities of Cas9 at off-target sites i.e. sites having few nucleotides difference as compared to the target site. For this in-silico tools CasOT 34 and Cas-OFFinder 35 are available.
Although, above tools identify many sgRNAs for a particular gene/genome but not all sgRNAs have equal efficiency. Infact, their genome editing efficiency may range from least to very high. Therefore, algorithms for selecting highly specific and efficient sgRNA from a pool of identified sgRNAs are highly desired. Few methods have been developed to address this issue like sgRNAdesigner 36 , WU-CRISPR 37 , sgRNAScorer 38 and CRISPRscan 39 . Amongst them, sgRNAdesigner 36 , WU-CRISPR 37 , sgRNAScorer 38 classify library of sgRNAs on the basis of its high and low activity while CRISPRscan is dedicated to predict the actual efficacy of sgRNAs.
However, existing methods have certain shortcomings like these algorithms utilized in house datasets generated on a specific experimental setting. Present tools have exploited limited sequence features and their hybrids, also the set of sgRNA used in each study are different. Prediction tools of sgRNA used different methods for estimating efficacy i.e. some has employed readout assay and others have used phenotype based readout method to predict specificity and efficiency of a particular sgRNA. Existing methods are also limited to design sgRNAs against limited eukaryotic genomes like Homo sapiens, Mus musculus, Danio rerio and Xenopus tropicalis etc.
In the present study, we tried to address the inadequacies in the existing tools. For this we have compiled high throughput experimental data and explored numerous sgRNA sequence features, to see their effect on the activity of sgRNA. Further, we have developed Support vector machine (SVM) based predictive models to calculate qualitative and quantitative efficiency of sgRNAs better than the existing tools. The same is assimilated in a web server "ge-CRISPR" which is an integrated pipeline to predict sgRNAs with high on target and minimum off target activity. This algorithm will be very beneficial to select single sgRNA with high potency from a pool of identified sgRNAs.

Results
By complying data from diverse platforms we have generated 2 pipelines i.e. pipeline-1, which classify sgRNAs into high and low potency and pipeline-2 quantifies calculates actual efficiency of a particular sgRNA. Employing several sgRNA sequence features and using SVM, we have generated predictive models and results of which are specified below.
Performance evaluation of pipeline-1 (geCRISPRc) predictive models during 10-fold cross validation and on Independent datasets. In ge-CRISPR various sgRNA sequence features have been employed to develop predictive models namely composition profile, binary profile, secondary structure, thermodynamic features and their hybrids.
Comparably, for binary profile we achieved enhanced performance in individual features viz. mono and di (1-2-3 degree) rather than their hybrids. Accuracy and MCC in case of mono, di ( Besides 10-fold cross validation, independent dataset was used to evaluate the final performance of the models as described in Table 1. Amongst all, the best performing feature in training/testing i.e. di (1-degree) binary profile performed equally well on independent dataset with accuracy and MCC of 88.80% and 0.78 correspondingly.
ROC plot for validating threshold independent performance of hybrid models. To check the threshold independent performance of various hybrid models, receiver operating characteristic (ROC) curve was plotted using R. ROC displays area under the curve (AUC) which is formed by plotting sensitivity (true positive rate or recall) against 1-specificity (false positive rate or precision). It determines which of the used models predicts the given classes best. AUC values of above mentioned features ranges from 0.54-0.93 and the best values were obtained for hybrids of composition mono-di-tri (0.88); dinucleotide (1-degree) binary (0.92) and hybrid of composition mono-di-tri and dinucleotide (1-degree) binary (0.93). AUC of 1 for a model determines its goodness therefore; models with higher AUCs are preferred more than with lower AUCs. ROC plot of hybrids is depicted in Fig. 1.
Performance evaluation of pipeline-2 (geCRISPRr) predictive models during 10-fold cross validation and on independent datasets. We have utilized 22 sequence features to calculate the quantitative efficiency of sgRNA using SVM. During 10-fold cross validation on training/testing (T 3619 ) dataset, we obtained PCC of 0.32, 0.33, 0.44, 0.44 and 0.34 respectively for mono, di, tri, tetra and penta nucleotides. Whereas hybrids features like mono-di, mono-di-tri, mono-di-tri-tetra, and mono-di-tri-tetra-penta nucleotides increased PCC to 0.52, 0.58, 0.58 and 0.57 respectively. Likewise in classification, predictive models based on regression performed best on binary profile and increased PCC to 0.68 at di (1 degree) and also with mono-di (1 degree) binary feature. Thermodynamic properties exhibited correlation of 0.44 and secondary structure did not perform well. Apart from these we also checked the performance of all hybrid features, which are briefed in Table 2.
In addition to 10-fold cross validation, we used 520 sgRNAs for independent validation of the models. All the predictive models performed equally well on independent datasets. Best models of composition and binary profiles displayed PCC of 0.56 and 0.69 respectively. It shows that predictive models are well trained and can be employed to predict efficiency of unknown sgRNA. Results of all the sgRNA features on validation dataset are summarized in Table 2.
Analysis of sgRNAs using nucleotide position. In order to analyze which nucleotides are favoured in sgRNAs, we studied the frequency of each of the four nucleotides i.e A, T, C, and G at 20 positions of sgRNA upstream to PAM using two sample logos. We have utilized 1021 positive (having > 50% efficiency) and 1069 negative (< 5% efficiency) sgRNA sequences. Positional analysis revealed that Guanine is favoured at 1 st , 2 nd and 20 th position followed by Cytosine at 16 th and 18 th position and Adenine at maximum positions. Thymidine has  Web server. geCRISPR is freely accessible at http://bioinfo.imtech.res.in/manojk/gecrispr/. In order to predict the best sgRNA, user need to input desired gene or genome sequence in fasta format. The output displays potential sgRNA with PAM, start and end position, GC percentage and potency. We have integrated 2 pipelines that calculate efficiency both qualitatively as well as quantitatively along with off-targets associated with that   Table 2. Performance of pipeline-2 (geCRISPRr) predictive models on Training/testing (T 3619 ) and independent datasets using (V 250 ) Support vector machine during 10-fold cross validation. PCC, Pearson correlation coefficient. particular sgRNA. Integrated pipeline for the prediction and analysis of sgRNA genome editing efficiency operates in 4 steps i.e sgRNA scanner, pipeline-1, pipeline-2 and off-target analysis.
sgRNA scanner. For this we have developed in-house perl script. User is required to paste single or multiple genes/genomic segments in FASTA format in the text box provided. sgRNA scanner identifies putative sgRNAs by scanning the user provided gene/genome in both sense and antisense strand on the basis of NGG PAM sequence. It then displays all possible sgRNAs (20-nucleotide) upstream to PAM. One can select from any of the given pipelines i.e. pipeline-1 and pilpeline-2.
Pipeline-1 (geCRISPRc). It will help to classify sgRNA scanned by sgRNA scanner into high and low potency (qualitatively). Output display serial no. of sgRNA, sgRNA sequences, PAM, start and end coordinates, GC%, sgRNA potency respectively. By clicking on an individual entry it will display complete sgRNA profile along with the secondary structure, sgRNA map and off-targets for that particular sgRNA as depicted in Supplementary Fig. S1.
Pipeline-2 (geCRISPRr). This algorithm will help to quantitatively predict the actual efficiency of sgRNA in terms of percentage (0-100%). Similar to pipeline-1 output of pipeline-2 exhibit serial no. of sgRNA, sgRNA sequences, PAM, start and end coordinates, GC% but instead of qualitative potency it will calculate the quantitative efficiency of an sgRNA for genome editing (Fig. 3a,b).
Off-target analysis. Efficient sgRNA should target on desired site (on-target) with no or partial homologous sites (off-targets) elsewhere in the target genome. An analysis tool is integrated to screen on/off-targets for the particular sgRNA in the model organisms i.e. Homo sapiens, Mus musculus, Drosophila melanogaster, and Denio rerio. In this study, standalone blast (blastn algorithm) was used to map putative sgRNAs to whole genomes. This is accomplished in two modes: Complete-sgRNA searching mode. Using the complete sgRNA sequence of 23nt including NGG (PAM) motif to evaluate potential off-target sites in both strands of target genome with the search criteria i.e. percent identity with target should be more than 90% (~1 or 2 mismatch allowed).
Seed-sgRNA (~12nt + 3nt (PAM)) searching mode. In this, only a seed region 29 (i.e. 12nt 5′ proximal to the NGG PAM motif) of particular sgRNA is utilized to screen potential off-target in both strands of target genome. As any mismatch in seed region is the most critical determinant of specificity than the distal (non-seed) ~8 nucleotides therefore, 100% sequence identity parameter was used.
Off-target analysis is represented in two ways. First, as tabular output, that integrates information of Query-ID, Subject-ID, Percent Identity, Alignment length, Mismatches, Gap opens, Query start-end, subject start-end, e-value and score. Second, it provides detailed output with mapped sequence of query sgRNAs and potential off-targets with the details of gaps and mismatches as illustrated in Fig. 3c. Comparison with existing sgRNA prediction algorithms. As already mentioned, a few tools for predicting the efficiency of sgRNA have been developed. We have compared models generated by us with existing servers. These web servers are developed using their in house data and in comparison we have utilized available high throughput data on a single platform to build universal model which may be applicable to all. In addition to this we have performed both classification and regression analysis that is not done before. We compared our ge-CRISRc with the algorithm developed in classification mode e.g. sgRNA designer, WU-CRISPR and sgR-NAScorer. Moreover, for regression analysis we compared geCRISPRr with CRISPRscan. We have employed various sequence features and results of our best models are shown in Tables 3-4 while details of all features used are given in Supplementary Tables (S1-4). Our algorithms (classification and regression) performed better than existing ones despite having data of different platforms.

Discussion
Genome editing remains an important scientific endeavor to make desired changes in the genomic regions 40 . It has been accomplished successfully using various nucleases like like zinc fingers nucleases (ZFs) and transcription activator-like effector nucleases (TALENs) etc 41,42 . Initial methods used customizable DNA binding domains fused with FokI nuclease domain that cuts non-specifically. Although these approaches have made several progresses but still have their own limitations i.e. need for engineering new enzyme for every target site, which is very costly and time consuming. In 2012 Doudna et al. have demonstrated the use of bacterial CRISPR-Cas9 system in genome editing for the first time 9 . This method provides specificity and multiplicity, which is far beyond the capacity of the earlier methods. Due to its ease this method has also been selected as the "Breakthrough of the year" in 2015 by Science magazine 43 .
To access the ability of CRISPR/Cas for genome editing identification of sgRNA adjacent to NGG PAM motif is the first step. For which several Insilco tools i.e. GT-Scan 30 , CRISPRdirect 31 , SSFinder 32 and E-CRISPR 33 have been developed. These tools although identify putative sgRNAs in a genome but does not predict their editing efficiency. Subsequently, focus shifted on to develop such algorithms, which can predict the actual efficiency of sgRNA.
A few tools have been developed to design active sgRNAs but on different platforms. For example, sgRNAdesigner 36 is a method in which authors have constructed predictive models to calculate activity of sgRNAs against 6 human and 3 mice genes. By utilizing their dataset (1841) we developed SVM models and achieved better accuracy and MCC of 97.28% and 0.95 respectively (Table 3, Supplementary Table S1). Later on, Wong et al. developed another webserver WU-CRISPR 37 using same dataset with AUC of 0.92. Our geCRISPRc method performed better with AUC of 0.99 even in comparison to WU-CRISPR. In another study, Chari et al. 38 have also analyzed sequence features to identify sgRNA activity of ~1400 genes of human and mice with accuracy of 73.2% (Cas9 sp ) and 81.5% (Cas9 st1 ). Again on the same dataset (Cas9 sp 279, Cas9 st1 171), we obtained comparatively better results with an accuracy of 76.7% and 83.44% correspondingly ( Table 3, Supplementary Tables S2-3). Finally, we compiled diverse dataset of sgRNAs from different platforms in geCRISPRc and managed a better performance with an accuracy and MCC of 81.17% and 0.75 respectively (Table 3).
Simultaneously, Mateos et al. have developed CRISPRScan 39 a regression based method for predicting actual efficiency of an sgRNA instead of just classifying it as either effective or non-effective. This tool was developed using in-house 1020 sgRNAs against 128 genes of Danio rerio and attained a correlation of 0.45 on training dataset. We have exploited this dataset also and developed a regressive model that attained similar performance. Contrary to this, our geCRISPRr algorithm based on larger dataset (3619) achieved far better performance with PCC of 0.68 (Table 4, Supplementary Table S4).
We have also cross checked the performance of earlier predictive models (developed on a particular dataset) on another experimental sgRNA dataset. We observed that predictors which are made on similar platforms performed well on each other but they did not achieve better results otherwise. Therefore, there is an irresistible need to develop algorithms using sgRNA datasets from diverse experimental conditions. Here, we attempted to develop generalized predictive SVM models that works on heterogeneous data of experimental sgRNAs in both classification and regression mode.
For ge-CRISPR, we have exploited various sgRNA sequence features like compositional profile, binary profile, thermodynamic properties, secondary structure and their hybrids. Amongst these binary features performed best followed by thermodynamic and compositional profile, whereas secondary structure did not achieved satisfactory results. Moreover, hybrid features achieved similar performance to that of binary. Of the binary features explored, 1-degree binary have shown highest accuracy and correlation. It could be attributed to the fact that in 1-degree binary both composition and neighbouring residue positions are taken into account simultaneously. Similar observations of binary features have also been reported by others. Apart from the development of SVM based predictions these models have been integrated into a user-friendly web server i.e. "ge-CRISPR". It works in a pipeline like it firstly scan genomes on the basis of "NGG" motif and extract 20 nucleotides upstream to this motif. This will limit off target upto some extent as unique 20 nucleotides target sequence and NGG will not occur more likely on other locations of the genome. After extraction of all possible sgRNAs another algorithm geCRISPRc will differentiate these sgRNAs into high and low potency along with its GC content and start-end position in the genome. geCRISPRr will allow users to have quantitative estimate of an individual sgRNA by calculating its efficiency in range of 0-100%. Additionally, off target tool is also incorporated to further know off targets associated with sgRNA in the seed region or in complete sgRNA if any. This integrated pipeline will surely help to have highly active sgRNA for intended targeting of genomes prior to experimental validation.
Future developments and conclusion. We will further update our algorithm to increase its predictive potential once enough data accumulates. It would be interesting in future to explore the role of other features like dynamical modeling or spatial effects [44][45][46] etc. to analyze activity of sgRNAs. For example Sciabola et al. have used 3D structural information of siRNA duplex as a descriptor to determine the activity of siRNA 47 . Lu et al. have also developed a tool to explore and annotate 3D structures of molecules like viral tRNA mimic, SAM-1 riboswitch, Cas9-sgRNA-DNA etc. 48 .
The performance of ge-CRISPR pipeline is better than earlier methods as it holds potential to predict both quantitative and qualitative efficiencies of sgRNAs for diverse organisms. We foresee that this integrated pipeline will help the scientific community in accelerating CRISPR based genome editing and therapeutics.

Methods
Data Curation. We have extracted sgRNA sequences from different studies namely sgRNAdesigner (1841 and additional 1278) 36 , CRISPRScan (1020) 39 and sgRNAscorer (430) 38 . For sgRNA designer, data was freely available in supplementary information in the form of sgRNA sequences (1841, 1278) and their respective scores. For sgRNAScorer, data was available in supplementary information as high and low activity sgRNAs (Cas9 sp 279, Cas9 st1 171). For CRISPRscan, the author has kindly provided the dataset with sgRNA sequences (1020) and efficiency on our request. Overall data was categorized for classification and regression studies. In classification mode, total of 2090 sgRNAs (including potent 1021 sgRNAs having > 50% efficiency and 1069 sgRNAs with < 5% efficiency) were used. Further, overall sgRNA sequences were divided into Training/Testing (T 1840=895pos+945neg ) and Validation (V 250=126pos+124neg ) datasets. In regression mode, 4139 sgRNAs (efficiency from 0 to 100%) were splitted into Training/Testing (T 3619 ) and Validation (V 520 ) datasets. Workflow of geCRISPR is represented in Fig. 4. Composition profile. Composition profile is the frequency of each nucleotide in an sgRNA i.e. A, T, G, C. The main purpose of calculating nucleotide composition is to change alterable length of sequences into a fixed length vector 49 . This is the critical step as any machine learning technique (MLT) involves vectors of fixed length. We explored mono to penta nucleotide compositional profiles corresponding to 4, 16, 64, 256, and 1024 vectors respectively.
Binary profile. Earlier studies have utilized positions of nucleotides to calculate the activity of sgRNAs 37 . We also studied binary profiles to extract features of sgRNAs on the basis of occupancy of nucleotides in sgRNA. Different binary patterns were used i.e. mono-binary and di-binary (1, 2, 3 degree). In mono-binary profile, probability of occurrence of each nucleotide (A, T, G or C) at every position was studied. Moreover, for di-binary profile, sequences were examined on basis of 1 degree in which probability of occurrence of all contiguous residues was calculated. Additionally, for 2 degree and 3 degree di-nucleotides with all 2 nd and 3 rd adjacent residues were evaluated.
Thermodynamic properties. Thermodynamic stability of nucleotides is important as mentioned in previous studies on siRNA and sgRNA 37,50 . We have also incorporated this feature to check the thermodynamic stability of sgRNA sequences. In total 20 features were combined to calculate Gibbs free energies of different sets of interacting nucleotides.
Secondary structure property. Secondary structure of RNA represents the ability of a molecule to form intramolecular bonds for its stabilization. We calculated equilibrium base-pairing probabilities and minimum free energy (MFE) of sgRNA sequences 39 by utilizing RNA fold from Vienna RNA package 51 . Output is represented in the form of bracket and dots that symbolize pairing or unpairing of nucleotides respectively. These features were further converted into binary values i.e. 1or 0 for SVM input. Hybrid approaches. Despite using the above-mentioned features individually, in past hybrid approaches were explored 49 . In this algorithm we have also studied different hybrid features of composition and binary profile along with thermodynamics and secondary structure. Detailed list of all the features used is listed in Table 1. Support Vector Machine. Support Vector machine (SVM) is a supervised MLT used for classification and regression analysis based on statistical and optimization theory. SVM light (v6.02) (http://svmlight.joachims.org/) package was used for developing predictive models in various algorithms like QSPpred 52 , AVPpred etc. 53 . It develops models by classifying patterns in training/testing data that helps to consign categories to unidentified sequences. We used Radial basis function (RBF) kernel on varied g and c values.

10-fold cross validation.
For evaluating performance of all modules, we performed 10-fold cross validation. In this approach, the whole dataset is randomly divided into equally sized ten sets. Further, from this one set is used for testing and the remaining nine sets are considered for training and this process is then iterated ten times so that each dataset is being tested once.
Performance parameters. Performance of various models was calculated using threshold dependent and independent parameters. In threshold dependent module, we analysed performance in terms of 1) Sensitivity, 2) Specificity, 3) Accuracy and 4) MCC as calculated using following equations: Sensitivity. This parameter measures the test's capability to correctly predict positive sgRNAs from actual positive sgRNAs.
where, TP is correctly predicted positive sgRNAs and FN is falsely predicted negative sgRNAs.
Specificity. Specificity refers to test's ability to correctly predict negative sgRNAs from actual negative sgRNAs.
Accuracy. Accuracy determines the success of correctly predicted sgRNAs from overall data. It defines the closeness of predicted results to the true value and the reproducibility of the same prediction.  Mathew's correlation coefficient (MCC). MCC is a measure of calculating correlation between the observed and the predicted classification. It reconciles the imbalance between positive and negative datasets. Its value ranges from − 1 to + 1 where + 1 signifies perfect correlation, 0 depicts no well than random correlation and -1 signifies total divergence between predicted and observed classification. In threshold independent module we employed ROC to study performance of various SVM models. We generated ROC plots depicting area under curve (AUC) using ROCR statistical package available in R 54 .
Sometimes another parameter of evaluation is used called as Pearson's correlation coefficient (R). This predicts the correlation between the actual and the predicted data i.e. how the change in one variable is affecting the other variable linearly. The value of correlation ranges from perfect correlation of + 1 to total negative correlation of − 1. where n represents size of the test set while Ei pred and Ei act are the predicted and actual efficiencies of sgRNAs respectively.
Two Sample Logo (TSL). TSL was created using online Two Sample Logo tool 55 . This tool helps to graphically represent position specific differences in nucleotides between positive and negative dataset statistically. TSL contains a stack of symbols in which height of each stack represents relative frequency of a nucleotide whereas height of symbol indicates sequence conservation at that position 52 . In this study, we have utilized TSL to analyze the preference of nucleotides on the 20 positions of sgRNA. Upper section of this displays enrichment of nucleotides and lower sections signify depletion of a particular nucleotide in the given dataset at a particular position.