PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting

Plant-pathogenic Xanthomonas bacteria secrete transcription activator-like effectors (TALEs) into host cells, where they act as transcriptional activators on plant target genes to support bacterial virulence. TALEs have a unique modular DNA-binding domain composed of tandem repeats. Two amino acids within each tandem repeat, termed repeat-variable diresidues (RVDs), bind to contiguous nucleotides on the DNA sequence and determine target specificity. In this paper, we propose a novel approach for TALE target prediction to identify potential virulence targets. Our approach accounts for recent findings concerning TALE targeting, including frame-shift binding by repeats of aberrant lengths, and the flexible strand orientation of target boxes relative to the transcription start of the downstream target gene. The computational model can account for dependencies between adjacent RVD positions. Model parameters are learned from the wealth of quantitative data that have been generated in recent years. We benchmark the novel approach, termed PrediTALE, using RNA-seq data after Xanthomonas infection in rice, and find an overall improvement of prediction performance compared with previous approaches. Using PrediTALE, we are able to predict several novel putative virulence targets. However, we also observe that no target genes are predicted by any prediction tool for several TALEs, which we term orphan TALEs for this reason. We postulate that one explanation for orphan TALEs is incomplete gene annotation and, hence, propose to replace promoterome-wide scans by genome-wide scans for target boxes. We demonstrate that known targets from promoterome-wide scans may be recovered by genome-wide scans, whereas the latter, combined with RNA-seq data, are able to detect putative targets independent of existing gene annotations.

Many crop plants, including rice, can be infected by Xanthomonas bacteria, which cause disease in the affected plants and substantial yield losses. Many strains of Xanthomonas oryzae pv. oryzae (Xoo) and Xanthomonas oryzae pv. oryzicola (Xoc) express a specific type of effector protein called transcription activator-like effectors (TALEs). TALE proteins function as transcription factors in infected host cells [1], and contain a nuclear localization signal, a DNA-binding domain, and an activation domain. The DNA-binding domain consists of tandem repeats that bind to the promoter of plant target genes. Each repeat consists of approximately 34 highly conserved amino acids (AAs), except for the amino acids at positions 12 and 13, which are termed the repeat-variable diresidue (RVD) and are responsible for DNA specificity. The repeat domain forms a right-handed superhelical structure, while the RVD is situated within a loop accessing the DNA [2,3]. Each RVD binds to one nucleotide of the target box [4,5], where amino acid 13 binds to the sense strand and amino acid 12 stabilizes the repeat structure. Hence, the specificity of each TALE is determined by its RVD sequence. In addition, most known target boxes are directly preceded by a 'T', while 'C' and 'A' occur with decreasing frequencies; this position is also referred to as "position 0" of the target box.

Some repeats deviate from the common length of 34 AAs and have, for this reason, been termed aberrant repeats. Aberrant repeats may loop out of the repeat array when a TALE binds to its DNA target box and by this means allow for increased flexibility, also binding to frame-shifted target boxes [6].

Different Xoo and Xoc strains express different repertoires of TALEs, where a single strain may host up to 27 TALEs [7-10].
Naturally occurring TALEs may activate susceptibility (S) genes that are responsible for bacterial growth, proliferation and disease development, but also disease resistance (R) genes [1].

The names of TALEs and TALE classes are based on the nomenclature introduced by the tool AnnoTALE [11]. TALEs are clustered according to the similarity of their RVD sequence and divided into classes.

Target boxes upstream of all known major virulence targets are located in forward orientation relative to the transcription start site (TSS). Recently, target boxes of TALEs have been reported to be also functional in reverse orientation relative to the TSS of their target gene [12,13]. However, reverse binding seems to be an exception rather than a general rule [13]. Accurate predictions of target boxes of TALEs are important for studying naturally occurring TALEs and determining their virulence targets, but also for the identification of target and off-target sequences of artificially designed TALEs. Over the last years, several tools have been designed for the in-silico prediction of TALE target boxes based on the RVD sequence of a given [...] interactions [16]. It differs from Target Finder in deriving specificities of rare RVDs from those of common RVDs with the same 13th amino acid. Target sequences may only begin with nucleotide T or C, with a lower score assigned in the case of cytosine. In addition, Talvez may explicitly model that mismatches are tolerated to a larger degree if these are located near the C terminus [17]. Users of Talvez can choose between web-based and command line applications.

TALgetter [18] uses a local mixture model to predict TAL target sequences. The specificities were learned from 267 pairs of TALEs and target sites with qualitative information whether the pair is functional or not. According to Streubel et al.
[19], the efficiencies of different RVDs are non-identical. The TALgetter model adopts a similar concept using an importance term, which is learned independently from the specificity of each RVD. TALgetter is implemented within the Java framework Jstacs [20], and is available as an online and a command line program.

In the web tool SIFTED [21], specificity data from a large-scale study using protein-binding microarrays (PBMs) were used for training model parameters. For this purpose, 21 TALEs constructed exclusively from the four most common RVDs (NI, HD, NN, NG) were designed and their binding specificity measured on ≈ 5,000-20,000 DNA sequences per protein using PBMs. However, we will not consider SIFTED in the remainder of this manuscript, as the SIFTED web server is currently unavailable and [...] [19,22], specificities at position 0 of target boxes [23], complete exploration of all possible combinations of amino acids at RVD positions [24,25], and systematic analyses of those RVDs frequently used in designer TALEs [21].

In this paper, we aim at developing a novel approach for modelling TALE target specificities based on these quantitative data. This approach, called PrediTALE, explicitly captures putative dependencies between adjacent RVDs, dependencies between the first RVD and position 0 of the target box, and also includes positional effects of mismatch tolerance. In contrast to previous approaches, model parameters are adapted by minimizing the difference between prediction scores and quantitative measurements for pairs of TALEs and target boxes. Like previous approaches, PrediTALE also predicts target boxes in reverse strand orientation relative to the TSS, but applies a small penalty term in this case, following the assumption that functional reverse target boxes are rather rare in planta. PrediTALE is the first approach to account for aberrant repeats when predicting TALE targets.

Training data

Pairs of TALEs and putative target boxes were collected from systematic, quantitative experiments reported in [19,22-25]. Data were further processed as detailed in Supplementary Text S1. In brief, data were grouped by TALE, and the global weight was computed as the maximum assay value for the current TALE divided by the maximum assay value reported for all TALEs with the same 13th AA at any position in the current assay. Target values were computed as the assay value of the current pair of TALE and target box divided by the maximum assay value over all tested target boxes for the current TALE.
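The normalization described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of Supplementary Text S1: the function name, the record layout, and the `reference_group` mapping (a hypothetical stand-in for "all TALEs with the same 13th AA at any position in the current assay") are assumptions, and assay values are assumed to be positive.

```python
from collections import defaultdict


def normalize_assay(records, reference_group):
    """Compute target values and global weights for one assay.

    records: list of (tale_id, target_box, assay_value) tuples.
    reference_group: dict mapping tale_id to a key that identifies the set
        of TALEs it is compared against (hypothetical simplification).
    Returns a list of (tale_id, target_box, target_value, global_weight).
    """
    # Maximum assay value observed for each TALE.
    max_per_tale = defaultdict(float)
    for tale, _box, value in records:
        max_per_tale[tale] = max(max_per_tale[tale], value)

    # Maximum assay value over all TALEs of the same reference group.
    max_per_group = defaultdict(float)
    for tale, tale_max in max_per_tale.items():
        key = reference_group[tale]
        max_per_group[key] = max(max_per_group[key], tale_max)

    normalized = []
    for tale, box, value in records:
        target_value = value / max_per_tale[tale]
        global_weight = max_per_tale[tale] / max_per_group[reference_group[tale]]
        normalized.append((tale, box, target_value, global_weight))
    return normalized
```

With this layout, a TALE whose best assay value is only half of the best value in its reference group receives a global weight of 0.5, so its measurements contribute less to the fit without being discarded.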

While the normalization of target values has a mostly technical background, as it simplifies the selection of initial values during numerical optimization of our model (see below), the definition of global weights influences the optimization result. The choice of global weights has been motivated by the observation that some TALE architectures (e.g., those with long successions of identical RVDs, or 12th AAs not occurring in nature) show a generally lower activity than others, which also affects the influence of measurement noise and, hence, the reliability of assay values. With the choice of global weights proposed here, the influence of such TALEs on the final optimization result is reduced, while such TALEs do not need to be completely removed from the training set.

As detailed in Supplementary Text S1, PBM experiments from [21] were filtered for apparent data quality, normalized log-intensities were used as target values, and global weights were defined uniformly for all putative target boxes from a common PBM [...]

Oryza sativa ssp. japonica cv. Nipponbare was grown under glasshouse conditions at 28°C (day) and 25°C (night) at 70% relative humidity (RH). Leaves of 4-week-old plants were infiltrated with a needleless syringe and a bacterial suspension with an OD600 of 0.5 in 10 mM MgCl2 as previously described [26].

Rice cultivar Nipponbare leaves were inoculated with Xoo strains PXO83, PXO142, ICMP 3125ᵀ, or MgCl2 as mock control in five spots in an area of approx. 5 cm using a needleless syringe. Two leaves of three rice plants each were inoculated for each strain and control, respectively. 24 h later, samples were taken, frozen in liquid nitrogen, and RNA was prepared. Three replicates of this experiment were done on separate days and subjected to RNA-seq analysis separately.

[...] "-single -b 10 -l 200 -s 40" and the cDNA sequences available from http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.cdna. Differentially expressed genes relative to the respective control samples were determined by the R-package sleuth [30].

For the Xoo strains and the respective mock control, replicates have been paired during library preparation and sequencing. Hence, the replicate was considered as an additional factor when computing p-values of differential expression for the Xoo samples but not for the Xoc samples. Differential expression was aggregated on the level of genes using the parameter target_mapping of the sleuth function sleuth_prep(), and b-value, p-value, and Benjamini-Hochberg-corrected q-value were recorded. The b-value reported by sleuth when applying a Wald test is actually a biased estimator of the log-fold change. However, as the latter is the more commonly understood term, we refer to the b-value as "log-fold change" in the remainder of this manuscript. Gene abundances and sleuth outputs with regard to differential expression are provided as Supplementary Tables T and U, respectively. RNA-seq reads were also mapped to the rice genome (MSU7) to obtain detailed information about transcript coverage. To this end, adapter-clipped and quality-trimmed reads were mapped using TopHat2 v2.1.0 [31], and the resulting BAM output files were processed in further analyses described below.

[...]

The general idea of the model proposed here is to model the total binding score of a putative target box x given the RVD sequence r of a TALE as a sum of contributions of i) binding to the zero-th repeat, ii) binding to the first RVD, and iii) binding to the remaining RVDs, where the latter two terms may be weighted by an additional, position-dependent but sequence-independent term.
Here, $\theta = (\theta_0, \theta_1, \theta_m, \theta_p)$ denotes the sets of real-valued parameters of the terms for binding to the zero-th, first, and remaining repeats, and of the position-dependent term, respectively.
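Written out, the three contributions named above could take a form like the following. This is a sketch derived from the verbal description only: the weight symbol $p_\ell(\theta_p)$ for the position-dependent term and the index range of the sum are assumptions, while $m_0$, $m_1$, and $m$ refer to the individual binding terms of the model.

```latex
s(x \mid r, \theta) =
    m_0(x_0 \mid r_1, \theta_0)
  + p_1(\theta_p)\, m_1(x_1 \mid r_1, \theta_1, \theta_m)
  + \sum_{\ell=2}^{L} p_\ell(\theta_p)\, m(x_\ell \mid r_{\ell-1}, r_\ell, \theta_m)
```

In this reading, the weight $p_\ell$ depends only on the position $\ell$ within the repeat array and not on the nucleotide or RVD at that position.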

The term $m_0(x_0 \mid r_1, \theta_0)$ for binding to the zero-th repeat may depend on the first RVD of the TALE, since dependencies between zero-th and first repeat have been observed before [23]. However, our knowledge about such dependencies is limited to the data presently available and, hence, we limit the RVDs for which a dependency is considered to a set $R_0$. Our data regarding systematic, quantitative analyses of the base preference of the zero-th repeat are limited in general, although it is widely assumed that position 0 in target boxes of natural TALEs is preferentially T and less frequently C. We include this prior knowledge into a-priori parameters $\pi_{x_0}$.

The term $m_1(x_1 \mid r_1, \theta_1, \theta_m)$ for binding to the first repeat depends on the 13th AA $r_{1,13}$ of the first RVD $r_1$, but may be extended by additional terms that model a general dependency on the complete first RVD (including the 12th AA), and/or a separate base preference for a given 13th AA at the first position. Again, this modularity allows us to adapt the model to the resolution of data available, since a substantial part of RVDs is only covered by the systematic but limited data reported in [24,25].

In this paper, we set [...]

The term $m(x_\ell \mid r_{\ell-1}, r_\ell, \theta_m)$ for binding to the remaining repeats again depends on the 13th AA $r_{\ell,13}$ of the current RVD $r_\ell$, but may be extended by additional terms that model a dependency on the complete RVD (with parameters shared with the corresponding term used for the first RVD), and/or on the complete RVD $r_\ell$ at the current repeat and the 12th AA $r_{\ell-1,12}$ at the previous repeat: [...]

In this paper, we set $R_3 = \{\mathrm{HD}, \mathrm{NN}, \mathrm{NG}, \mathrm{NI}\}$.
The parameters $\theta_{p,a,1}$ and $\theta_{p,a,2}$ denote the slopes, and $\theta_{p,b,1}$ and $\theta_{p,b,2}$ denote the location parameters of the logistic functions.

Learning parameters

The training data $D = (t_1, \ldots, t_N)$ comprise tuples $t_i = (r_i, x_i, v_i, w_i, g_i)$ of TALE RVD sequence $r_i$, target box $x_i$, target value $v_i$, global weight $w_i$, and group $g_i$ (cf. sections "Data" and "Model"). Given the current parameter values $\theta$, we may further compute, for each pair of TALE and target box, the corresponding model score $s_i = s(x_i \mid r_i, \theta)$.

The goal of the learning process is to adapt the parameter values $\theta$ such that the [...] scales. Hence, we allow the learning process to linearly transform the computed scores $s_i$ before comparing them to the target values. The total error between target value and prediction score is defined as [...] $\beta = (\beta_{a,1}, \beta_{b,1}, \ldots, \beta_{a,G}, \beta_{b,G})$, $\beta_{a,g_i}$ and $\beta_{b,g_i}$ are group-specific scale and shift parameters, respectively, and $G$ is the total number of groups in the data set $D$.

In addition, we use an $L_2$ regularization term on the model parameters $\theta$ to avoid overfitting and explosion of parameter values: [...] where the regularization parameter $\lambda$ is set to 0.1 in this paper.

The number of model parameters for the different terms varies greatly, depending on the number of conditions (e.g., 12th AA of previous RVD, separate parameters for individual RVDs). This regularization also has the effect that more complex dependency parameters assume values considerably different from 0 only if the modeled specificity cannot be captured by the less complex sets of parameters.

The final objective is then to minimize the sum of the error term $E(\theta; D, \beta)$ and the regularization term $L_2(\theta)$ with respect to the parameter values: [...]

Parameter optimization is performed by a gradient-based quasi-Newton method as implemented in the Jstacs library [20].
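The structure of this objective can be illustrated with a short Python sketch. Two forms are assumed here rather than taken from the text: the error term is taken to be a weighted squared difference between the target value and the linearly transformed score, and the regularization term is taken to be $\lambda$ times the sum of squared parameters. All function names, and the `score_fn` callback standing in for the model score, are hypothetical.

```python
def total_error(scores, data, beta):
    """Weighted squared error after a group-specific linear transform.

    data: list of tuples (rvds, box, target_value, weight, group).
    beta: dict mapping group -> (scale, shift).
    ASSUMPTION: squared error weighted by the global weight.
    """
    err = 0.0
    for s, (_rvds, _box, v, w, g) in zip(scores, data):
        scale, shift = beta[g]
        err += w * (v - (scale * s + shift)) ** 2
    return err


def l2_penalty(theta, lam=0.1):
    """ASSUMPTION: penalty is lam times the sum of squared parameters."""
    return lam * sum(t * t for t in theta)


def objective(theta, data, beta, score_fn, lam=0.1):
    """Sum of error term and regularization term for given parameters.

    score_fn(box, rvds, theta) is a hypothetical stand-in for the model score.
    """
    scores = [score_fn(box, rvds, theta) for rvds, box, _v, _w, _g in data]
    return total_error(scores, data, beta) + l2_penalty(theta, lam)
```

A function of this shape could be handed to any gradient-based optimizer; the sketch only evaluates the objective and does not reproduce the Jstacs implementation.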

The final parameters $\theta^*$ of the trained model may then be used to determine prediction scores of previously unseen pairs of TALEs and putative target boxes, whereas the value of $\beta^*$ is discarded after optimization.

Prediction of TALE target boxes
For predicting putative TALE target boxes for a given TALE with RVD sequence $r$ of length $L$, we follow a sliding window approach scanning input sequences $x_1, \ldots, x_N$.

Input sequences could, for instance, be promoter sequences of annotated genes but also complete chromosomes. Each sub-sequence $x_{i,\ell}, \ldots, x_{i,\ell+L}$ then serves as input of the model to compute the corresponding score $s(x_{i,\ell}, \ldots, x_{i,\ell+L} \mid r, \theta^*)$. To allow for a rough comparison of scores, even between TALEs of different lengths, we normalize this score to the length of the input sequence, i.e., we compute a normalized score as [...]

The scanning process explicitly accounts for aberrant repeats, which may loop out of the repeat array [6]. To this end, we search for putative target boxes with all repeats present in the repeat array, but also with all combinations of aberrant repeats removed from the RVD sequence. Due to the normalization of scores by the number of repeats, predictions based on these modified RVD sequences can still be ranked in a common list.

In addition, we provide a box-specific p-value as a statistical measure for the significance of target box predictions. Those p-values may either be computed from a dedicated background set of sequences or from a random sub-sample of the scanned input sequences, where the latter option is used throughout this paper. In either case, scores are computed for the sub-sequences given the current RVD sequence, then a Gaussian distribution is fitted to those score values, and the p-value for a given score is determined from that Gaussian distribution. While the Gaussian distribution does not perfectly fit the true distribution of score values, it allows for computing p-values with high resolution (as opposed to just using percentages of the scores themselves) and even for score values larger than any of the scores in the random sample.

[...] averaging methods may be adjusted by user parameters.
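Two steps of the scanning procedure in section "Prediction of TALE target boxes", namely enumerating RVD sequences with aberrant repeats looped out and deriving a p-value from a fitted Gaussian, can be sketched in Python as follows. This is an illustration under assumptions, not the PrediTALE implementation: the function names are hypothetical, aberrant repeats are given as 0-based indices, and the Gaussian is fitted by sample mean and sample standard deviation.

```python
from itertools import combinations
from statistics import NormalDist, fmean, stdev


def loop_out_variants(rvds, aberrant_positions):
    """Yield the full RVD sequence and every variant in which a subset of
    the aberrant repeats (0-based indices) has been removed."""
    for k in range(len(aberrant_positions) + 1):
        for removed in combinations(aberrant_positions, k):
            yield [rvd for i, rvd in enumerate(rvds) if i not in removed]


def gaussian_p_value(score, background_scores):
    """Upper-tail p-value of `score` under a Gaussian fitted to the
    background scores (needs at least two distinct background values)."""
    dist = NormalDist(fmean(background_scores), stdev(background_scores))
    return 1.0 - dist.cdf(score)
```

For a TALE with n aberrant repeats, `loop_out_variants` yields 2^n RVD sequences, each of which would be scanned separately. Because the p-value comes from the fitted distribution rather than from the empirical ranks, it remains defined for scores above the largest background score.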

For each predicted target box, a profile output is generated if there is at least one differentially expressed region with a minimum length of 400 bp that does not overlap the target box, or, if it does overlap, the differential region starts or ends at most 50 bp upstream or downstream of the target box.

The obtained profiles may be visualized using an auxiliary R script. In addition to the profile data, this R script requires annotation data of already known transcripts in GFF3 format. By this means, users may then investigate whether the predicted binding [...] [20] and will be part of the next Jstacs release.

For scanning large input sequences, e.g., complete genomes of host plant species, an acceptable runtime is essential. Since the parameters at each position of the proposed model depend on the RVD sequence of the TALE of interest but do not include dependencies between different nucleotides of a putative target box, we may convert the model, given a fixed TALE RVD sequence, into a position weight matrix (PWM) [32,33]. This allows for a quick computation of prediction scores, which may be formulated as the position-wise sum of values stored in the TALE-specific PWM. We further speed up the scanning process by pre-computing indexes of overlapping k-mers in the same manner as proposed for the TALENoffer application earlier [34].
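The conversion to a PWM and the position-wise scoring can be sketched in Python as follows. This is a minimal illustration and not the Jstacs implementation: `position_score` is a hypothetical callback standing in for the trained model's per-position contribution, the k-mer index is omitted, and the input is assumed to consist of upper-case A, C, G, and T only.

```python
def build_pwm(rvds, position_score):
    """Tabulate the score of each base at each position for a fixed RVD
    sequence. Row 0 corresponds to position 0 of the target box.

    position_score(pos, rvds, base) -> float is a hypothetical stand-in
    for the model's contribution at one position.
    """
    return [
        {base: position_score(pos, rvds, base) for base in "ACGT"}
        for pos in range(len(rvds) + 1)
    ]


def scan(sequence, pwm):
    """Slide the PWM over the sequence; return (start, score) per window."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(row[base] for row, base in zip(pwm, window))
        hits.append((start, score))
    return hits


# Toy scoring for demonstration only (not trained parameters): reward T at
# position 0 and the commonly cited preferred base of four frequent RVDs.
PREFERRED = {"HD": "C", "NI": "A", "NG": "T", "NN": "G"}


def toy_position_score(pos, rvds, base):
    if pos == 0:
        return 1.0 if base == "T" else 0.0
    return 1.0 if PREFERRED.get(rvds[pos - 1]) == base else 0.0
```

Once the matrix is built, scoring a window costs one table lookup per position, independent of how complex the dependencies between neighbouring RVDs are, since those have already been resolved when filling the table.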

We compare the performance of the approach presented in this paper to that of established tools for predicting TALE target sites, namely TALESF [14], Talvez [16], and TALgetter [18], based on RNA-seq data after inoculation with different Xoo and Xoc strains described above.

To this end, we collect the promoter sequences of all transcripts based on the MSU7 assembly and gene models [35] available from http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/. We consider as promoter the sequence spanning from 300 bp upstream of the transcription start site to 200 bp downstream of the transcription start site or the start codon, whichever comes first, as proposed before [18]. We then run each of the tools using default parameters on the extracted promoter sequences, providing the RVD sequences of the TALEs present in the respective Xanthomonas strain. Predictions in promoters of different transcripts belonging to the same gene are merged by considering only the prediction yielding the best prediction score.
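The promoter definition and the per-gene merging described above can be sketched in Python. The function names, the strand encoding ("+"/"-"), and the treatment of the returned coordinates as plain genomic boundary positions are assumptions made for illustration; a higher score is assumed to be better, and clipping at chromosome ends is omitted.

```python
def promoter_interval(tss, start_codon, strand, upstream=300, downstream=200):
    """Return (left, right) genomic boundaries of the promoter region.

    The region reaches `upstream` bp upstream of the TSS and extends
    downstream to TSS + `downstream` bp or to the start codon, whichever
    comes first in the direction of transcription.
    """
    if strand == "+":
        return tss - upstream, min(tss + downstream, start_codon)
    # On the reverse strand, "downstream" means smaller coordinates.
    return max(tss - downstream, start_codon), tss + upstream


def best_per_gene(predictions, transcript_to_gene):
    """Keep only the best-scoring prediction per gene.

    predictions: iterable of (transcript_id, score) pairs.
    """
    best = {}
    for transcript, score in predictions:
        gene = transcript_to_gene[transcript]
        if gene not in best or score > best[gene]:
            best[gene] = score
    return best
```

For a forward-strand transcript with its TSS at position 1000 and its start codon at 1100, the interval is (700, 1100): the start codon lies within 200 bp of the TSS and therefore bounds the region.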

Assessment of prediction performance based on in-planta inoculation experiments with Xanthomonas strains harboring multiple TALEs has the inherent complications that i) putative target genes cannot be attributed to one specific TALE based on the RNA-seq data alone and ii) genes showing increased expression after inoculation may either be regulated directly by a TALE binding to their promoter or indirectly via other, regulatory target genes. Hence, we define true positives as those genes that have a predicted target box in their promoter and are also up-regulated after inoculation with the respective Xanthomonas strain relative to control as derived from RNA-seq data.

By contrast, we cannot clearly define false negatives, since genes that are up-regulated after inoculation but do not contain a predicted target box in their promoter could be indirect target genes. False positives, in turn, would be genes with a predicted target box in their promoter that are not up-regulated after Xanthomonas inoculation.

A further issue hampering performance assessment by standard methods like receiver operating characteristic (ROC) [36] or precision-recall (PR) curves [37,38] is that for two of the tools considered (TALESF and Talvez), none of the reported prediction scores is comparable between different TALEs, especially TALEs of different lengths.

Hence, we decide to use varying cutoffs on the number of predicted target genes per TALE to establish a common ground for comparing all four approaches.

Following these considerations, we collect for each of the four approaches the number of true positive predictions (TPs) for cutoffs on the number of predictions per TALE from 1 (i.e., the top prediction) to 50. We then plot for each approach the number of true positives against this cutoff to obtain a continuous picture of its prediction performance. In addition, we collect for the same cutoffs the number of TALEs with at least one predicted target gene among the true positives.
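The two curves described above can be computed as in the following Python sketch. The function name and data layout are hypothetical, and one detail is an assumption: a gene predicted for several TALEs of the same strain is counted once, since putative target genes cannot be attributed to a single TALE.

```python
def tp_curves(ranked_predictions, upregulated, max_cutoff=50):
    """Count true positives for rank cutoffs 1..max_cutoff.

    ranked_predictions: dict TALE -> list of gene IDs, best prediction first.
    upregulated: set of gene IDs up-regulated after inoculation.
    Returns (tps, tales_with_tp); entry i belongs to cutoff i + 1.
    ASSUMPTION: true positive genes are counted once per strain.
    """
    tps, tales_with_tp = [], []
    for cutoff in range(1, max_cutoff + 1):
        hit_genes = set()
        n_tales = 0
        for genes in ranked_predictions.values():
            hits = {g for g in genes[:cutoff] if g in upregulated}
            hit_genes |= hits
            n_tales += bool(hits)  # TALE has at least one true positive
        tps.append(len(hit_genes))
        tales_with_tp.append(n_tales)
    return tps, tales_with_tp
```

Summing the entries of either list gives a simple step-wise area under the respective curve, which is one possible way to condense a curve into a single number.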

The area under these curves may serve as a further measure of general prediction performance in analogy to, for instance, the area under the ROC curve.

Finally, we compare the TPs at distinct cutoffs (1, 10, 20, 50) between the four tools. For a specific cutoff, we collect the TPs (or, in analogy, the number of TALEs with at least one predicted target) for each of the four tools. Statistical significance of the differences in observed TPs is then assessed by a Quade test [39] [...]

In this section, we benchmark the predictions of PrediTALE against those made by each of the previous approaches, namely Target Finder [14], Talvez [16], and TALgetter [18]. To this end, we consider different Xanthomonas oryzae pv. oryzae (Xoo) and [...] RNA-seq data [9]. For the TALEs from the repertoires of these three Xoo and ten Xoc strains, we determine target gene predictions for each of the previous approaches and for PrediTALE. Predicted target genes are ranked by the corresponding prediction scores of the different approaches per TALE.

First, we study the overlaps between the sets of predicted target genes per approach to investigate how strongly predictions are affected by conceptual differences of these approaches. In Figure 1A, we show Venn diagrams of predicted target genes for the three Xoo strains based on the top 20 predictions per TALE, while the corresponding diagrams for the ten Xoc strains are available as Supplementary figure S1. In general, we observe a substantial number of unique predictions for each of the four approaches, but especially for Talvez and PrediTALE. By contrast, the overlapping predictions between all four approaches amount to less than a quarter of the total predictions per approach. This demonstrates that prediction results strongly depend on the employed approach. However, prediction accuracy cannot be assessed without experimental knowledge about genes that are up-regulated in planta upon Xanthomonas infection.

For this reason, we filter predictions based on the corresponding RNA-seq data in the following.

RNA-seq data for the three Xoo strains, including previously unpublished data for PXO83, have been collected 24 hours after infection. Collection at this early time point has the advantage that the number of secondary targets, i.e., genes that are up-regulated as a secondary effect of direct TALE targets with regulatory function, should still be low. However, as the infection might not be fully established yet, the variation between replicates and, hence, the number of significantly differentially expressed genes based on standard FDR-based criteria is rather low (cf. Supplementary table A). As we aim at sensitivity for the benchmark study, i.e., we want to avoid predictions being erroneously counted as false positives, we consider genes as differentially up-regulated if they obtain an uncorrected p-value below 0.05 and are at least 2-fold up-regulated in this case, [...]

The results presented so far strongly depend on the thresholds on the ranks of the target predictions but also on the thresholds applied to the RNA-seq data. To address the former problem, we aim at an assessment of target predictions over all rank thresholds, while we will handle the latter by separate evaluations applying different criteria to the RNA-seq data.

As detailed in section "Evaluation of prediction results", standard performance measures like the area under the ROC curve [36] or the area under the precision-recall curve [37,38] are inappropriate under this setting. Briefly, we cannot attribute an [...] (3). This is also reflected by the AUC values, which rank the approaches in the order of PrediTALE, TALgetter, Talvez and Target Finder.

We take a different perspective on prediction results by assessing prediction performance on the level of TALEs. Specifically, we count the number of TALEs with at least one true positive target prediction for the same rank cutoffs as before. Again, PrediTALE identifies targets for a larger number of TALEs than the other approaches for the majority of rank cutoffs (Figure 3). However, we see notable differences between the different Xoo strains. For ICMP 3125ᵀ, PrediTALE is able to identify putative targets for 10 of its 17 TALEs. By contrast, the number of TALEs with a true positive prediction is lower for PXO142, where PrediTALE finds targets for at most 7 out of 19 TALEs, and for PXO83, where PrediTALE finds targets for at most 7 out of 18 TALEs. As ICMP 3125ᵀ has also been the strain with the largest number of differentially [...]

Although it has been shown that TALEs may activate transcription in both strand orientations relative to the transcription start site (TSS) of target genes [12,13], a preference for the forward orientation has been postulated [13]. This is reflected by the strand penalty of PrediTALE, but no similar parameter exists for the previous [...] Table I). We also find an improved performance for the majority of the remaining rank cutoffs and Xoc strains.
This improvement is especially pronounced for strains Xoc BLS279, CFBP7331, CFBP7341, and L8, whereas PrediTALE performs similarly to or slightly worse than at least one of the previous approaches for Xoc CFBP7342 and [...] approaches and fixed rank cutoffs of 1, 10, 20, and 50, and for the area under the curve both on the level of target genes and on the level of TALEs (Table 1 and Supplementary tables I and J). For all rank cutoffs and the area under the curve, we observe that PrediTALE yields the best average rank with values between 1.1 and 1.5. We further assess the statistical significance of differences between the different tools by a Quade test, and the pairwise differences between tools by the associated post-hoc test (see Methods). This assessment is partly limited by the fact that pairs of Xoc strains may have identical TALEs in their TALEomes, which also means that the performance values of those strains are not truly independent. However, we did not find a clear relationship between the similarity of performance values obtained for the different strains and the similarity of the corresponding TALEomes. For this reason, we consider this dependency rather mild and favor this limited statistical assessment over the complete lack of it.

[...] figures S12 and S13, Supplementary tables Q, R, and S), benchmarking results are essentially similar to our previous findings. One notable exception is the Quade test for rank 1 predictions restricted to the forward strand (Supplementary table S), which is no longer significant. This means that none of the approaches studied yields significantly better rank 1 predictions than any other under this scenario.

In summary, we find i) that PrediTALE produces several unique predictions that [...]

In Table 2, we collect further information about those target genes, including the corresponding log fold change and prediction ranks for all four approaches.

The target genes in the intersection of the predictions of all four approaches comprise several well-known targets: Os09g29820 (OsTFX1), a bZIP transcription factor, is targeted by TALEs from class TalAR with members in all three Xoo strains (Supplementary figure S14) and was proposed as a TALE target early on [5,42].

Table legend: For each tool and each measure (TALEs/Genes; rank cutoff), we report the average performance rank per tool, the significance of the Quade test (*: < 0.05; **: < 0.01; ***: < 0.001), and the significance of the pairwise comparison in a post-hoc test. Here, '+' and '-' indicate that the first tool has gained a significantly better or worse performance than the second one, respectively. The number of symbols encodes the significance level in analogy to the Quade test.

Os01g40290 [5], an expressed protein without annotated function, is targeted by TalAA members, which are also present in all three Xoo strains. However, this gene is not in the list of predictions for Xoo PXO83, because the corresponding p-value was larger than the threshold of 0.05 (Supplementary table T). Os01g73890 (TFIIAγ) [5], which has been shown to promote TALE function [43], is targeted by TalBM2 in ICMP 3125ᵀ. In concordance with TalBM class members missing in PXO142 and PXO83, Os01g73890 shows no up-regulation in these two strains. Os07g06970 (HEN1) has also been among the first TALE target genes proposed [5] and is targeted by TalAP members present in all three Xoo strains, but falls below the threshold on the log fold change by a small margin in ICMP 3125ᵀ (Supplementary figure S15). Os06g29790 [18], a phosphate transporter, has been predicted as a target of TalAO16 from PXO142 by all four approaches, but not for the TalAO members from PXO83 and ICMP 3125ᵀ, which have a slightly different and longer RVD sequence. In the RNA-seq data, however, we find the strongest up-regulation for ICMP 3125ᵀ, although Os06g29790 appears only on rank 49 of the PrediTALE predictions for TalAO15 from that strain. Hence, experimental data and computational predictions are partly contradictory in this case.

In addition, we find several putative target genes in the intersection that have not been reported before: Os02g06670, a retrotransposon protein, is predicted as a target of TalBA8 and TalBA2 in ICMP 3125ᵀ and PXO83, respectively, whereas PXO142 lacks a TalBA member. Nonetheless, Os02g06670 is up-regulated after PXO142 infection, although to a lesser degree than in the other two strains (cf. Supplementary figure S15). Os11g26790 (RAB21), a dehydrin that has been shown to play a role in drought tolerance related to pathogen infection [44], is predicted to be targeted by TalAH11 from ICMP 3125ᵀ.
Os11g26790 is up-regulated for ICMP 3125 T but also for PXO142 574 (Supplementary figure S15), although in the latter case, the corresponding p-value is 575 again not significant. Os02g49350, a plastocyanin-like protein, is strongly up-regulated 576 only in PXO142 and predicted as a target of TalBH2, where class TalBH is exclusive to 577 PXO142 among the strains studied. gene is among the top 20 predictions of PrediTALE only for TalAQ3 in PXO83 due to 592 differences in RVD sequence. In PXO142, TalAQ15 is annotated as a pseudo gene and 593 this pattern is also reflected by the RNA-seq data. Os03g03034 has been proposed to be 594 a TALE target before [5].
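
The combination of rank cutoff and expression thresholds used in this comparison can be illustrated with a minimal Python sketch. It assumes that a gene counts as up-regulated if its p-value is below 0.05 and its log fold change reaches a cutoff. The cutoff value `min_lfc=1.0`, the names `Expression`, `is_up_regulated` and `supported_targets`, and the toy gene IDs are illustrative assumptions and are not taken from PrediTALE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    lfc: float      # log fold change, infection vs. control
    p_value: float  # p-value of the differential expression test


def is_up_regulated(e, min_lfc=1.0, max_p=0.05):
    # min_lfc is a hypothetical cutoff; only the 0.05 p-value threshold
    # is stated in the text.
    return e.lfc >= min_lfc and e.p_value < max_p


def supported_targets(ranked_predictions, expression, top_n=20,
                      min_lfc=1.0, max_p=0.05):
    """Return (rank, gene) pairs among the top_n predictions of one TALE
    that are up-regulated according to the RNA-seq data."""
    hits = []
    for rank, gene in enumerate(ranked_predictions[:top_n], start=1):
        e = expression.get(gene)
        if e is not None and is_up_regulated(e, min_lfc, max_p):
            hits.append((rank, gene))
    return hits


# Toy example with made-up gene IDs and values.
expr = {
    "g1": Expression(lfc=2.5, p_value=0.001),
    "g2": Expression(lfc=3.0, p_value=0.2),    # fails the p-value threshold
    "g3": Expression(lfc=0.4, p_value=0.001),  # fails the lfc cutoff
}
print(supported_targets(["g2", "g1", "g3", "g4"], expr))
```

Under these assumptions, a gene with strong up-regulation but a non-significant p-value (such as `g2` in the toy data) is excluded in the same way as described for Os01g40290 in PXO83.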

Os04g05050, annotated as a pectate lyase, is only among the top 20 predictions of PrediTALE in ICMP 3125 T (TalAB16) and PXO83 (TalAB5), whereas this gene is ranked substantially lower (rank 83) for TalAB8 from PXO142 by PrediTALE as well. From the RNA-seq data, we find that Os04g05050 is up-regulated in all three Xoo strains, although the level of up-regulation is lower for PXO142 than for the other two strains.

Os05g45070, annotated as hairpin-induced protein 1, is predicted only by PrediTALE as an alternative target of TalAO15 in ICMP 3125 T and shows clear up-regulation only after infection with this Xoo strain. Os10g28240, a calcium transporting ATPase, is predicted by TALgetter and PrediTALE as a target of TalAR13 of ICMP 3125 T but, on later ranks, also by the other two approaches, and is up-regulated exclusively after ICMP 3125 T infection. Os09g07460, a kelch repeat protein, is only among the top 20 predictions of Talvez for TalBA and on later ranks for the other approaches. This gene is up-regulated only in ICMP 3125 T, although not strongly.

For PXO142, we find two further putative targets of TalBH2 that are predicted exclusively by PrediTALE: Os03g09150 (pumilio-family RNA binding) is up-regulated in PXO142 but also in PXO83, for which it does not appear among the top 20 predictions of any approach. Os11g31190 (Os11N3, OsSWEET14) is a well-known target [45,46], which is predicted here also for TalBH exclusively by PrediTALE due to its ability to adequately handle the aberrant repeat [6] of TalBH2. Os11g31190 is also known to be targeted by TalAC members (previously termed AvrXa7) [42] including TalAC5 in PXO83 and, hence, is strongly up-regulated after PXO83 infection as well.

However, in this case all approaches fail to predict this target due to the large number of mismatches in the target box [6], even accounting for the aberrant repeat in TalAC5. Instead, another retrotransposon protein (Os04g19960) is the top prediction of PrediTALE for TalAC5 from PXO83, which is confirmed by RNA-seq data as this gene is strongly up-regulated after PXO83 infection but not after infection with one of the other strains.

In summary, we find several novel putative target genes of which 6 are highly [...] individual TALEs [47].

We also observe from Figure 3 and Supplementary figure S7 that for many strains, none of the approaches considered is able to identify putative target genes for all TALEs present in their TALEome. We term such TALEs without reasonable target prediction orphan TALEs, and we will discuss these in more detail in the following.

More precisely, we call a TALE or a TALE class orphan if there is no up-regulated gene among the top 50 predictions of any of the four approaches. Furthermore, we check if this pattern is consistent for the TALEs from a common TALE class across almost all Xoo and Xoc strains studied.

We find as orphan the TALE classes present in all three Xoo strains TalAF, TalAI [...] [48].
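
The orphan criterion stated above can be written down as a short Python sketch. The function names, the data layout (one ranked gene list per prediction tool), and the `max_exceptions` parameter used to model "almost all" strains are assumptions made for illustration; the text does not specify how many exceptions are tolerated.

```python
def is_orphan(predictions_by_tool, up_regulated, top_n=50):
    """A TALE is orphan if no up-regulated gene occurs among the top_n
    predictions of any tool.

    predictions_by_tool: dict mapping tool name -> ranked list of gene IDs
    up_regulated: set of gene IDs up-regulated after infection
    """
    return not any(
        gene in up_regulated
        for ranked in predictions_by_tool.values()
        for gene in ranked[:top_n]
    )


def class_is_orphan(member_flags, max_exceptions=1):
    """A TALE class is treated as orphan if its members are orphan in all
    but at most max_exceptions strains (hypothetical tolerance)."""
    if not member_flags:
        return False
    return sum(not flag for flag in member_flags) <= max_exceptions


# Toy example with made-up tool names and gene IDs.
up = {"g7"}
preds = {"toolA": ["g1", "g2"], "toolB": ["g3", "g7"]}
print(is_orphan(preds, up))            # g7 is predicted by toolB
print(is_orphan(preds, up, top_n=1))   # g7 is outside the top 1
```

Note that the result depends on the rank cutoff: the same TALE may be orphan under a strict cutoff and non-orphan under a more permissive one.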

In the Xoc strains, however, TalAF is not orphan as we find putative target genes [...]

For each Xoo strain, we list the gene ID (MSU7) and the log fold change (lfc) in the corresponding RNA-seq experiment. For each of the four approaches, we further list the TALE(s), for which a gene has been predicted as a target and in parentheses the corresponding prediction rank. An "NA" entry for a combination of gene and prediction approach indicates that this gene has not been among the top 1000 predictions for any TALE.

[...] Xoo and Xoc strains.

After infection with Xoo strains, 14 TALEs are found to have differentially expressed regions near at least one predicted target box. Table 3 lists all 19 TALE target boxes together with MSU7 gene annotations overlapping the differentially expressed regions. Notably, 15 of these targets have already been reported in subsection "PrediTALE predicts novel putative target genes" when restricting the search to promoter regions of annotated genes. However, for two genes, target boxes of other TALEs were predicted in the genome-wide scan. According to the promoter predictions, the pectate lyase precursor (Os04g05050) was up-regulated by TalAB5, whereas the genome-wide prediction assigns the same gene to TalAD22. The same holds for the phosphate transporter 1 (Os06g29790), which according to the promoter predictions is up-regulated by TalAO16 and TalAP15, whereas the genome-wide scan predicted a target box of TalAH11. The genome-wide scan i) does not make use of gene annotations, and ii) could be expected to be more prone to false positive predictions than the restricted search in promoters. Hence, the fact that many predictions re-occur in the genome-wide scan demonstrates the general utility of this approach.
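
The pairing of predicted target boxes with differentially expressed regions can be sketched as follows. This is a simplified illustration, not the procedure implemented in PrediTALE: the maximum distance `max_dist`, the record types `Box` and `Region`, and all coordinates are hypothetical, and strand orientation is ignored.

```python
from collections import namedtuple

Box = namedtuple("Box", "tale chrom pos")           # predicted target box
Region = namedtuple("Region", "chrom start end")    # differentially expressed region


def distance(box, region):
    """Distance in bp between a box and a region on the same chromosome,
    0 if the box lies inside the region, None for different chromosomes."""
    if box.chrom != region.chrom:
        return None
    if region.start <= box.pos <= region.end:
        return 0
    return min(abs(box.pos - region.start), abs(box.pos - region.end))


def boxes_near_regions(boxes, regions, max_dist=3000):
    """Return (box, region, distance) triples with distance <= max_dist.
    max_dist is an assumed window size, not a value from the text."""
    pairs = []
    for box in boxes:
        for region in regions:
            d = distance(box, region)
            if d is not None and d <= max_dist:
                pairs.append((box, region, d))
    return pairs


# Toy example with made-up TALE names and coordinates.
boxes = [Box("TalX", "Chr1", 1000), Box("TalY", "Chr2", 5000)]
regions = [Region("Chr1", 1200, 1800), Region("Chr2", 9000, 9500)]
for box, region, d in boxes_near_regions(boxes, regions, max_dist=500):
    print(box.tale, region.chrom, region.start, region.end, d)
```

Because no gene annotation enters this step, regions without an overlapping annotated gene are reported in the same way as regions within annotated genes.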

In addition to those targets reported previously, we find three novel target boxes in the vicinity of differentially expressed regions that overlap annotated genes, including a wound-induced protein and an oxidoreductase. For TalAO16 from PXO142, we find a [...] Table W). For this reason, we extracted the sequence under the differentially expressed region, and first compared it against the NCBI protein database 'nr' using blastx but received no matching result. We additionally compared this sequence against the NCBI reference RNA sequences (refseq_rna) using blastn, which resulted in a highly significant hit for XR_001547425.2, a predicted long non-coding RNA.

Table 3 (excerpt). Columns: TALE | chromosome | position | gene ID (MSU7) | annotation | predicted in promoterome-wide scan (yes, no, or the TALE(s) predicted there instead)
TalAA15 | Chr1 | 22747303 | Os01g40290 | expressed protein | yes
TalAD22 | Chr3 | 29685233 | Os03g51760 | OsFBX109 - F-box protein | yes
TalAD22 | Chr4 | 2486797 | Os04g05050 | pectate lyase precursor | TalAB5
TalAH11 | Chr6 | 17129738 | Os06g29790 | phosphate transporter 1 | TalAO16, TalAP15
TalAN14 | Chr2 | 31931460 | Os02g52170 | expressed protein | no
TalAN14 | Chr8 | 19950534 | Os08g32160 | oxidoreductase, 2OG-FeII oxygenase | no
TalAR13 | Chr10 | 14685398 | Os10g28240 | calcium-transporting ATPase | yes
TalAR13 | Chr9 | 18123472 | Os09g29820 | bZIP transcription factor | yes
TalBA8 | Chr2 | 3353526 | Os02g06670 | retrotransposon protein | yes
TalBM2 | Chr1 | 42819000 | Os01g73890 | transcription initiation factor IIA gamma | yes
PXO142
TalAO16 | Chr7 | 22546154 | - | - | NA
TalAR14 | Chr5 | [...]

In the following, we will discuss two example regions in detail. As discussed in the previous section, TalAZ appears to be an orphan TALE based on the promoterome-wide scans for target boxes. However, based on genome-wide scans, we find a differentially expressed region, which could constitute a target gene of TalAZ, on Chr4 (Figure 5).

Accurate computational predictions of TALE target boxes are required for elucidating virulence targets of TALEs that support bacterial infection of host plants. In this paper, we present PrediTALE, a novel approach for predicting target boxes based on a TALE's RVD sequence. Since the publication of the previous approaches [14,16,18], our understanding of mechanisms and principles of TALE targeting has increased substantially. Specifically, it has been shown that repeats of aberrant lengths may compensate for frame shifts in target boxes [6], and that activation of gene expression by TALEs binding to the reverse strand is possible, but rare [13]. In addition, quantitative data about virtually all combinations of AAs at RVD positions have been collected [19,21-25]. All these insights have been integrated into PrediTALE either as part of the model or as training data that are used to adapt model parameters. Here, we demonstrate that PrediTALE predicts TALE targets with improved accuracy compared with previous approaches, where ground truth is derived from in-house and public RNA-seq data after Xoo and Xoc infection. However, our results also confirm that all current computational approaches suffer from false positive predictions and, hence, experimental support of predicted targets is indispensable.
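
To illustrate what scanning for target boxes on both strands involves, the following Python sketch uses a deliberately simplified one-to-one RVD code (HD→C, NG→T, NI→A, NN→G) and a mismatch count in place of the learned, quantitative model of PrediTALE. The function names, the mismatch threshold, the strict requirement of a 'T' at position 0, and the toy sequences are assumptions for illustration only; aberrant repeats and dependencies between adjacent RVDs are not modeled.

```python
# Simplified RVD-to-nucleotide code; a stand-in for learned specificities.
RVD_BASE = {"HD": "C", "NG": "T", "NI": "A", "NN": "G"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def scan(rvds, seq, max_mismatches=0):
    """Return (start, strand, mismatches) for candidate boxes on both strands.

    A candidate consists of a 'T' at position 0 followed by len(rvds) bases.
    start is the leftmost forward-strand index covered by the candidate.
    """
    n = len(rvds)
    hits = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for i in range(len(s) - n):
            if s[i] != "T":
                continue
            box = s[i + 1:i + 1 + n]
            mismatches = sum(RVD_BASE.get(r) != b for r, b in zip(rvds, box))
            if mismatches <= max_mismatches:
                start = i if strand == "+" else len(s) - (i + 1 + n)
                hits.append((start, strand, mismatches))
    return hits


# Toy example: the same box placed on the forward and on the reverse strand.
rvds = ["HD", "NG", "NI"]
print(scan(rvds, "GGTCTAGG"))
print(scan(rvds, revcomp("GGTCTAGG")))
```

Scanning the reverse complement doubles the search space, which is one reason why strand-aware scoring and thresholds matter for limiting false positives.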

PrediTALE predicts several unique target genes, several of which are highly promising for further experimental validation. While RNA-seq data support that these are activated by TALEs in planta, their importance for the infection process still needs to be investigated.

Given the improved accuracy and acceptable runtime of PrediTALE, we broaden the scope of computational predictions. Previously, predictions have been mostly limited to putative promoter regions of annotated genes. Here, we consider genome-wide predictions instead. We demonstrate that targets reported from promoterome-wide predictions are also recovered in genome-wide scans, but we also find differentially expressed regions at loci that do not overlap with annotated genes. These could be protein-coding genes that are missing from the current annotation, or putative non-coding RNAs, which might have regulatory activity or other functions that foster bacterial infection.

To promote future research in plant-pathogen interactions related to TALEs, we make our methods available to the scientific community as open-source software tools.