MALTA: a calculator for estimating the coverage with shRNA, CRISPR, and cDNA libraries

Genetic screens using shRNA, CRISPR, or cDNA libraries rely on adequately transferring the library into cells for further assay. These libraries can have many different elements and each element can be present at different copy numbers within a given pooled library. Calculating how many recipient cells are needed to adequately sample all or most of the different elements within a library is important, especially if one wants to compare the outcomes of different genetic screens that rely on accurately reproducing the starting population of library-containing cells. Here we present a simple application that starts with a list of library elements and their abundance and calculates the minimum sampling number to achieve full transfer of the library to an acceptor cell population to a user-specified level of probability. Users can adjust several input parameters including designating a subpopulation over which the calculation is made. Finally, the program performs a series of Monte Carlo simulations of a user-specified number of picks to produce an empirically determined distribution of each library element.


Motivation and Significance
In molecular biology, a library is a collection of genetic elements used in different types of experiments to understand which genes function in a biological process and how they do so. Many libraries are composed of large collections of elements housed within plasmid vectors, which are fragments of double-stranded DNA that replicate independently from chromosomal DNA and are easy to manipulate in the laboratory. Individual plasmids can be engineered to express a specific RNA that in turn can be translated into a protein, or an RNA that can form part of an enzyme complex that can destroy specific genes (e.g. CRISPR/Cas9 guide strand gRNA) or suppress their expression via RNA interference (e.g. small hairpin shRNA) [1][2][3][4]. A library is formed when specific plasmids that encode different proteins or RNAs are collected together. These libraries then provide important tools to survey the function of different genes, enabling a number of genetic screens where each element (e.g. plasmid) in a given library can be assessed for its ability to alter a particular biological process when introduced into individual cells. Libraries come in two basic configurations, arrayed and pooled. When a library is arrayed, each element in the library (e.g. a plasmid expressing one particular gene) is housed separately in different tubes that exist within an array. By systematically assaying every element from the array, one can reliably assay all the elements in the library. The other configuration is to mix the different plasmids together in a pool. Working with pooled libraries is often easier and far less expensive. When all the elements of a library are introduced into cells, they can then be subjected to a genetic screen or genetic selection to retrieve a subset of library elements that perturb a subset of genes relevant to the biological process being studied [5][6][7][8]. Pooled libraries can have many different individual elements (e.g. $1\times10^{5}$ to $1\times10^{7}$), and multiple copies of each individual element are present. Pooled libraries rarely have all elements present at equal abundance, however, making it difficult to calculate how all the types of elements/plasmids in a given library can be adequately sampled.
Comprehensive genetic screens rely on experimental conditions that ensure that all the elements in a library are sampled. This is especially true when the results of genetic screens are compared, such that the same library is screened using two or more different conditions. It is also important if conclusions are to be made about the functional role of certain genes whose corresponding elements in the library are not identified in a genetic screen. For instance, if library elements are not found in a genetically selected subpopulation, one cannot conclude that they are irrelevant to the biological function being studied. This is because an alternative explanation is that those library elements were not initially present and sampled in the genetic screen. One parameter important for the design of such screens is a proper estimate of how many cell recipients (e.g. transformants, transfectants or transductants) need to be generated so that the entire repertoire of elements within the library is represented in the starting population [6,9,10]. Without knowing the content of a library it is easy to make an inappropriate estimate [6,9,10]. Many of the current approaches that exploit these libraries incorporate analysis with high-throughput, massively parallel, deep sequencing that can quantitatively monitor the relative levels of each library element [1,2]. High-throughput sequencing can also be used to characterize the initial plasmid or virus library, providing the identity and abundance of each element, and thus provide the information required to calculate an adequate starting size for the library recipient cell population. Making that calculation becomes complicated when each element of the library is present at different levels. For instance, with a library of only 20 different elements, where each is equally abundant, it takes 135 picks to have at least a 98% chance of obtaining at least one of each. Yet if one of those elements is 100 times more abundant than another, it would take 812 picks. When considering a library with thousands of elements whose abundance can range over 4-5 orders of magnitude, the calculations become more complex just as they become more critical to the success of the experiment.
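The equal-abundance figure quoted above can be checked with a short calculation. The following is a minimal Python sketch and is not part of MALTA; the function name `picks_for_full_set_equal` is hypothetical. It tracks, pick by pick, the probability that exactly m distinct element types have been seen so far, which gives an exact answer only for the special case where all types are equally likely.

```python
def picks_for_full_set_equal(k, target):
    """Smallest n such that n picks from k equally abundant element types
    contain every type with probability >= target (0 < target < 1)."""
    if k < 1 or not (0.0 < target < 1.0):
        raise ValueError("need k >= 1 and 0 < target < 1")
    # state[m] = probability that exactly m distinct types have been seen
    state = [0.0] * (k + 1)
    state[0] = 1.0
    n = 0
    while state[k] < target:
        nxt = [0.0] * (k + 1)
        for m, p in enumerate(state):
            if p == 0.0:
                continue
            nxt[m] += p * m / k                # pick repeats a type already seen
            if m < k:
                nxt[m + 1] += p * (k - m) / k  # pick reveals a new type
        state = nxt
        n += 1
    return n


# 20 equally abundant elements, 98% chance of a full set
n_equal = picks_for_full_set_equal(20, 0.98)
```

With 20 element types and a 98% target this reproduces the 135 picks stated in the text. The uneven case (one element 100 times more abundant than another) is not covered by this shortcut, because the state then depends on which elements have been seen and not only on how many.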
Here we provide a simple-to-use software program that performs these calculations, thus enhancing the ability to conduct comprehensive genetic screens that use complex genetic libraries of known content.
The central goal was to calculate how much a library must be sampled to obtain at least one of every element to within a given level of probability. An example would be calculating how many cells, each carrying a single library element, need to be generated such that there is a 98% chance that every element of that library was transferred to the population of cells. This principle is articulated in the 'coupon collector's' statistics problem as solved by Von Schelling [11]. Here, a collector has access to a large randomized pile comprising k different coupons. One question is how many coupons need to be picked from the pile to get one of a particular kind. When each coupon is equally abundant, the chance of obtaining a particular coupon in each round of selection is $1/k$, whereas the chance of not getting that coupon ($j$) is $1 - 1/k$. As the number of selection rounds or picks ($n$) increases, the chance of not picking a particular kind of coupon decreases ($j^{n}$) just as the chance of obtaining at least one of that kind of coupon increases ($1 - j^{n} = 1 - (1 - 1/k)^{n}$). When the proportions of the different coupons differ, the probability of picking a particular element is not given by $1/k$ but rather by the proportion of that type of element within the overall collection. Thus, the chance of picking a particular plasmid that accounts for 25% of the total population of plasmids in a collection with n picks is given by $1 - (1 - 0.25)^{n}$. The problem of finding how many picks are needed to obtain at least one of every coupon is more difficult to compute because, for each pick, the chance of getting one type of coupon or plasmid versus a different kind is non-mutually exclusive. Summing the probabilities of not getting each element in n tries over-counts the number of picks needed for obtaining the full collection. This is because, for a given round of selection, when one does not pick a particular element, the chance of picking another element goes up.
Thus, for an exact solution, one needs to use the inclusion-exclusion principle of set theory to accurately calculate the union of probabilities for not getting one of each element, the complement of which is the probability of obtaining a full collection. For k elements in the collection ($K_i, K_j, K_k, \ldots, K_n$), this calculation is very computationally intensive and impractical when the number of items is on the order of $10^{3}$ or beyond. Yet the number of items or elements in a typical cDNA, shRNA, or gRNA library is well above $10^{3}$. A current desktop computer running algorithms that have been refined for making such calculations on collections with uneven probabilities is easily overwhelmed by collections beyond $10^{2}$ elements [12]. Therefore, we made a simple stand-alone program, driven by a graphical user interface (GUI), that performs multiple Monte Carlo (MC) simulations of a user-specified number of picks from a given collection (e.g. library) where the proportion of each element is known. This yields an empirically determined mean and standard deviation for each element picked from the total collection. The program also provides an estimated number of picks that it would take to have one of each element in the collection, serving as a starting point for specifying how many picks the MC simulation should conduct.
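For reference, the quantities described above can be written out as follows. The notation is assumed here and does not come from the original text: $p_i$ is the proportion of element $i$ in the collection, and $A_i$ is the event that element $i$ is still absent after $n$ picks. The expansion is the standard inclusion-exclusion identity, not necessarily the exact form used by the authors.

```latex
% p_i : proportion of element i in the collection (assumed notation)
% A_i : event that element i is absent after n picks (assumed notation)
\begin{align}
P(A_i) &= (1 - p_i)^{n} \\
P(\text{full collection after } n \text{ picks})
  &= 1 - P\Bigl(\bigcup_{i=1}^{k} A_i\Bigr)
   = \sum_{S \subseteq \{1,\dots,k\}} (-1)^{|S|}
     \Bigl(1 - \sum_{i \in S} p_i\Bigr)^{n}
\end{align}
```

The sum runs over all $2^{k}$ subsets of elements, which is why a direct evaluation becomes impractical once $k$ reaches the hundreds or thousands.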

Software description

Estimate for the number of picks
The MC simulation process is good at performing multiple selection experiments with a user-specified number of picks (n), but it is too computationally cumbersome to iterate through increasing n to arrive at an acceptable number that would yield full coverage. Thus, MALTA is equipped to make an estimate for n that can then be used to specify how many picks will be performed by the MC simulation. Two ways are used to make this estimate (Figure 1). One is to simply treat the probability of obtaining each element after n picks as mutually exclusive possibilities, analogous to a naïve Bayes classifier model. Although this overestimates the minimal n, using this n as a first approximation is easily verified empirically using the MC simulation function. In addition, the error incurred by treating the selection of every element as a mutually exclusive probability diminishes with an increasing number of elements that each have a low proportion in the population. Indeed, this is precisely the situation for complex shRNA or CRISPR gRNA libraries that have $10^{4}$ to $10^{7}$ elements. This method can estimate the probability of obtaining at least one of each element from the collection (e.g. library) with a given number of picks. MALTA terms this 'Completeness of Items', and calculates it by increasing the number of picks until the estimate, one minus the sum of the per-element probabilities of being missed, reaches the user-specified target entered into the 'Coverage' field. A second way is to use a principle of Heaps' law as described by Ferrante and Frigo [13]. Here, instead of estimating the number of picks needed to achieve a given probability of obtaining all elements in a collection, the estimate is how many picks are required to obtain a given percentage of the available elements. How these estimation methods compare with an exact solution, calculated using a previously described computational method [9], is shown in Figure 1. MALTA reports both of these estimates but suggests the estimated number of picks based on the former method of summing non-mutually exclusive probabilities. This yields a rough estimate of an appropriate number of picks that can then be used to initiate the MC simulation.
The MALTA procedure for estimating 'Completeness of Items' is outlined below:
Step 2. If X is greater than $1 - (j_{\mathrm{Gene1}}^{n} + j_{\mathrm{Gene2}}^{n} + j_{\mathrm{Gene3}}^{n} + j_{\mathrm{Gene4}}^{n})$, increase n 2-fold and repeat Step 2.
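The doubling procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MALTA's C++ source: X is taken to be the 'Coverage' target, each j is taken to be one minus that element's proportion, and the starting value of n (`n_start`) is a guess, because the first step of the procedure is not given in the text. The names `completeness` and `estimate_picks` are hypothetical.

```python
def completeness(props, n):
    """Estimate of P(every element picked at least once in n picks):
    1 minus the sum of per-element miss probabilities (1 - p_i)^n."""
    return 1.0 - sum((1.0 - p) ** n for p in props)


def estimate_picks(counts, coverage, n_start=1):
    """Double n until the completeness estimate reaches the coverage target.
    n_start is an assumption; the source does not state the initial n."""
    if not (0.0 < coverage < 1.0):
        raise ValueError("coverage must lie strictly between 0 and 1")
    if not counts or any(c <= 0 for c in counts):
        raise ValueError("every element needs a positive abundance")
    total = float(sum(counts))
    props = [c / total for c in counts]
    n = n_start
    while coverage > completeness(props, n):
        n *= 2  # Step 2: increase n 2-fold and test again
    return n


# 20 equally abundant elements with a 98% coverage target
n_est = estimate_picks([1] * 20, 0.98)
```

Because n only takes values that are powers of two times the starting value, the result (256 in this example) can overshoot the smallest sufficient n (135 for this example) by up to a factor of two, which is consistent with the text describing it as a rough estimate to be checked by simulation.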

File Format
MALTA was written in C++ and provides a simple graphical user interface. Versions of the program are available that run on Linux, MacOS, and Windows. The input file is a comma-separated values (.csv) file in which the first row should contain the following headings: 'Item type' for Column A and 'value' for Column B. Values can be raw counts of each item, or the percentage or parts per million of each item in the population (Figure 2). Either will work since MALTA will determine the sum of values for all items and calculate the individual probabilities for each item. A menu command in MALTA allows for the import of the file using normal operating system standards. Once the file is loaded, a display of the data in bar graph form shows the rank order of abundance and the log2-transformed abundance in the main window. The user can then adjust the level of 'Coverage'. MALTA uses the 'Completeness of Items' calculation to find the approximate number of picks needed to obtain a full set of elements with the probability specified by the value entered by the user under 'Coverage' (Figure 3A feature c). The highest value is 99%. For every adjustment, MALTA will estimate the number of picks (n) and then use that n to calculate 'Completeness of Items', which gives the overall probability of picking at least one of all items, assuming that each is determined by an independent probability. The estimated n is then used to automatically populate the dialog window, which sets the number of picks that will be subjected to Monte Carlo simulations. Users can override that estimate and populate the 'Sim Picks' field with their own number of picks.
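To make the normalization step concrete, here is a minimal Python sketch that reads a two-column file of the kind described and converts the values to per-item probabilities. It is not MALTA's own parser; the function name `read_library` and the sample rows are hypothetical, and the only behavior taken from the text is that the first row is a header and that each value is divided by the sum of all values.

```python
import csv
import io


def read_library(handle):
    """Return [(item, probability), ...] from a two-column CSV with a header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row ('Item type', 'value')
    items, values = [], []
    for row in reader:
        if len(row) < 2 or not row[0].strip():
            continue  # ignore blank or incomplete rows
        items.append(row[0].strip())
        values.append(float(row[1]))
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    # raw counts, percentages, or ppm all normalize to the same probabilities
    return [(item, v / total) for item, v in zip(items, values)]


# Hypothetical input: raw counts for three items
sample = "Item type,value\nGene1,50\nGene2,25\nGene3,25\n"
library = read_library(io.StringIO(sample))
```

Passing an open file (for example `open(path, newline="")`) in place of the `io.StringIO` object works the same way.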

Adjusting subsets of input data
Some complex libraries may have a number of elements that are of exceedingly low abundance in proportion to the other elements in the library. To pick one of each of these elements, a very large number of picks would have to be employed, and that may not be feasible experimentally. Thus, for the purposes of the calculation, it may be desirable to simply bin these low-abundance elements together, which allows this group of low-abundance elements to be treated as a single probability and causes the 'Completeness of Items' estimate to be lower. Here the population is unaltered; however, MALTA does not estimate the picks needed to get each individual item found within the binned sub-threshold population. To make these adjustments, MALTA includes sliders (Figure 3A feature a) that act as 'Population Curtains', effectively binning higher-abundance and/or lower-abundance items together so that MALTA can ignore the requirement to get each one of the binned items. This is shown in Figure 3B, where the Population Curtains are adjusted to exclude high-abundance and low-abundance items. While the same input data is used as for Figure 3A, only the highlighted items are treated as individual items for which the calculated picks and probabilities are considered. There are two tabs in the upper right-hand corner termed 'Input Data' and 'Monte Carlo Counts' (Figure 3C). The 'Input Data' tab simply displays the data extracted from the .csv file and labels it according to the user-specified header in the top row of the .csv file. The 'Monte Carlo Counts' tab shows the mean and standard deviation for the number of counts each item achieved. Some populations can have a large variance in the abundance of each item such that there are many very low abundance items.
As an alternative to the Population Curtains, there is the 'Subpopulation Cutoff' feature (Figure 3 item b). Unlike the Population Curtains, which include all the input data to calculate an estimate but bin some elements together, the 'Subpopulation Cutoff' extracts a subset of the data to calculate an estimate for the number of picks. Here one can set a threshold value such that any element under that threshold is eliminated from the estimate for the number of picks. By raising this threshold, the estimated number of picks is reduced. Importantly, this operation ignores data from elements below the specified threshold, affecting not only the estimate but also any Monte Carlo simulations. Users can specify a value up to 90% of the most abundant element, a limit that avoids eliminating the entire set of elements on which to perform subsequent calculations. The full dataset can be restored by resetting the 'Subpopulation Cutoff' to zero.
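The difference between the two adjustments can be illustrated with a small Python sketch. It rests on several assumptions that the text does not confirm: a single abundance threshold stands in for the curtain sliders (only the low-abundance side is shown), the curtain keeps the full population as the denominator, and the cutoff renormalizes over the retained elements. The function names `bin_below` and `cut_below` are hypothetical.

```python
def bin_below(counts, threshold):
    """Curtain-style binning (assumed behavior): the population is unaltered,
    so proportions keep the full denominator. Returns the proportions of the
    individually tracked items and the pooled proportion of the binned items."""
    total = float(sum(counts))
    tracked = [c / total for c in counts if c >= threshold]
    binned = sum(c for c in counts if c < threshold) / total
    return tracked, binned


def cut_below(counts, threshold):
    """Cutoff-style filtering (assumed behavior): sub-threshold items are
    removed and proportions are recomputed over the remaining items only."""
    kept = [c for c in counts if c >= threshold]
    if not kept:
        raise ValueError("threshold removes every element")
    total = float(sum(kept))
    return [c / total for c in kept]


# Hypothetical abundances for five items, threshold of 10
counts = [50, 30, 15, 3, 2]
tracked, binned = bin_below(counts, 10)
kept = cut_below(counts, 10)
```

In this toy example the three tracked items keep proportions of 0.50, 0.30 and 0.15 under binning, with 0.05 pooled in the bin, whereas after the cutoff the same three items are rescaled to sum to 1.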

Monte Carlo simulation
The user inputs described above help generate an estimate for the number of picks that MALTA can then use to run a series of Monte Carlo simulations. Each of these operations will automatically adjust the number found in the MC Picks field (Figure 3A feature d). Users can nevertheless specify a different number of picks to be subjected to MC simulations by entering a number into the 'Sim. Picks' field. Users can also specify the number of MC simulations by entering a number between 1 and 1000 in the 'Sim.Runs' field. When 'Simulate' is clicked, MC simulations are performed, and the results are expressed as the average across all MC simulation runs together with the standard deviation of that set. These results are accessed by clicking the 'MC Count' tab. The MC simulation computation is based on a stochastic acceptance-based algorithm [14]. In pseudocode, the computation is:
Sim.Picks = user-specified number of picks per MC simulation.
Sim.Runs = user-specified number of MC simulations.
Step 3. Create a random number between 0 and 1.
Step 4. Find the gene that has the lowest C.Prop that is also greater than or equal to the random number.
If the random number is 0.5, the corresponding match is Gene2.
If the random number is 0.75, the corresponding match is Gene3.
Step 5. Increment the corresponding gene count by 1.
Step 6. Repeat Steps 3 to 5 'Sim.Picks' times to complete 1 simulation.
Step 7. Repeat Step 6 'Sim.Runs' times to complete multiple simulations.
Step 8. Calculate the average and standard deviation of each gene count across the multiple simulations.
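The listed steps can be sketched in Python as follows. This is an illustration and not MALTA's C++ implementation. It assumes that C.Prop is the running (cumulative) proportion of each gene, and it uses a binary search from the standard library for Step 4. The function names and the four-gene cumulative proportions (0.4, 0.7, 0.9, 1.0) used in the usage lines are hypothetical, chosen only so that 0.5 maps to Gene2 and 0.75 to Gene3 as in the text.

```python
import bisect
import itertools
import random
import statistics


def pick_index(c_prop, r):
    """Step 4: index of the lowest cumulative proportion that is >= r."""
    return bisect.bisect_left(c_prop, r)


def simulate(counts, sim_picks, sim_runs, seed=None):
    """Steps 3 to 8: return per-gene mean and standard deviation of pick counts."""
    rng = random.Random(seed)
    total = float(sum(counts))
    c_prop = list(itertools.accumulate(c / total for c in counts))
    c_prop[-1] = 1.0  # guard against floating-point round-off in the last entry
    runs = []
    for _ in range(sim_runs):                          # Step 7
        tally = [0] * len(counts)
        for _ in range(sim_picks):                     # Step 6
            r = rng.random()                           # Step 3
            tally[pick_index(c_prop, r)] += 1          # Steps 4 and 5
        runs.append(tally)
    per_gene = list(zip(*runs))
    means = [statistics.mean(col) for col in per_gene]           # Step 8
    sds = [statistics.stdev(col) if sim_runs > 1 else 0.0 for col in per_gene]
    return means, sds


# Hypothetical cumulative proportions for Gene1..Gene4
example_c_prop = [0.4, 0.7, 0.9, 1.0]
gene_for_half = pick_index(example_c_prop, 0.5)            # index 1, i.e. Gene2
gene_for_three_quarters = pick_index(example_c_prop, 0.75)  # index 2, i.e. Gene3
means, sds = simulate([40, 30, 20, 10], sim_picks=1000, sim_runs=5, seed=0)
```

Passing a `seed` makes a run reproducible, which is convenient when comparing the effect of different numbers of picks on the same library.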
The stochastic acceptance roulette-wheel selection algorithm implemented in MALTA allows the MC simulations to be computed efficiently. This algorithm is more efficient and less memory intensive than linear search algorithms that search an array populated with multiples of each element in proportion to their abundance in the population, which can create very large arrays when large populations with a wide range of proportions are analyzed [14]. MALTA completed 50 MC simulations of $1.5\times10^{7}$ picks each on a library of 10,000 elements whose abundance ranges across 4 orders of magnitude in about 2 minutes when run on a 3.2 GHz quad-core Intel Core i5 processor.
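For readers unfamiliar with the technique, roulette-wheel selection by stochastic acceptance can be sketched as below. This is a generic Python rendering of the general idea (choose a candidate uniformly at random, then accept it with probability equal to its weight divided by the largest weight), not MALTA's code; the function name and argument layout are hypothetical.

```python
import random


def pick_by_stochastic_acceptance(weights, w_max, rng):
    """Return an index drawn with probability proportional to weights[i].
    w_max must be the largest weight; no expanded array of copies and no
    cumulative table is needed."""
    n = len(weights)
    while True:
        i = rng.randrange(n)                    # uniform candidate
        if rng.random() < weights[i] / w_max:   # accept in proportion to weight
            return i


# Hypothetical weights: the second item is three times as abundant as the first
rng = random.Random(0)
draws = [pick_by_stochastic_acceptance([1, 3], 3, rng) for _ in range(20000)]
fraction_second = draws.count(1) / len(draws)
```

Memory use is limited to the list of weights. The trade-off is that candidates with small weights are often rejected, so the expected number of attempts per pick grows as the largest weight rises relative to the average weight.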

Illustrative example
Included in the repository for the MALTA software are 3 example datasets. The third describes an shRNA library we obtained, made in a lentivirus vector and containing 9,303 different elements, where each element is a different shRNA sequence designed to suppress expression of a particular human gene. This shRNA library was characterized by next-generation sequencing to identify each element present and to quantify its proportion in the population. The library elements have varying abundance, with the most abundant element enriched over 4000-fold compared with the lowest-abundance elements. The estimate MALTA produces indicates that to have a 97 percent chance of obtaining at least one of everything in the shRNA library, one would need over $1.5\times10^{7}$ picks. With the 'number of picks' set to $1.5\times10^{7}$, the MC simulations verify this estimate, showing empirically that the elements with the lowest abundance are selected an average of ~5 ± 2 times across multiple MC simulations (Figure 4). Setting the MC simulations to only $1.0\times10^{6}$ picks shows that the lower-abundance elements are not selected, with an average well below 1.
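These simulation results are in line with what a simple binomial model predicts: if each pick independently selects an element with probability p, the number of copies obtained in n picks has mean n·p and standard deviation sqrt(n·p·(1 − p)). The Python sketch below illustrates this. The proportion `p_rare` is hypothetical, back-calculated from the reported average of about 5 copies at $1.5\times10^{7}$ picks, and is not taken from the library data; the function names are also hypothetical.

```python
import math


def expected_copies(proportion, picks):
    """Mean and standard deviation of the copy number of one element,
    assuming independent picks with a fixed per-pick probability."""
    mean = picks * proportion
    sd = math.sqrt(picks * proportion * (1.0 - proportion))
    return mean, sd


def prob_absent(proportion, picks):
    """Probability that the element is never picked in the given number of picks."""
    return (1.0 - proportion) ** picks


# Hypothetical proportion of a lowest-abundance element (assumption, see lead-in)
p_rare = 5 / 15_000_000

mean_full, sd_full = expected_copies(p_rare, 15_000_000)  # about 5 and about 2.2
mean_low, sd_low = expected_copies(p_rare, 1_000_000)     # about 0.33
missing_low = prob_absent(p_rare, 1_000_000)              # about 0.72
```

Under this assumed proportion the model gives roughly 5 ± 2 copies at $1.5\times10^{7}$ picks and an average of about 0.3 copies at $1.0\times10^{6}$ picks, where such an element would be missing from most simulated populations.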
Extending this procedure, a user can also model how well the library would be sampled if the goal were only to transfer the more abundant elements, without configuring the experiment to cover all elements regardless of how rare they are in the population. Adjusting the right-hand 'Population Curtain' to bin low-abundance elements together has dramatic effects on the estimate MALTA makes for coverage. For instance, estimating the number of picks to cover the most abundant 9,211 elements in the library, rather than all 9,303 elements, to a coverage of 99% produces a target of only 786,432 picks, down from $2.5\times10^{7}$ picks for the full library. By making these adjustments, users can sample a number of scenarios and choose which ones make the most sense for their genetic experiments.
One can also use MALTA to better understand what type of coverage was achieved with a given number of transformants obtained in an experiment. Here, setting the number of 'Sim Picks' to 1 million for the MC simulations shows that coverage of the library would still be obtained, except for the lowest-abundance elements. This type of calculation is helpful retrospectively, allowing one to appreciate how much library coverage was obtained once the efficiency of viral transduction is known.
For some experiments, investigators may want to ensure that multiple copies of each library element are recovered in the recipient cell population. The estimator function in MALTA suggests a number of picks to ensure 1 copy of each element in the cell population. That number is then validated by running the Monte Carlo simulations. Users can increase the number of picks the MC simulation will perform, allowing them to find how many picks are required to obtain multiple copies of each library element. In practice, an estimate for 10-fold coverage of a library can be obtained by multiplying the estimated value by 10, that is, by appending a '0' to the number in the 'Sim. Picks' query box prior to running the MC simulation.

Impact
MALTA provides an easy-to-use way to evaluate the complexity of libraries used for genetic screens and will help investigators appreciate and accommodate the appropriate scale for their experiments. One of the important parameters of using shRNA, CRISPR, cDNA or other libraries effectively is to ensure that the library is adequately sampled, such that the majority of elements are reproducibly transferred to the cells of interest so that the majority of elements can undergo genetic evaluation. The example library above represents part of a recent lentivirus shRNA library that we made. This library was synthesized in 12 pools of ~10,000 elements each, with each pool having different elements or shRNA sequences.
Although this is only a relatively small library, the varying proportion of each element (varying over 4 orders of magnitude) obliges one to transduce ~$1\times10^{7}$ lentiviruses from the library to ensure obtaining at least one of all elements. The requirement for such a high number can be hard to appreciate in the absence of the calculation that MALTA can make. Notably, there is an expanding number of similar libraries constructed to allow for comprehensive genetic screening. Examples include several libraries encoding CRISPR gRNA sequences that can combine with Cas9 and inactivate, inhibit, or stimulate different genes [15][16][17][18]. Many of these libraries contain far more elements (100,000-200,000) than the shRNA library used as the example in Figure 4. In addition, procedures for expanding (amplifying) the amount of the library are known to exacerbate differences in the relative abundance of each element. The combination of these factors underscores the necessity of using a tool like MALTA to ensure that the experimental procedures used accommodate the scale required to adequately sample the entire library and allow quantitative comparisons.

Conclusions
The expanding variety of DNA-based libraries used to conduct comprehensive genetic screens offers investigators powerful tools to understand a widening range of biological processes. The availability of next-generation sequencing techniques to fully characterize those libraries in terms of the number of elements they contain and the relative proportion of those elements can provide the necessary data for a program like MALTA to calculate the required scale for experiments that rely on sampling those libraries adequately. The mathematical solution to these calculations, which follows the principle of the well-established coupon collector's problem, is too computationally intensive for libraries (collections) of this type. Thus, MALTA allows for an empirical assessment by predicting the coverage of a library using a series of Monte Carlo simulations. The number of picks performed in the simulations can be estimated by MALTA, and users can also refine the initial estimate to cover only a subset of elements in the library or alter the predicted probability of covering all elements in the library.

A. Graphical user interface of MALTA displaying data for 100 items with differing levels of abundance. User-controlled features include:
a) "Population Curtains" that allow data above and below the selected threshold to be binned together separately. When used, MALTA does not calculate the number of picks required to retrieve one of all the individual items within this binned population.
b) "Coverage" is the user-specified probability of getting at least one of each item within the highlighted population (pink bars).
c) "Sim Picks" is the number of picks that will be used for the Monte Carlo simulation.
d) "Sim Runs" is the user-specified number of times the Monte Carlo simulation will be performed.
e) Starts the Monte Carlo simulations. Results from multiple runs are then averaged and a standard deviation calculated.
f) Tabs to view the input data and the results from the simulations.