Given a library of L sequences, each a variant of a parent sequence of N nucleotides into which random point mutations have been introduced, we wish to calculate the expected number of distinct sequences in the library. (The calculation typically assumes L > 10, N > 5, and a mean number of mutations per sequence m < 0.1 x N.)
Saab-Rincon et al. (2001, Protein Eng., 14, 149-155) constructed a library of 5 million clones with a single round of epPCR on a 700 bp gene. Sequencing 10 of these clones indicated an error rate of 3-4 nucleotide substitutions per daughter sequence. Entering L = 5000000, N = 700 and m = 3.5 into the base PEDEL server page, and clicking 'Calculate', shows that the expected number of distinct sequences in the library is 4.153 x 10^6, or about 4.2 million.
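The figure quoted above can be approximated offline with a short script. This is only a sketch, not the server's actual code: it assumes that the number of substitutions per sequence is Poisson-distributed with mean m, that a sequence with x substitutions is drawn uniformly from Vx = C(N, x) x 3^x possible variants, and that the expected number of distinct sequences among Lx draws is roughly Vx(1 - exp(-Lx/Vx)). The function name `expected_distinct` and the cutoff on x are illustrative choices.

```python
import math

def expected_distinct(L, N, m, x_max=None):
    """Approximate expected number of distinct sequences in the library.

    Assumed model (not taken from the server's source code):
    Poisson-distributed substitutions, equiprobable variants within each
    sub-library, and C_x ~ V_x * (1 - exp(-L_x / V_x)).
    """
    if x_max is None:
        # Assumed cutoff: the Poisson tail beyond this point is negligible.
        x_max = min(N, int(m + 10 * math.sqrt(m) + 30))
    total = 0.0
    for x in range(x_max + 1):
        # log L_x, where L_x = L * Poisson(x; m)
        log_Lx = math.log(L) - m + x * math.log(m) - math.lgamma(x + 1)
        # log V_x, where V_x = C(N, x) * 3**x (logs avoid float overflow)
        log_Vx = (math.lgamma(N + 1) - math.lgamma(x + 1)
                  - math.lgamma(N - x + 1) + x * math.log(3))
        Lx = math.exp(log_Lx)
        ratio = math.exp(log_Lx - log_Vx)  # L_x / V_x
        if ratio < 1e-12:
            Cx = Lx  # V_x is so large that essentially every clone is unique
        else:
            # Equivalent to V_x * (1 - exp(-L_x / V_x))
            Cx = Lx * (-math.expm1(-ratio)) / ratio
        total += Cx
    return total

# Worked example from the text: L = 5000000, N = 700, m = 3.5
print(f"{expected_distinct(5_000_000, 700, 3.5):.3e}")
```

Under these assumptions the script gives a value close to the 4.153 x 10^6 reported by the server.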
If you follow the link to 'detailed statistics' and, once again, enter L = 5000000, N = 700 and m = 3.5 and click 'Calculate', you get a breakdown of library statistics for each of the sub-libraries comprising all those daughter sequences with exactly x base substitutions (x = 0, 1, 2, 3, ...).
For example, the first line of the table shows that Px = 3.02% of the library (i.e. Lx = 1.51 x 10^5 daughter sequences) have x = 0 base substitutions (i.e. they are identical to the parent sequence). The total number of possible variants with 0 base substitutions is, of course, Vx = 1 (just the parent sequence), and the number of distinct sequences with 0 base substitutions present in the library is, similarly, Cx = 1. The completeness of the x = 0 sub-library is Cx/Vx = 100%. The redundancy of this sub-library, i.e. wasted duplication, is Lx-Cx = 1.51 x 10^5.
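The columns described above can be tabulated for small x with the following sketch. It is an illustration under assumptions rather than the server's implementation: Px is taken to be a Poisson probability, Vx is assumed to be C(N, x) x 3^x, and Cx is approximated as Vx(1 - exp(-Lx/Vx)). The function name `sublibrary_stats` and the dictionary keys are hypothetical.

```python
import math

def sublibrary_stats(L, N, m, x):
    """Statistics for the sub-library with exactly x substitutions.

    Assumed model: Poisson P_x, V_x = C(N, x) * 3**x equiprobable variants,
    C_x ~ V_x * (1 - exp(-L_x / V_x)). Intended for small x only, because
    V_x is converted to a float in the division below.
    """
    Px = math.exp(-m + x * math.log(m) - math.lgamma(x + 1))
    Lx = L * Px
    Vx = math.comb(N, x) * 3 ** x  # exact integer count of variants
    ratio = Lx / Vx
    if ratio < 1e-12:
        Cx = Lx  # sub-library is far too small to show any duplicates
    else:
        Cx = Vx * (-math.expm1(-ratio))
    return {
        "Px": Px,
        "Lx": Lx,
        "Vx": Vx,
        "Cx": Cx,
        "completeness": Cx / Vx,
        "redundancy": Lx - Cx,
    }

# Worked example from the text: L = 5000000, N = 700, m = 3.5
print(" x      Px         Lx         Vx         Cx   Cx/Vx      Lx-Cx")
for x in range(6):
    s = sublibrary_stats(5_000_000, 700, 3.5, x)
    print(f"{x:>2} {s['Px']:7.4f} {s['Lx']:10.3e} {float(s['Vx']):10.3e} "
          f"{s['Cx']:10.3e} {s['completeness']:7.2%} {s['redundancy']:10.3e}")
```

For x = 0 this reproduces the values quoted above (Px of about 3.02%, Lx of about 1.51 x 10^5, Vx = Cx = 1).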
You also have the option to plot these data by following the 'Plot this data' link. Choose the statistic to plot and whether or not to use a log scale on the y-axis. For example, a plot of Px or Lx gives a Poisson distribution. A plot of Vx shows how the number of possible variants increases very rapidly as the number of base substitutions is increased. A plot of Cx shows how the expected number of distinct sequences in the sub-libraries initially increases (limited by the number of possible variants, Vx) and then decreases (limited by the size of the sub-library, Lx). A plot of Lx-Cx shows the extent of wasted duplication in the lower x-value sub-libraries.
Returning to the base PEDEL server page, you can follow links to plot the expected number of distinct sequences in a library for a range of mutation rates, library sizes or sequence lengths. The third option probably won't be very useful, but the first two will help you to decide what library size to aim for in order to obtain a given diversity, and what mutation rate to use to maximize the diversity for a given library size.
For example, follow the 'mutation rates' link, enter L = 5000000, N = 700 and m = 0.2 - 20, and click 'Calculate'. From the plot, you can see that the expected number of distinct sequences increases rapidly with m until m ~ 5, and then levels off with < 10% redundancy in the library. On the other hand, if you chose m ~ 1.5, then the library would be about 60% redundant. After selecting an optimal mutation rate m, you can go back to the 'detailed statistics' page to check the expected completeness of the x = 0, 1, 2, 3, ... sub-libraries.
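The trend in the mutation-rate plot can be explored with a small scan over m. As before, this is a sketch under assumptions and not the server's own algorithm: it assumes Poisson-distributed substitutions, Vx = C(N, x) x 3^x equiprobable variants per sub-library, and Cx of roughly Vx(1 - exp(-Lx/Vx)). The function names `distinct` and `redundancy` are illustrative, and redundancy is taken here to mean the fraction (L - C)/L of the library that duplicates other clones.

```python
import math

def distinct(L, N, m):
    """Assumed model: sum over x of V_x * (1 - exp(-L_x / V_x))."""
    total = 0.0
    # Assumed cutoff on x; the Poisson tail beyond it is negligible.
    x_max = min(N, int(m + 10 * math.sqrt(m) + 30))
    for x in range(x_max + 1):
        log_Lx = math.log(L) - m + x * math.log(m) - math.lgamma(x + 1)
        log_Vx = (math.lgamma(N + 1) - math.lgamma(x + 1)
                  - math.lgamma(N - x + 1) + x * math.log(3))
        Lx = math.exp(log_Lx)
        ratio = math.exp(log_Lx - log_Vx)  # L_x / V_x
        if ratio < 1e-12:
            total += Lx  # effectively no duplicates in this sub-library
        else:
            total += Lx * (-math.expm1(-ratio)) / ratio
    return total

def redundancy(L, N, m):
    """Fraction of the library that is duplicated, (L - C) / L."""
    return 1.0 - distinct(L, N, m) / L

# Scan over the range used in the text: L = 5000000, N = 700, m = 0.2 - 20
for m in (0.2, 0.5, 1.0, 1.5, 2.0, 3.5, 5.0, 10.0, 20.0):
    c = distinct(5_000_000, 700, m)
    print(f"m = {m:5.1f}   distinct = {c:10.3e}   "
          f"redundancy = {redundancy(5_000_000, 700, m):6.1%}")
```

Under these assumptions the scan shows the same qualitative picture as the plot: roughly 60% redundancy near m = 1.5 and under 10% from about m = 5 upwards.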
PEDEL uses a generic Poisson model of sequence mutations. There are a couple of simplifications that you should be aware of:
A good review of the sources of bias in epPCR (and other directed evolution protocols) can be found in Neylon C., 2004, Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution, Nucleic Acids Res., 32, 1448-1459.