Hamming Distance as a Concept in DNA Molecular Recognition

DNA microarrays constitute an in vitro example system of a highly crowded molecular recognition environment. Although they are widely applied in many biological applications, some of the basic mechanisms of the hybridization processes of DNA remain poorly understood. On a microarray, cross-hybridization arises from similarities of sequences that may introduce errors during the transmission of information. Experimentally, we determine an appropriate distance, called minimum Hamming distance, in which the sequences of a set differ. By applying an algorithm based on a graph-theoretical method, we find large orthogonal sets of sequences that are sufficiently different not to exhibit any cross-hybridization. To create such a set, we first derive an analytical solution for the number of sequences that include at least four guanines in a row for a given sequence length and eliminate them from the list of candidate sequences. We experimentally confirm the orthogonality of the largest possible set with a size of 23 for the length of 7. We anticipate our work to be a starting point toward the study of signal propagation in highly competitive environments, besides its obvious application in DNA high throughput experiments.


■ INTRODUCTION
Molecular recognition in the crowded environment of DNA microarrays plays an important role in processing information. Recognition often requires the discrimination of one specific molecule among many similar, competing molecules. In 1894, Emil Fischer proposed the lock and key model to describe the recognition of an enzyme and a substrate. 1 According to this model, the substrate possesses the perfect size and shape to fit the active site of its complement. However, in crowded environments, binding between noncomplementary molecules may occur and introduce errors. For DNA, specific binding of two single strands, that is, the formation of a stable double helix, occurs only if the bases A and T as well as C and G pair along the sequence. DNA microarrays are a widely used platform that, besides many applications in medicine and biology, enables the study of the fundamentals of DNA hybridization. 2−10 These microarrays consist of single-stranded DNA oligonucleotides (probes) immobilized on a surface. If these probes are exposed to a bulk mixture of fluorescently labeled target sequences, only complementary targets are expected to hybridize. However, hybridization of a probe to a noncomplementary target still occurs, albeit with a lower binding affinity than for the corresponding perfectly matching sequence. Therefore, similarities among probes can lead to a significant amount of nonspecific cross-hybridization. On a DNA microarray with complex target mixtures, imperfect recognition introduces noise and makes results difficult to interpret.
The kinetics of hybridization in the presence of competitors and the importance of cross-hybridization for quantitative interpretation of microarray data have been intensely studied, 11−13 especially for the purpose of single nucleotide polymorphism detection and the accurate assessment of gene expression levels. 14−17 One strategy to avoid cross-hybridization is to construct sets of probes with minimized pairwise competition so that they do not cross-hybridize. Such probes are often referred to as orthogonal. Previous theoretical research 18−24 developed different strategies to find sets of orthogonal sequences. The most intuitive approach to deciding which sequences cross-hybridize is based on the free energy difference between the perfectly matched and mismatched hybridization. 25 However, estimating free energies led to poor predictions of hybridization intensities on microarrays. 26 In this work, we apply a well-known local search algorithm and implement graph-theoretical methods to find such sets. Following the concept of Hamming distance from coding theory, we consider that two sequences do not cross-hybridize if they differ by at least a certain number of bases. This threshold is called the minimum Hamming distance d. 27 We determine a suitable d experimentally. One of the fundamental problems in coding theory is finding the maximum size of a code, where a code is a set of codewords with length L and minimum Hamming distance d. 28 In analogy, we experimentally and theoretically find maximal independent sets (MIS) of (i.e., orthogonal) sequences with a certain minimum Hamming distance that can coexist on a microarray without exhibiting cross-hybridization.

■ RESULTS AND DISCUSSION
Theoretical Results. For a strand of L bases drawn from the four DNA bases (A, C, G, T), there are $4^L$ distinct sequences. However, some of these sequences exhibit undesired structures that prevent them from binding to their complement. An example is sequences with runs of at least four guanines, which we call 4G sequences. These sequences are capable of forming complex structures such as G-quadruplexes, which restrict hybridization. Moreover, they have abnormal affinities and tend to show increased cross-hybridization and reduced target-specific hybridization, which makes the measurement of gene expression unreliable. 29−31 Therefore, in this work, we eliminate 4G sequences and their complementary 4C sequences. The number of sequences of a given length L that exhibit at least one run of 4G, $N_{4G}(L)$, is given by eq 1, where the sum represents the number of sequences that are not 4G. The quadrinomial coefficient equals the number of permutations of L−k guanines within a sequence length of L (for the derivation of eq 1, see section S1).
To verify eq 1, we numerically calculate $N_{4G}(L)$ by generating all $4^L$ sequences for a given L ≤ 7 and identifying the ones that contain a 4G run. Figure 1 illustrates $N_{4G}(L)$ in comparison to the total number of sequences $4^L$ for different lengths. As depicted in Figure 1, for L ≤ 7, the fraction of 4G sequences stays below 1.5%, whereas for longer lengths it rises, so that for L ≥ 200, around 50% of all possible sequences contain 4G structures.
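The numerical check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the recurrence used as a cross-check is a standard counting argument for run-free strings that we assume is equivalent to eq 1 (it is not the quadrinomial form derived in section S1).

```python
from itertools import product

BASES = "ACGT"

def count_4g_bruteforce(length: int) -> int:
    """Enumerate all 4^length sequences and count those containing 'GGGG'."""
    return sum("GGGG" in "".join(seq) for seq in product(BASES, repeat=length))

def count_4g_recurrence(length: int) -> int:
    """Count 4G sequences as 4^L minus the number of GGGG-free sequences.

    free[n] is the number of length-n sequences without a run of four G.
    For n >= 4, such a sequence ends in 0 to 3 guanines preceded by one of
    three non-G bases, which gives free[n] = 3 * (free[n-1] + ... + free[n-4]).
    """
    free = [4 ** n for n in range(4)]  # lengths 0..3 cannot contain GGGG
    for n in range(4, length + 1):
        free.append(3 * sum(free[n - 4:n]))
    return 4 ** length - free[length]

if __name__ == "__main__":
    for length in range(4, 8):
        n_4g = count_4g_bruteforce(length)
        # Both counting methods should agree for every length
        assert n_4g == count_4g_recurrence(length)
        print(length, n_4g, f"{100 * n_4g / 4 ** length:.2f}%")
```

For L = 7 both methods give 208 of 16384 sequences, about 1.27%, which is consistent with the bound of 1.5% stated above. The recurrence also allows longer lengths to be evaluated without enumeration.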
The second category of sequences that will not contribute to recognition are self-complementary sequences. We neglect them because we work with short sequences, for which self-complementarity plays only a minor role. 32−34 For longer lengths, however, this must be considered.
Coding theory is a branch of mathematics that studies codes and their properties for different applications. A code is a set of codewords. The length L of a codeword is the number of letters that make up the codeword, where the letters are taken from an alphabet. In our case, DNA sequences are the codewords, and L is the number of bases (A, C, G, T) that make up the sequence. The number of positions at which two codewords of the same length differ is the Hamming distance. 27 For DNA sequences, we define this distance as the number of bases by which they differ. We assume that for every sequence of a given length there is a minimum Hamming distance d such that there is no cross-hybridization as long as the Hamming distance k is greater than or equal to d. If two sequences differ by fewer than d bases, they may cross-hybridize. For a given sequence, N(d) is the number of sequences from which one can choose a competitor with k ≥ d. N(d) decreases with increasing d (Figure 2). Equation 2 gives, for a given length L, the number of sequences $P_L(k)$ at Hamming distance k from a given sequence. Figure 2 depicts N(d), obtained by summing $P_L(k)$ over all k ≥ d, for L = 7 and a given minimum Hamming distance.
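The quantities defined above can be illustrated with a short sketch. The function names are hypothetical, and the expression used for $P_L(k)$ is the standard counting result for a four-letter alphabet (choose the k differing positions, then one of three other bases at each); we assume, but cannot confirm from the text here, that this is the form of eq 2.

```python
from math import comb

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def p_l(length: int, k: int, alphabet_size: int = 4) -> int:
    """Number of sequences at Hamming distance exactly k from a given one."""
    return comb(length, k) * (alphabet_size - 1) ** k

def n_d(length: int, d: int) -> int:
    """Number of sequences at Hamming distance k >= d from a given one."""
    return sum(p_l(length, k) for k in range(d, length + 1))

if __name__ == "__main__":
    print(hamming_distance("ACCGTAC", "ACGGTAA"))  # differs at two positions
    for d in range(0, 8):
        print(d, n_d(7, d))
```

Summing $P_L(k)$ over all k from 0 to L recovers $4^L$, which is a quick sanity check on the counting formula.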
The maximum independent set problem is NP-hard. No efficient general exact algorithm is known; however, approximations exist. 35,36 Finding a maximal independent set (MIS) in N(d) is a problem related to graph theory. 36,37 A graph consists of vertices, represented by red circles in Figure 3a. Two vertices are called adjacent if they are connected by an edge (blue line). We represent the probes by vertices. If two sequences hybridize to each other, we connect them by an edge (Figure 3b). An independent set is a subset of vertices in which no two are adjacent. If no further sequence can be added to the set without destroying its independence, the set is called a maximal independent set (MIS). The largest of all maximal independent sets is the maximum independent set. Here, MIS corresponds to the largest number of independent oligonucleotides that can be found.

ACS Omega
Article

For our approach, we create an adjacency matrix for a given L and d, where the numbers of rows and columns correspond to the number of sequences; thus, it is a $4^L \times 4^L$ square matrix (Figure 3c). If the Hamming distance between sequences i and j is less than d, they cross-hybridize, that is, they are connected by an edge. In this case, the corresponding entry of the adjacency matrix is 1; otherwise, it is 0. We apply a constructive local search algorithm 38,39 that iteratively adds orthogonal sequences to an existing set until the available sequences are depleted. To identify the orthogonal sequences, the algorithm employs the adjacency matrix constructed beforehand. The algorithm is restricted, as it does not try all combinations of sequences. Therefore, it does not necessarily find the maximum independent set but proposes many maximal independent sets instead. We consider the largest set among them as an approximate solution to the exact size of the maximum independent set. All obtained set sizes are within the known Singleton and Gilbert−Varshamov 28,40 bounds and are summarized in Tables S2 and S3 along with a comparison to literature values. The size of the adjacency matrix increases exponentially with the sequence length, which requires a large amount of memory. Therefore, we are limited to short sequences with L ≤ 7. Figure 4 illustrates the possible sizes of different independent sets for L = 7 and d = 5 before and after discarding 4G and 4C sequences. The MIS size M(L, d) in both cases is 23. Removing these sequences for L ≤ 7 does not change the size of the MIS in most cases (refer to Table S2).
However, for longer lengths, the fraction of 4G sequences rises, and we expect that discarding such sequences reduces the size of a MIS (Figure 1). The algorithm creates independent sets based on the pool of available sequences. Removing all sequences containing 4C and 4G changes this pool. Therefore, we obtain different independent sets (blue columns in Figure 4) compared with the cases where we did not discard these sequences (red columns). No significant trend toward smaller or larger set sizes upon removing 4C and 4G sequences can be identified.
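The constructive search described above can be sketched as a randomized greedy procedure. This is an illustrative stand-in under stated assumptions, not the algorithm of refs 38,39: the function names, the random visiting order, and the number of trials are hypothetical choices. Each trial visits the candidate sequences in random order and keeps a sequence only if it has no edge to any sequence already chosen, so every trial ends with a maximal (not necessarily maximum) independent set.

```python
import random
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def candidate_pool(length: int, drop_4g_4c: bool = True) -> list:
    """All sequences of the given length, optionally without 4G/4C runs."""
    seqs = ["".join(p) for p in product("ACGT", repeat=length)]
    if drop_4g_4c:
        seqs = [s for s in seqs if "GGGG" not in s and "CCCC" not in s]
    return seqs

def build_adjacency(seqs: list, d: int) -> list:
    """adj[i][j] is True if sequences i and j differ in fewer than d bases."""
    n = len(seqs)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(seqs[i], seqs[j]) < d:
                adj[i][j] = adj[j][i] = True
    return adj

def greedy_independent_set(adj: list, rng: random.Random) -> list:
    """One constructive pass: add vertices in random order if still orthogonal."""
    order = list(range(len(adj)))
    rng.shuffle(order)
    chosen = []
    for i in order:
        if not any(adj[i][j] for j in chosen):
            chosen.append(i)
    return chosen

def best_independent_set(seqs: list, d: int, trials: int = 50, seed: int = 0) -> list:
    """Repeat the greedy pass and keep the largest maximal independent set."""
    adj = build_adjacency(seqs, d)
    rng = random.Random(seed)
    best = []
    for _ in range(trials):
        candidate = greedy_independent_set(adj, rng)
        if len(candidate) > len(best):
            best = candidate
    return [seqs[i] for i in best]

if __name__ == "__main__":
    # A short length keeps the pure-Python adjacency matrix small.
    pool = candidate_pool(4)
    orthogonal = best_independent_set(pool, d=3)
    print(len(orthogonal), orthogonal)
```

The example uses L = 4 and d = 3 only to keep the runtime short. For L = 7 the dense matrix has $4^7 \times 4^7$ entries, which illustrates the memory limitation mentioned in the text; a compact array representation would be needed in practice.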
Experimental Results. A suitable minimum Hamming distance d must be determined experimentally. Because the longest sequences studied with our algorithm are 7-mers, we design a microarray consisting of oligonucleotides of length 7 (plus four additional terminal bases, see Materials and Methods). We immobilize an arbitrary sequence that is complementary to a perfectly matching target (PM), together with some of its related mismatched sequences. To study the dependence of the hybridization probability on the positions of defects, we locate the mismatches at the ends or in the middle of the sequence, or distribute them uniformly. Hybridizing the PM target on the microarray yields the results shown in Figure 5. Each feature block, as depicted in Figure 5a−d, corresponds to a set of sequences with one to four mismatches MM1−MM4, respectively. They are all surrounded by a frame of PMs. Each sequence appears 8 times within a feature block. To obtain better statistics, the hybridization intensities from all sequences are averaged, and their standard deviations σ are calculated. Then, all intensities are normalized relative to the average PM intensity on the microarray. The PM and mismatched sequences are all subject to the same constant synthesis error rate (see Materials and Methods), which leads to an overall loss of hybridization intensity. For the results presented in the following, only the relative intensity is of importance, and it is not affected by this loss. Fluorescence intensity variations are due to inhomogeneities of the microarray surface, fluorescent stains in the feature blocks, or illumination gradients during synthesis. 9 For all MM ≥ 4, we detect no hybridization intensity other than that of the PM (not shown). Figure 6 presents the normalized fluorescence intensities of hybridization for a sequence with one mismatch as a function of the defect position. The intensity for sequences with a single mismatch in the middle is smaller because defects in the middle of the duplex increase the base pair opening probability and destabilize the duplex. This result agrees well with previously reported work. 10,41 We assume all eight fluorescence intensities of one probe, measured at different positions on the microarray, to be normally distributed and described by a standard deviation σ. To discriminate the PM binding intensity from all other nonspecific binding, the normal distributions of their hybridization intensities must be well separated. Figure 7 shows the distributions of the fluorescence intensities of the PM and of the sequences that exhibit the highest cross-hybridization intensities $I_{MM,max}$ for MM1−MM3. The normal distributions are based on a statistical analysis of the microarrays shown in Figure 5. The peak centers in Figure 7 correspond to the average values of the fluorescence intensities and their widths to the standard deviations (shown in Table 1).
In DNA microarrays, the binding affinities can vary widely, depending on the precise sequence and its concentration, 41 that is, fluorescence intensities of perfectly matched sequences span a large range. To illustrate this, we determine the hybridization free energy of the sample sequence 3′-CTATATATATC-5′ binding to its PM using the NUPACK software 42 and the corresponding expected fluorescence intensity using the Langmuir isotherm. 9 As this sequence does not contain any G or C bases within the seven core bases, its fluorescence intensity is among the lowest of all possible sequences. In fact, we find that it has just 16.5% of the fluorescence intensity obtained by the same procedure for the PM sequence 3′-CTACCGTACTC-5′ used on the microarray shown in Figure 5. Accordingly, it should be expected that some perfectly matched but weakly binding sequences will have lower hybridization intensities than the 27% signal of $I_{MM,max}$ for three mismatches. This clearly shows that a minimum Hamming distance of d = 3 cannot be used for a reliable discrimination between PM and MM hybridization. Therefore, we investigate sets with d ≥ 4 in subsequent experiments. Table 1 shows the sequences and their intensities as well as the corresponding standard deviations for each mismatch.
To test sets with d ≥ 4, we first design a microarray consisting of 23 sequences (see Table S1) as predicted by our algorithm, corresponding to d = 5 (compare Figure 4). To verify its independence, we record the hybridization intensities of three arbitrarily chosen PM targets of this set simultaneously. Figure 8a shows the measured normalized hybridization intensities $I_{seq}$ in a bar plot after background subtraction. It can be clearly seen that the PM targets, which are present in solution, hybridize to their corresponding complementary probes only (green bars). By using the highest hybridization intensity as a reference, the other hybridized PM sequences reach 24 and 31% of that level. On the other hand, the measured hybridization intensities of all other probes (blue bars) scatter with σ = 0.3% around their average value of zero, which can be attributed to the background fluorescence noise. Negative values correspond to intensities below the average background level. The intensities of the probes whose PM targets are not present in solution stay well below 2% within a large confidence interval (5σ environment). To cross-check that sets with d ≤ 4 are not independent, we synthesize another microarray including 83 sequences with d = 4. Hybridization of one PM leads to cross-hybridization of 11 other probes that rise above 2%, as can be seen from the red bars in Figure 8b. This underlines that d < 5 is not sufficient to achieve independence.

Figure 6. Normalized hybridization intensity for the sequences with one mismatch as a function of their mismatch position. The intensity for sequences including a single mismatch in the middle is smaller than for a MM located at the end.

Figure 7. Normal distributions of the PM and MM1−MM3 hybridization intensities, assuming a normal distribution with average intensity (peak centers) and standard deviation σ as given in Table 1. Even the average cross-hybridization intensity of $I_{MM3,max}$ = 27% is too high for accurate discrimination of PM binding and unwanted cross-hybridization (compare main text).

■ CONCLUSION
In this work, we experimentally determined a minimum Hamming distance d between DNA oligonucleotides. Sequences that differ pairwise by at least d can make up an orthogonal set, which means they do not cross-hybridize. By applying a local search algorithm, we found orthogonal sets for different L and d. For the length of 7, we determined a MIS with a size of 23 and experimentally confirmed its orthogonality with an appropriate minimum distance of 5. The small set size of 23 compared with $4^7$ possible sequences arises from the minimum Hamming distance of 5. Optically directed synthesis introduces errors into sequences. 43−46 Single-nucleotide polymorphism detection in bulk has been achieved, albeit with higher synthesis fidelity and optimized experimental conditions. 47 Moreover, d can be reduced by increasing the temperature to reduce nonspecific binding, which can improve the discrimination among the sequences of a set. 47 For longer sequence lengths, higher temperatures are particularly important to increase the number of complementary bases that enable binding. 48 At a given concentration, the discrimination increases near the melting temperature.
In the course of our experiments, we found a minimum Hamming distance of five for a sequence length of 11 (7 core bases and four extra terminal ones), in good agreement with the reported discrimination level of d ≈ L/2. 18,19 Our set size, on the other hand, does not gain from the four additional bases. When our algorithm is extended to longer sequences, these extra bases become redundant, and we expect that d ≈ L/2 will remain applicable. With the same d, larger lengths lead to larger set sizes than we have determined here.
We also derived an analytical expression to calculate the number of 4G sequences. As we have shown, eliminating these sequences for short lengths does not change the size of the MIS in most cases. However, we anticipate an impact for longer sequence lengths, as the fraction of sequences containing 4G structures increases. Although we could show how to avoid cross-hybridization in our synthesis microarray, we cannot easily transfer it to real-world microarray applications such as those developed by Affymetrix. Following the protocol for expression studies, Affymetrix targets are very long compared with their surface-bound probes. Such sequence lengths introduce a large variety of conformations. Therefore, in expression studies one should consider additional effects such as the brush effect 49 and the surface density of probes. 7,50

■ MATERIALS AND METHODS
DNA Microarray Hybridization Experiment. The light-directed in situ synthesis method and some of the analysis software were described previously. 4,41,51 We use in-house synthesized DNA microarrays. Probes on a microarray are tethered to the surface by their 3′-end. To increase the hybridization probability at the given temperature, we extended all sequences by adding four bases, CT at the 3′ end and TC at the 5′ end. The microarray synthesis used in our work has a stepwise coupling efficiency of ≥99%. For a sequence length of 7, this leads to an estimated 93% of probes free of any synthesis defects. 52 The remaining 7% have mostly one defect. The targets are prepared at a concentration of 25 nM in a 5× SSPE buffer solution. Their terminus is labeled with a Cy3 fluorescent dye. Hybridization is performed in equilibrium with the buffer in a chamber designed for that purpose. We use a UPlanApo 10× 0.40 NA objective for observation. Figure 9 shows the image of a hybridized microarray as obtained after 100 s exposure time. The particular probe sequence species are restricted to small areas commonly called features.
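The estimate of defect-free probes quoted above can be reproduced with a short calculation. This is a sketch under the assumption that each coupling step fails independently with the same probability (a simple binomial model); the function names are hypothetical, and the value 0.99 is the lower bound of the stated coupling efficiency.

```python
from math import comb

def defect_free_fraction(coupling_efficiency: float, n_steps: int) -> float:
    """Fraction of probes for which every coupling step succeeded."""
    return coupling_efficiency ** n_steps

def k_defect_fraction(coupling_efficiency: float, n_steps: int, k: int) -> float:
    """Fraction of probes with exactly k failed steps (binomial model)."""
    p_fail = 1.0 - coupling_efficiency
    return comb(n_steps, k) * p_fail ** k * coupling_efficiency ** (n_steps - k)

if __name__ == "__main__":
    print(f"defect-free: {defect_free_fraction(0.99, 7):.3f}")
    print(f"one defect:  {k_defect_fraction(0.99, 7, 1):.3f}")
```

With a 99% coupling efficiency and 7 steps, the defect-free fraction is about 0.932, and most of the remainder carries exactly one defect, in line with the 93% and 7% figures in the text.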
To determine the amount of targets bound to a probe, we measure the fluorescence intensities (hybridization intensity) by taking images of the DNA microarray surfaces with an electron-multiplying CCD camera (EM-CCD C9100-02, Hamamatsu). We correct for background fluorescence originating from the unhybridized targets in the buffer by subtraction. Microarray pictures shown in the Experimental Results are computationally reconstructed by using these intensities; for example, Figure 5a is produced from Figure 9. The hybridization temperature is 32 °C.

Figure 8. Two microarrays consisting of sequences with two different minimum Hamming distances: (a) independent set with d = 5 and (b) set with d = 4. In both cases, the green bars represent the probes whose PM targets are present in solution. The blue bars correspond to the hybridization intensities of sequences with $I_{seq}$ ≤ 2%. Red bars represent the cross-hybridized sequences with $I_{seq}$ > 2%.