Beach Bug Bingo: Toward Better Prediction of Swimming-Related Health Effects

Swimming is a popular pastime in the United States. The 2000–2002 National Survey on Recreation and the Environment reported that each year an estimated 89 million Americans swim in recreational waters including lakes, oceans, streams, rivers, and ponds. But swimming waters may also be contaminated by human sewage from treatment plants and runoff, raising the risk of gastrointestinal (GI) illness in swimmers. The recommended test for measuring contamination requires culturing fecal indicator bacteria, which means that beach managers must wait 24 hours for results. This built-in delay is problematic, potentially exposing swimmers to unhealthy water quality and sometimes resulting in unnecessary beach closures. Now a team of federal researchers has shown that a rapid method for measuring water quality can accurately predict swimming-related health effects [EHP 114:24–28]. 
 
The researchers conducted health surveys of beachgoers at two public beaches, one on Lake Michigan and one on Lake Erie, and compared the responses with thrice-daily water quality measurements taken along transects at the beaches. They evaluated water quality using quantitative polymerase chain reaction (QPCR), a modified version of the polymerase chain reaction method, to quantify indicator bacteria in water samples. The advantage of this method is that it can provide results in two hours or less. The researchers chose Enterococci and Bacteroides as their indicator organisms. 
 
Survey participants were interviewed as they left the beach; follow-up interviews were conducted by telephone 10 to 12 days after the beach visit. When researchers compared results of the water quality tests to participant reports of GI and other illnesses, they found a significant trend between increased reports of GI illnesses and Enterococci at the Lake Michigan beach and a positive, though statistically nonsignificant, trend for Enterococci at the Lake Erie beach. Bacteroides did not prove to be as powerful in predicting illness, with a nonsignificant positive trend found only at the Lake Erie beach and no trend at the Lake Michigan beach. 
 
When results from the two beaches were combined, the trend for Enterococci and GI illness remained statistically significant, a finding that held true even when samples collected at 8:00 a.m. were compared to daily averages. Beach managers could thus test early-morning samples to assess water quality and, if necessary, close beaches before the majority of swimmers were exposed. 
 
In spite of the promising nature of the findings, the authors caution that much research remains to be done before the results can be generalized. One of the key remaining questions relates to the method itself: QPCR relies solely on the presence of DNA to quantify organisms, so pathogens are detected even if they are dead and thus harmless. QPCR may therefore suggest a problem with the water when in fact there is none. The authors say additional studies should help determine if the approach is robust enough to be used in water quality regulations.


Background
Haplotype data on dense markers contain local linkage disequilibrium information on historical recombination and mutation events, and the knowledge of haplotype structure has led to a growing belief that haplotypes may hold the key to understanding and identifying genetic variants underlying complex traits [1]. The availability of thousands or even millions of single nucleotide polymorphisms (SNPs) on the human genome requires systematic analysis that addresses both optimal modeling and computational efficiency. Haplotype sharing methods have shown promising results in gene mapping analyses in complex settings [2][3][4][5][6]. To analyze the SNP data provided by the Collaborative Study of the Genetics of Alcoholism (COGA), we implemented an algorithm for haplotype reconstruction under the criteria of minimum recombinants and coalescent tree, and performed haplotype-based association analysis by the haplotype-sharing correlation (HSC) method [6,7]. The purpose of this paper is to evaluate the feasibility of our haplotype reconstruction algorithm and the HSC method when applied to nuclear family data with a limited amount of missing genotypes.

Data
The original COGA data contained 143 families, with an average family size of 11.2 ± 5.4 members, of whom 9.3 ± 4.3 had SNP genotype data. To evaluate the feasibility of haplotype reconstruction and HSC analysis, we chose to analyze a dataset on chromosomes 1-6 in all 93 nuclear families with genotype data for both parents and at least 3 offspring. These nuclear families had an average family size of 6.6 ± 1.7 (range, 4 to 14), and contained a low proportion (0.1%) of missing SNP genotypes. The phenotype variable to be analyzed was DSM-IV alcohol dependence, which was coded as ordered values of 1 for "pure unaffected", 2 for "never drank", 3 for "unaffected with some symptoms", and 5 for "affected", and was treated as a continuous variable in HSC analysis.

Haplotype reconstruction
Haplotypes in nuclear families were reconstructed in 2 steps using a search algorithm under the criteria of minimum recombinants and coalescent tree. In step 1, all possible minimum recombinant haplotype configurations (MRHCs) were reconstructed within each family under the criteria of minimum recombinants [8]. The number of possible MRHCs in each family depends on both the family size and the transmission process of haplotypes, and some nuclear families may have more than 100 MRHCs that are consistent with the observed genotype data.
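The minimum-recombinant criterion in step 1 can be illustrated with a small sketch. This is not the authors' search algorithm, which works from unphased genotypes; it assumes the phased haplotypes of one candidate configuration are already given and only counts the recombinants that configuration implies. The function names (`min_recombinants`, `total_recombinants`, `minimum_configurations`) and the data layout are hypothetical.

```python
INF = float("inf")

def min_recombinants(parent_haps, child_hap):
    """Minimum number of switches between a parent's two haplotypes
    needed to produce the haplotype transmitted to one offspring.
    Alleles are compared marker by marker; None marks a missing allele.
    Returns None if the transmitted haplotype cannot be explained."""
    cost = [0, 0]  # cost[s]: fewest switches so far, ending on parental haplotype s
    for m, allele in enumerate(child_hap):
        ok = [allele is None or parent_haps[s][m] == allele for s in (0, 1)]
        # Either stay on haplotype s, or switch from the other one (+1 recombinant).
        cost = [min(cost[s], cost[1 - s] + 1) if ok[s] else INF for s in (0, 1)]
    best = min(cost)
    return None if best == INF else best

def total_recombinants(configuration):
    """Sum recombinants over all (parent_haps, transmitted_hap) pairs in
    one candidate haplotype configuration of a family (assumed layout)."""
    total = 0
    for parent_haps, child_hap in configuration:
        r = min_recombinants(parent_haps, child_hap)
        if r is None:
            return None  # configuration inconsistent with transmission
        total += r
    return total

def minimum_configurations(configurations):
    """Keep every consistent configuration that attains the minimum total."""
    scored = [(total_recombinants(c), c) for c in configurations]
    scored = [(r, c) for r, c in scored if r is not None]
    if not scored:
        return []
    best = min(r for r, _ in scored)
    return [c for r, c in scored if r == best]
```

Because several configurations can tie at the minimum, `minimum_configurations` returns a list rather than a single answer, which mirrors the observation that a family may have many MRHCs.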
In step 2, each MRHC in each nuclear family was evaluated by fitting the combination of its founder haplotypes and all founder haplotypes in other families to a coalescent tree structure, where the founder haplotypes refer to the 4 parental haplotypes in each family. The MRHC corresponding to a coalescent tree with minimum tree distance was selected as the optimal solution of haplotypes. The tree distance for a set of haplotypes is computed as follows. First, the sharing in each pair of haplotypes is quantified as the number of identical-by-state intervals summed over all markers, and the distance between 2 haplotypes is defined as the observed sharing subtracted from the maximum possible sharing. Second, the single haplotype showing the minimum summed distance to all other haplotypes is chosen as the ancestral haplotype. Third, all haplotypes are connected one by one, starting from the ancestral haplotype, using a minimum spanning tree algorithm [9], and the tree distance is defined as the minimum distance that connects all the haplotypes.
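The three-part tree-distance computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: an interval between two adjacent markers is counted as identical by state when the alleles match at both flanking markers, the maximum possible sharing is taken as the number of intervals, missing alleles (`None`) share with nothing, and Prim's algorithm stands in for the minimum spanning tree algorithm cited as [9]. All function names are hypothetical.

```python
def sharing(h1, h2):
    """Number of adjacent-marker intervals that are identical by state
    (assumption: both flanking markers must match and be non-missing)."""
    match = [a is not None and a == b for a, b in zip(h1, h2)]
    return sum(1 for i in range(len(match) - 1) if match[i] and match[i + 1])

def distance(h1, h2):
    """Maximum possible sharing minus observed sharing."""
    return (len(h1) - 1) - sharing(h1, h2)

def tree_distance(haplotypes):
    """Total edge length of a minimum spanning tree grown from the
    'ancestral' haplotype (the one closest to all others in total)."""
    n = len(haplotypes)
    if n < 2:
        return 0
    d = [[distance(haplotypes[i], haplotypes[j]) for j in range(n)]
         for i in range(n)]
    # Ancestral haplotype: minimum summed distance to all other haplotypes.
    root = min(range(n), key=lambda i: sum(d[i][j] for j in range(n) if j != i))
    # Prim's algorithm: attach the nearest unconnected haplotype each round.
    best = {j: d[root][j] for j in range(n) if j != root}
    total = 0
    while best:
        j = min(best, key=best.get)
        total += best.pop(j)
        for k in best:
            best[k] = min(best[k], d[j][k])
    return total

def select_mrhc(candidate_founder_sets, other_founders):
    """Pick the candidate whose founder haplotypes, pooled with the
    founders of the other families, give the smallest tree distance."""
    return min(candidate_founder_sets,
               key=lambda f: tree_distance(list(f) + list(other_founders)))
```

The total weight of a minimum spanning tree does not depend on the starting node, so the ancestral haplotype affects the order in which haplotypes are connected but not the resulting tree distance in this sketch.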

Haplotype-sharing correlation
The HSC method evaluates the correlation between phenotype similarity and haplotype sharing at each marker m in all pairs of pedigree founder haplotypes [6,7]. The HSC statistic can be written as

Figure 1. HSC analysis of DSM-IV alcohol dependence on chromosomes 1-6 in 93 nuclear families.
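The idea of correlating phenotype similarity with haplotype sharing at a marker can be illustrated with a rough sketch. It is not the HSC statistic of [6,7], whose exact form is not reproduced here. The sketch assumes that phenotype similarity for a pair is the product of mean-centered phenotypes of the two carriers, that sharing at marker m is the length of the contiguous identical-by-state run containing m, and that the two are summarized by a plain Pearson correlation over all haplotype pairs. The names `shared_length` and `hsc_like` are hypothetical.

```python
from itertools import combinations
from math import sqrt

def shared_length(h1, h2, m):
    """Number of consecutive markers around m (inclusive) at which the two
    haplotypes are identical by state; 0 if they differ or are missing at m."""
    def same(i):
        return h1[i] is not None and h1[i] == h2[i]
    if not same(m):
        return 0
    left = right = m
    while left > 0 and same(left - 1):
        left -= 1
    while right < len(h1) - 1 and same(right + 1):
        right += 1
    return right - left + 1

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def hsc_like(haplotypes, phenotypes, m):
    """Correlation between pairwise phenotype similarity and pairwise
    sharing at marker m. phenotypes[i] is the trait value of the person
    carrying haplotypes[i] (assumed layout)."""
    mean = sum(phenotypes) / len(phenotypes)
    sims, shares = [], []
    for i, j in combinations(range(len(haplotypes)), 2):
        sims.append((phenotypes[i] - mean) * (phenotypes[j] - mean))
        shares.append(shared_length(haplotypes[i], haplotypes[j], m))
    return pearson(sims, shares)
```

In a setting like this one, significance would typically be assessed empirically, for example by permuting phenotypes, but that step is outside the scope of the sketch.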

Results
On average, we were able to reconstruct haplotypes at all markers on a whole chromosome in 98.2% of the nuclear families. For the other 1.8%, haplotype phases at fewer than 1% of loci could not be inferred with certainty conditional on the criterion of minimum recombinants, and those loci were treated as missing in the reconstructed haplotypes. A haplotype at a missing locus was considered to have no sharing with any other non-missing haplotypes.
In an HSC analysis on chromosomes 1-6 in 93 nuclear families, three markers, on chromosomes 3, 4, and 6, respectively, were found to have significant associations with DSM-IV alcohol dependence that exceeded the 0.05 level of chromosome-wide significance (Fig. 1). Marker rs1631833 at 109.1 cM on chromosome 4 had the strongest haplotype association among the 6 chromosomes analyzed (p = 0.008). Marker rs895941 at 36.7 cM on chromosome 3 and marker rs953887 at 74.2 cM on chromosome 6 were the other two markers that showed significant haplotype association (p = 0.03 and p = 0.02, respectively).

Discussion
We have developed a 2-step algorithm for haplotype reconstruction in nuclear families that avoids the assumption of linkage equilibrium by minimizing the recombinants in within-family haplotype transmissions and fitting all parental haplotypes under a coalescent tree structure. We chose to analyze nuclear families with a large number of offspring mainly for feasibility in testing the algorithm. When SNP data on chromosomes 1-6 were analyzed, haplotypes at less than 0.1% of loci in 1.8% of nuclear families could not be inferred with certainty. One possible reason for the failure of haplotype reconstruction in some nuclear families is the uncertainty in counting the number of recombinants in the presence of missing genotypes. We are currently investigating the failures and alternative approaches in order to improve the haplotyping performance in the presence of missing genotypes.
The HSC method evaluates the correlation between phenotype similarity and haplotype sharing at each study marker in all pairs of pedigree founder haplotypes. When applied to the COGA data on chromosomes 1-6, 3 markers were found to have significant haplotype associations with DSM-IV alcohol dependence. The most significant signal, at 109.1 cM on chromosome 4, was consistent with the strong linkage signal found in the same region using the maximum number of drinks ever consumed in a 24-hour period as an alcoholism phenotype [10]. On a different note, the HSC method is not designed to control for population stratification, although empirical results have indicated its robustness against allele heterogeneity when compared to allelic and haplotypic family-based association tests [7]. Additionally, the HSC analysis does not consider within-family phenotypic correlations, and this omission may have an adverse effect on detecting true associations.
Both the haplotype reconstruction and the HSC methods employed in this study have potential applications for haplotype-association studies under both family-based and case-control designs. To improve the mapping of susceptibility regions associated with complex traits, clustering approaches, such as those described by Yu et al. [11], may be employed in both haplotype reconstruction and haplotype association analyses. With clustering analysis, the plausibility of a candidate haplotype pair will be evaluated not against all existing haplotypes but only against those believed to have the same ancestral origin. By the same token, clustering analysis will also increase the power of association analysis by reducing the ancestral heterogeneity in haplotypes associated with the same or similar phe-