Using reproductive effect markers to observe subclinical events, reduce misclassification, and explore mechanism.

Biological markers of effect, in general less widely available than exposure markers, do exist in the field of reproduction and increasingly are being used in epidemiological studies. Several such markers, including semen quality, menstrual hormones, early pregnancy loss, and placental abnormalities, are cited as examples. We argue for the value of effect markers for detecting subclinical events that are critical for reproductive performance. Such studies can extend knowledge of the true frequency and determinants of reproductive disorders. A second portion of the paper deals with the role of effect markers in reducing disease misclassification. With a hypothetical early pregnancy study as a case in point, we illustrate the degree and direction of bias associated with several different protocols and encourage epidemiologists to weigh these quantitative considerations in deciding on study design. Finally, we discuss uses of biological markers to explore mechanisms, drawing on experience in an ongoing reproductive study that is testing a hypothesized pathway from maternal psychosocial stress to reduced fetal growth, using urine catecholamine levels as a physiological marker of exposure and placental vascular abnormalities as a marker of effect.


Introduction
Biological markers, if they are well chosen, can be important aids for discovering, describing, and interpreting associations between exposure and disease. The field of reproductive epidemiology is fortunate to have available a number of biological effect markers. Table 1 gives selected examples. The term "effect marker" is used here to mean any change indicative of a problem (a correlate, a precursor, an occult event). For some of the markers in Table 1, like maternal serum alpha-fetoprotein screening or computer-assisted semen analysis, there is a fairly broad base of experience in population studies. For other markers, like menstrual and pregnancy hormones, use in field settings is only just beginning. We discuss past and future applications of biological markers for adverse reproductive effects in the context of three topics: observing subclinical events, classifying outcome accurately and exactly, and investigating pathogenesis.

Observing Subclinical Events
The events surrounding fertilization and early pregnancy, while crucial to reproductive performance, are largely unobservable. The more interest there is in research on fertility, the more important are methods for measuring reproductive potential in

Semen Analysis
Clinical tools for evaluating testicular function in males were first applied to population studies over a decade ago (1). Taking a technique based on semen samples into the field presented serious challenges, and research protocols have evolved considerably over time. Requirements for collecting and transporting semen specimens are now better defined and more uniform across studies. Normal values and ranges for measured sperm parameters have been established. Technological improvements such as computerized automation have extended semen analysis capabilities to more sites and have helped to standardize analytic methods (2). One investigator has even applied sophisticated in vitro bioassays of sperm fertilization capacity in field research, using a new method for preserving and shipping specimens to an offsite laboratory (3). As a marker, semen analysis has proved useful for assessing the reproductive toxicity of a variety of exposures, particularly workplace agents (4). While the quantitative relations of semen quality to couple fertility are not firmly established, a qualitative association has been demonstrated (5), and artificial insemination programs select their donors accordingly, for high sperm count and function. Thus, despite some uncertainty about the clinical significance of observed changes in specific sperm parameters, semen quality is generally considered a useful marker.

Menstrual Disorders
Just as male reproductive potential is reflected in the inseminate, so female reproductive potential is indicated by the adequacy of the follicular, ovulatory, and luteal phases of the menstrual cycle. Fertility in females is clearly decreased when there is substantial menstrual cycle variability, as in the postpubertal (6) and perimenopausal (7) periods. But even regular menses can mask ovulatory disorders (8), luteal phase defects (9), or early pregnancy loss (10). Because of the marked fluctuations characteristic of the cycle, a detailed hormonal profile once required serial blood samples, which were feasible only in a clinical setting. With the advent of less invasive techniques that utilize urine (11) and saliva (12) instead of blood, it is now possible to contemplate inclusion of endocrine profiles in field studies (13).
Menstrual disorders are important in their own right, as well as in relation to fertility and to the risk of chronic diseases that are influenced by reproductive hormones. A research approach based on hormonal evaluation will more fully ascertain menstrual disorders and may indicate an underlying cause, but it demands collection of biological samples as often as daily. Some work of this kind has been done in the context of research on population control, in order to delimit the fertile period or document the return of fertility after lactation (14). Evaluation of menstrual hormones (luteinizing hormone, estrogen, progesterone) was included in a landmark study of early pregnancy loss by Allen Wilcox and colleagues (10), and has been proposed in conjunction with other early pregnancy studies currently in planning or in progress. Such work utilizing ovarian markers should contribute importantly to an understanding of female reproductive function.

Early Pregnancy and Pregnancy Loss
The rate of attrition among human conceptions is extraordinarily high, and most of the loss occurs prior to the expected onset of menses, when pregnancy might first be recognized (15).
One marker of pregnancy, the so-called fetal signal, is the production by the trophoblast of hCG (human chorionic gonadotropin), a glycoprotein hormone with a structure and function similar to hLH (human luteinizing hormone). If conception occurs, hCG can be detected as early as 7 to 8 days after ovulation, around the time of implantation. Although levels of hCG are similar in blood and urine, only recently have urinary assays approached the sensitivity and specificity of tests on serum. With this development, the epidemiology of early pregnancy and early loss can begin to be explored on the basis of the rise and fall in hCG across serial urine samples.
The initial field studies applying urinary hCG as a marker of pregnancy produced widely divergent estimates of the frequency of clinically inapparent fetal loss (16)(17)(18), at least in part because the hCG assays used were insensitive and/or cross-reactive with hLH. More recent work by Wilcox at the National Institute of Environmental Health Sciences, in collaboration with researchers at Columbia University who developed a highly sensitive and specific hCG assay (19), set a new standard for early pregnancy studies. Analyses of urines collected daily from over 200 healthy volunteers documented early inapparent losses in approximately 22% of hCG-detected conceptions. The total rate of loss, including recognized miscarriage, was 31% (10). Now that the ground has been laid, future work can examine the causes of occult pregnancy loss and the role it plays in conception delay and clinical infertility.
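The rise-and-fall logic described above can be sketched as a simple rule applied to one cycle of daily urinary hCG values. The threshold, the three-consecutive-day run length, and the use of the expected menses day as the cutoff for an "early" loss are illustrative assumptions for this sketch, not the criteria of the studies cited, and the function names are hypothetical.

```python
from typing import Optional, Sequence

# Illustrative parameters only; NOT the criteria used in the studies cited.
THRESHOLD = 0.025   # hCG level treated as "elevated" (assumed units: ng/mL)
MIN_RUN = 3         # consecutive elevated daily samples required to call a conception


def first_sustained_rise(hcg: Sequence[float], threshold: float = THRESHOLD,
                         min_run: int = MIN_RUN) -> Optional[int]:
    """Index of the first day of the first run of >= min_run elevated samples, or None."""
    run = 0
    for day, level in enumerate(hcg):
        run = run + 1 if level > threshold else 0
        if run == min_run:
            return day - min_run + 1
    return None


def classify_cycle(hcg: Sequence[float], expected_menses_day: int,
                   threshold: float = THRESHOLD) -> str:
    """Label one cycle of daily urinary hCG values (index 0 = first sample)."""
    start = first_sustained_rise(hcg, threshold)
    if start is None:
        return "no conception"       # no sustained rise; one-day blips are ignored
    last_day = min(expected_menses_day, len(hcg) - 1)
    for day in range(start, last_day + 1):
        if hcg[day] <= threshold:
            return "early loss"      # hCG fell back before menses was expected
    return "ongoing"                 # still elevated through expected menses (or end of series)


baseline = [0.01] * 5
print(classify_cycle(baseline + [0.05, 0.10, 0.20, 0.10, 0.02, 0.01], 10))
print(classify_cycle(baseline + [0.05, 0.10, 0.20, 0.40, 0.80, 1.60], 10))
print(classify_cycle(baseline + [0.05, 0.01, 0.01], 10))
```

The three calls print "early loss", "ongoing", and "no conception". Requiring a run of elevated days, rather than a single elevated sample, is one way a rule of this kind guards against labeling a transient blip as a conception; that guard depends on having a sample every day.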

Logistical Issues
There is tremendous interest in studying early pregnancy, but logistical issues loom large for epidemiologists. The collection of daily urines, whether for measuring hCG or menstrual hormones, is a formidable task. Inevitably, there is a tension between ideal protocols (ideal in terms of the data one would wish to collect) and a protocol that is acceptable to potential study subjects. Acceptability is a limiting factor because nonparticipation (whether through initial refusals or attrition) will almost always introduce a selection bias, the extent and direction of which can be hard to evaluate except on a judgmental basis (20). To avoid poor response rates and the resulting threat of selection bias, several strategies for early pregnancy research have been proposed. These approaches can be applied to any reproductive research using markers that require serial samples.
One strategy has been to restrict attention to highly motivated subgroups, for example, infertility patients or women planning a pregnancy who volunteer their participation. This latter group was 98% compliant with a protocol requiring daily urines for up to 6 months (21). The limitation of the approach is that for some questions the results obtained in selected subgroups may not apply to the general population. An alternative is to recruit from a wider population base but to use tests or sample collection strategies that impose fewer demands on study subjects and should therefore be compatible with good participation rates. The risk is that using markers in this way may compromise their sensitivity or specificity. Whether to use a biological effect marker at all, and if so, how to use it, can usually be decided by weighing the disease misclassification that each option would introduce.

Classifying Outcome Accurately and Exactly
That nondifferential misclassification of subjects by exposure attenuates estimates of effect is widely recognized (22). Tabular or graphical data are available to epidemiologists for quantifying the bias introduced by different rates and types of exposure misclassification (20,23,24). The prospect of remediating this bias is a compelling and frequently cited reason for interest in biomarkers of exposure.
Misclassification of disease, as opposed to exposure, is a topic discussed less often. Although some work has been done (25,26), we could locate no quantitative data on how misclassification of disease affects measures of association. Because such figures are essential to evaluating the costs of disease misclassification, we developed a set of tables describing how error in measuring outcome will bias effect estimates. The tables (available upon request) were generated using Kleinbaum, Kupper, and Morgenstern's general equations for misclassification (26). Estimates of the degree and direction of bias were developed for both nondifferential and differential cases and for case-control and cohort designs. Here we draw on this material as a framework for discussing the use of markers of early pregnancy and early loss. In the discussion, sensitivity and specificity refer to the accuracy of disease classification compared to a diagnostic gold standard, which in this application is the most accurate biological marker currently available. In general, for nondifferential misclassification of disease, as for exposure, poor specificity will cause greater attenuation of risk estimates than poor sensitivity. Specificity is decreased when the truly nondiseased are misclassified as diseased. This could arise, for example, with use of an hCG assay that cross-reacts with hLH. Even with a highly specific assay, there is a risk of this type of misclassification if urines are collected less often than daily (e.g., if the schedule is such that a transient peak in hCG could be interpreted as a sustained rise).
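The arithmetic behind this general statement can be made explicit. For a cohort design with exposure classified perfectly, let p_1 and p_0 be the true risks in the exposed and unexposed, and Se_i and Sp_i the sensitivity and specificity of outcome classification in group i. Counting true positives and false positives gives the standard expression below for the expected observed risk and risk ratio; it is offered as a sketch and is not necessarily the notation or full generality of the equations in (26).

$$
p^{*}_{i} = \mathrm{Se}_{i}\,p_{i} + (1-\mathrm{Sp}_{i})(1-p_{i}), \qquad i \in \{0, 1\}
$$

$$
\mathrm{RR}^{*} = \frac{p^{*}_{1}}{p^{*}_{0}} = \frac{\mathrm{Se}_{1}\,p_{1} + (1-\mathrm{Sp}_{1})(1-p_{1})}{\mathrm{Se}_{0}\,p_{0} + (1-\mathrm{Sp}_{0})(1-p_{0})}
$$

In the nondifferential case (Se_1 = Se_0, Sp_1 = Sp_0) with perfect specificity, the sensitivity cancels and RR* equals the true ratio p_1/p_0, so in this design lost sensitivity alone costs precision rather than validity. Imperfect specificity instead adds a false-positive term, (1 - Sp)(1 - p), to both groups. That term is of similar size in the two groups and is large relative to the true risk when disease is uncommon, so it dilutes the contrast and pulls the ratio toward 1.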
Take the hypothetical case of a prospective early pregnancy study. Assume an exposure prevalence of 20% (perfectly classified, for simplicity) and a 20% frequency of clinically inapparent loss in the unexposed. Using an assay with sensitivity of 99% for detecting loss after implantation and with nondifferential specificity of 80% (i.e., with a uniform 20% misclassification of truly nondiseased as diseased), a true relative risk (RR) of 2.00 would be attenuated to 1.44 and a true relative risk of 3.00 would be attenuated to 1.88. Thus, for compromises in protocol that threaten specificity, there is a fairly substantial bias toward the null hypothesis that increases with the size of the true effect and the rarity of disease. Now suppose that a specificity of 99% can be assured but that the choice of marker or of strategy for collecting biological samples leads to suboptimal sensitivity. Let us again consider the hypothetical early pregnancy study, but this time using an assay (e.g., a commercial pregnancy kit) that has a nondifferential false negative rate on the order of 20% (that is, sensitivity equal to 80%). For a true RR of 2, the estimated RR would be 1.94. For an RR of 3, the estimate would be 2.88. With a sensitivity on the order of 80% there is attenuation, but it is considerably less than in the previous example, where specificity was 80%. In real life, sensitivity of disease classification may vary much more than specificity (down to levels as low as 40%) and under these circumstances could produce considerable attenuation. Lower sensitivity always means ascertaining fewer events, so there will also be a loss in precision.
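The nondifferential figures above can be reproduced with a few lines of code. This is a minimal sketch assuming a cohort (risk-ratio) design with exposure classified perfectly, as in the example; under those assumptions exposure prevalence does not enter the expected risk ratio, so the 20% prevalence does not appear. It uses the standard expected-observed-risk expression (true positives plus false positives) as an illustration, not as the exact method behind the tables, and the function names are our own.

```python
def observed_risk(true_risk: float, se: float, sp: float) -> float:
    """Expected proportion classified as diseased: true positives plus false positives."""
    return se * true_risk + (1.0 - sp) * (1.0 - true_risk)


def observed_rr(risk_unexposed: float, true_rr: float, se: float, sp: float) -> float:
    """Expected risk ratio when outcome is misclassified nondifferentially.

    The same sensitivity (se) and specificity (sp) apply to exposed and unexposed;
    exposure itself is assumed to be classified without error.
    """
    risk_exposed = true_rr * risk_unexposed
    return observed_risk(risk_exposed, se, sp) / observed_risk(risk_unexposed, se, sp)


RISK_UNEXPOSED = 0.20  # 20% clinically inapparent loss in the unexposed, as in the text

for label, se, sp in [("poor specificity", 0.99, 0.80), ("poor sensitivity", 0.80, 0.99)]:
    for true_rr in (2.0, 3.0):
        est = observed_rr(RISK_UNEXPOSED, true_rr, se, sp)
        print(f"{label}: true RR {true_rr:.2f} -> observed RR {est:.2f}")

# poor specificity: true RR 2.00 -> observed RR 1.44
# poor specificity: true RR 3.00 -> observed RR 1.88
# poor sensitivity: true RR 2.00 -> observed RR 1.94
# poor sensitivity: true RR 3.00 -> observed RR 2.88
```

Lowering RISK_UNEXPOSED in the poor-specificity scenario (for example to 0.10) gives a smaller observed RR, which is the dependence on disease frequency noted in the text.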
The case of differential misclassification is more complex than the nondifferential case and hence difficult to summarize. An example will serve to illustrate the possible biases. Suppose an exposure under test preferentially causes early loss. We can estimate the bias that will occur if outcome is determined only on the basis of self-report or medical records of recognized miscarriage rather than with a biochemical assay. Assume a true risk of pregnancy loss in the unexposed of 30%, using estimates of loss based on urinary hCG as the gold standard (10). Then assume that clinical diagnosis of pregnancy loss has a specificity of 98% in both the exposed and unexposed. If exposure causes early loss preferentially, the classification of disease status will be less sensitive for exposed women than for unexposed women. A greater proportion of losses occurring in the exposed will go undetected because the exposure will have increased the incidence of clinically inapparent loss. Thus, sensitivity in the 50% range for the unexposed will be reduced in the exposed to 40%. A true RR of 2 would in this case be attenuated to 1.51. At a sensitivity of 30% in the exposed, the RR would be estimated as 1.15. Attenuation will be even greater if the true RR is larger, if specificity is lower, or if exposure is nondifferentially misclassified. To retain equivalent precision, sample size requirements will be about double what would be needed had loss been measured with urinary hCG.
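The differential example can be checked the same way. The sketch below uses the standard expected-observed-risk expression, letting sensitivity differ between exposed and unexposed while specificity is shared; the parameter values are those given in the text, and the function names are our own.

```python
def observed_risk(true_risk: float, se: float, sp: float) -> float:
    """Expected proportion classified as diseased: true positives plus false positives."""
    return se * true_risk + (1.0 - sp) * (1.0 - true_risk)


def differential_rr(risk_unexposed: float, true_rr: float,
                    se_exposed: float, se_unexposed: float, sp: float) -> float:
    """Expected risk ratio when sensitivity differs by exposure; specificity is shared."""
    risk_exposed = true_rr * risk_unexposed
    return (observed_risk(risk_exposed, se_exposed, sp)
            / observed_risk(risk_unexposed, se_unexposed, sp))


# Values from the text: 30% true loss in the unexposed, true RR 2,
# specificity 98% in both groups, sensitivity 50% in the unexposed.
for se_exposed in (0.40, 0.30):
    rr = differential_rr(0.30, 2.0, se_exposed, 0.50, 0.98)
    print(f"sensitivity in exposed {se_exposed:.0%}: observed RR {rr:.2f}")

# sensitivity in exposed 40%: observed RR 1.51
# sensitivity in exposed 30%: observed RR 1.15
```

Rerunning with a shared specificity of 0.90 instead of 0.98 gives a smaller observed RR, consistent with the statement that attenuation grows as specificity falls.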
Differential misclassification can operate in the other direction too. Exposed women may overreport pregnancy loss through a greater propensity to interpret menstrual irregularity as a miscarriage. This would increase sensitivity in the exposed. In addition, because the overreporting would generate more false positives, specificity would be decreased. These biases are in the same direction. But in some cases, biases can operate in opposite directions with an overall effect that is unpredictable in the absence of a biological marker or some other gold standard. When misclassification is differential, the bias can be severe enough to show an apparent protective effect when the true relative risk is 2, 3, or even 10. The usefulness and cost-effectiveness of a biological marker in improving both validity and precision of a study can often be judged semi-quantitatively when making decisions on study design. Biological markers can be used for all subjects in the study. Or, if cost or acceptability is prohibitive, markers can be used in a pilot study or on a random sample of the study population to estimate sensitivity and specificity of the less accurate outcome measure. Techniques for correcting bias using such estimates have been proposed (25,26). However, the estimates based on small samples may not be sufficiently precise to remove the bias completely. Nonetheless, the information obtained will help to indicate the direction and degree of bias.
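The back-calculation behind such corrections can be sketched by inverting the expected-observed-risk relation for each exposure group. This is a generic single-table correction (the same algebra as the Rogan-Gladen estimator for prevalence), not necessarily the procedure proposed in (25,26). The scenario reuses the differential example from the text; the "validation estimate" of 0.45 is an invented value used only to show how an imprecise estimate leaves residual bias.

```python
def observed_risk(true_risk: float, se: float, sp: float) -> float:
    """Expected proportion classified as diseased given sensitivity and specificity."""
    return se * true_risk + (1.0 - sp) * (1.0 - true_risk)


def corrected_risk(obs_risk: float, se: float, sp: float) -> float:
    """Back-calculate the true risk from an observed risk and estimated Se/Sp.

    Inverts observed = se*p + (1 - sp)*(1 - p). The result can fall outside
    [0, 1] when se and sp are poorly estimated; it is returned unclipped so
    that such problems stay visible.
    """
    denom = se + sp - 1.0
    if denom <= 0:
        raise ValueError("correction requires se + sp > 1")
    return (obs_risk - (1.0 - sp)) / denom


# Differential scenario: true risks 0.30 (unexposed) and 0.60 (exposed),
# Sp = 0.98 in both groups, Se = 0.50 (unexposed) and 0.40 (exposed).
obs0 = observed_risk(0.30, 0.50, 0.98)
obs1 = observed_risk(0.60, 0.40, 0.98)
print(round(obs1 / obs0, 2))        # naive RR: 1.51

# Correcting with the true Se/Sp recovers the true RR of 2.
rr_exact = corrected_risk(obs1, 0.40, 0.98) / corrected_risk(obs0, 0.50, 0.98)
print(round(rr_exact, 2))           # 2.0

# Hypothetical: a small validation sample estimates exposed Se as 0.45, not 0.40.
rr_imprecise = corrected_risk(obs1, 0.45, 0.98) / corrected_risk(obs0, 0.50, 0.98)
print(round(rr_imprecise, 2))       # 1.77: only part of the bias is removed
```

The last line illustrates the caveat in the text: even a modest error in an estimated sensitivity leaves the corrected estimate well short of the true value, though it still indicates the direction of the bias.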
We have discussed the role that reproductive effect markers can play in observing subclinical events and in reducing disease misclassification. They also have potential uses for investigating pathogenesis.

Investigating Pathogenesis
While some maintain that epidemiology is the study of causes and not mechanisms (27,28), we believe that epidemiologists should try to take mechanism into account as a means of identifying and interpreting exposure-effect relationships (29)(30)(31)(32). In the endeavor to give epidemiology mechanistic underpinnings, biological markers can be an invaluable aid.
In 1986, we began recruiting first-trimester prenatal patients into a longitudinal study of maternal stress during pregnancy, designed to test a prespecified hypothesis about mechanism of action. Briefly, we hypothesized that sustained elevations of catecholamines, the stress hormones epinephrine and norepinephrine, and the metabolite MHPG (methoxyhydroxyphenylglycol) might, because of their vasoconstricting action, interfere with uteroplacental blood flow, which in turn could lead to vascular damage in the placenta and ultimately to decreased birthweight and other problems in the offspring (Fig. 1).
Figure 1. Hypothetical pathway from maternal psychosocial stress to reduced birthweight in offspring. UPBF, uteroplacental blood flow; LBW, low birthweight.
Catecholamine concentrations, favored by many experts as a sensitive and reliable indicator of a stress response (33), are implicated in experiments with pregnant animals as an intervening factor between stress and adverse outcome (34,35). Fortunately, they do not appear to be altered by a normal pregnancy until the time of labor (36) and hence could be used for the purpose we intended. A rise in catecholamines results in increased alpha-adrenergic activity, which may directly cause constriction of the uteroplacental arteries. The increased catecholamine levels could also stimulate production of prostaglandins, potent vasoconstrictors that have been shown to act on the feto-placental bed (37). A moderate decrease in uteroplacental blood flow could induce pathologic changes in the placenta that may adversely affect fetal growth. Table 2 lists the placental vascular abnormalities of particular interest to us.
In terms of logistics, collection of placental specimens presented few problems. The main requisite was coordination with the hospitals where subjects delivered, to ensure that all placentas would be submitted to us for pathologic examination irrespective of pregnancy outcome. Collection of urine specimens was potentially more problematic. Although the protocol asked subjects to provide a urine sample only once, at a standard point in gestation, we required a full 24-hr collection rather than the more convenient first-morning or spot sample. Because catecholamine excretion rates vary at different times of the day, the 24-hr collection is meant to prevent missing any temporal change in peak levels that might be associated with acute or chronic stress.
Obtaining 24-hr urines from 400 women was, in its way, a Sisyphean task, but several strategies helped ensure success. First, a member of our field staff of trained nurses visited each subject in her home to review the urine collection procedures. Second, in order to detect possible patterned change in catecholamine levels, we apportioned the 24-hr collection into three 8-hr aliquots. This made collection more manageable for study subjects. An additional benefit is that the aliquots assured us some usable information on women who missed a voiding (we might lose data for one 8-hr period but not for the whole 24 hr). Third, we provided participants with a department store shopping bag to use for carrying the urine collection materials (8-hr plastic container, ice pack, styrofoam container) when they left home during the day or evening. Finally, because mood, activity, smoking, drinking, and diet can influence catecholamine secretion, we gave the subjects a diary and asked them to keep a 24-hr log of such data. This helped focus them on the urine collection. As an inducement, we added a section for recording thoughts about the pregnancy, as they might do in a conventional diary.
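The three-aliquot design can be illustrated with a small bookkeeping sketch. Everything in it is hypothetical: the record layout, the period labels, the units (volume in mL, concentration in ug/L), and the rule that a 24-hr total is reported only when all three aliquots are usable. Because catecholamine excretion varies by time of day, the sketch keeps per-period hourly rates rather than scaling a partial collection up to 24 hr.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Aliquot:
    """One 8-hr portion of a 24-hr urine collection (hypothetical record layout)."""
    period: str                      # hypothetical label, e.g. "day", "evening", "night"
    hours: float                     # length of the collection period
    volume_ml: Optional[float]       # None if the aliquot was missed or unusable
    conc_ug_per_l: Optional[float]   # assayed concentration, None if unavailable


def excreted_ug(a: Aliquot) -> Optional[float]:
    """Amount excreted in one aliquot: concentration (ug/L) x volume (L)."""
    if a.volume_ml is None or a.conc_ug_per_l is None:
        return None
    return a.conc_ug_per_l * a.volume_ml / 1000.0


def summarize(aliquots: List[Aliquot]) -> Tuple[Dict[str, Optional[float]], float, Optional[float]]:
    """Return per-period hourly rates (ug/hr), usable hours, and a 24-hr total if complete."""
    rates: Dict[str, Optional[float]] = {}
    usable_hours = 0.0
    amounts = []
    for a in aliquots:
        amount = excreted_ug(a)
        if amount is None:
            rates[a.period] = None
        else:
            rates[a.period] = amount / a.hours
            usable_hours += a.hours
            amounts.append(amount)
    # Report a total only when every aliquot is usable; a partial collection is
    # not scaled up, because excretion rates differ by time of day.
    total = sum(amounts) if len(amounts) == len(aliquots) else None
    return rates, usable_hours, total


collection = [
    Aliquot("day", 8, 600, 20.0),
    Aliquot("evening", 8, 400, 25.0),
    Aliquot("night", 8, None, None),   # missed voiding: this period is lost
]
rates, hours, total = summarize(collection)
print(rates, hours, total)
```

With the night aliquot missing, the sketch still yields hourly rates for the day and evening periods (1.5 and 1.25 ug/hr) and 16 usable hours, the situation the text describes for a subject who misses a voiding.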
We have now virtually completed recruitment of the full cohort of 900 subjects. Overall, the participation rate among eligible women is about 77%, a good response for a longitudinal study with a demanding regimen (repeat interviews, blood samples, and in some cases, collection of urines and placentas). Financial incentives were offered to the first 100 subjects but were dropped after a test period showed equivalent recruitment rates when no compensation was offered. For the subset of 400 women who were asked to collect 24-hr urine specimens, compliance was a remarkable 96%. Of those, 89% have a usable 24-hr collection, 9% have usable 16-hr collections, and the remaining 2% have usable 8-hr collections. Analyses of urine specimens reported by subjects to be complete show dopamine and norepinephrine levels within the normal ranges for 24-hr samples, indicating adherence to protocol.
The field and laboratory work for the catecholamine and placental components of the study has added substantially to the expense, but the costs should be offset by gains in understanding the biology of psychosocial stress and the role of placental abnormalities in the pathogenesis of low birthweight and other perinatal problems. Perhaps effects of psychosocial stress on reproduction are limited to women with a heightened neuroendocrine response to stressors. Perhaps stress at the levels experienced by the study population of rural and suburban women has no clinical consequences for offspring. If so, are there more subtle effects detectable as placental changes? Social support has been suggested to ameliorate the effects of stress (38). Does social support do this at the level of physiologic response? The incorporation of biological markers allows us to address these interesting and important questions.

Conclusion
There is some controversy about the reliability, validity, and overall utility of biological exposure markers in epidemiologic studies. There is generally less debate about the value of biological effect markers, and the field of reproduction is fortunate to have several available. We have focused here on semen quality, menstrual hormones, early pregnancy loss, and placental abnormalities, but there are others, ready for use or under development, that are discussed in an upcoming report by the National Research Council (39). The challenge to reproductive researchers is to choose and use such markers well. Logistical and analytical problems of collecting and statistically evaluating large volumes of data may be formidable, but the benefits of effect markers (in terms of improving the power, validity, and cogency of reproductive studies) can be quantified and will often outweigh the costs. Much remains to be learned about human reproduction, and we need all the tools at our disposal.
We thank our colleagues Mervyn Susser and Diane McLean, and Allen Herman of the National Institute of Child Health and Human Development, with whom strategies were developed for utilizing biological markers in the Columbia Health and Pregnancy Study. We thank Steven Ng for assistance on the computer simulations of disease misclassification. George Freidman-Jimenez is supported by training grant #5T32-CA 09529 from the National Cancer Institute.