The temporal reliability of serum estrogens, progesterone, gonadotropins, SHBG and urinary estrogen and progesterone metabolites in premenopausal women

Background There is little existing research to guide researchers in estimating the minimum number of measurement occasions required to obtain reliable estimates of serum estrogens, progesterone, gonadotropins, sex hormone-binding globulin (SHBG), and urinary estrogen and progesterone metabolites in premenopausal women. Methods Using data from a longitudinal study of 34 women with a mean age of 42.3 years (SD = 2.6), we calculated the minimum number of measurement occasions required to obtain reliable estimates of 12 analytes (8 in blood, 4 in urine). Five samples were obtained over 1 year: at baseline, and after 1, 3, 6, and 12 months. We also calculated the percent of true variance accounted for by a single measurement and intraclass correlation coefficients (ICC) between measurement occasions. Results Only 2 of the 12 analytes we examined, SHBG and estrone sulfate (E1S), could be adequately estimated by a single measurement using a minimum reliability standard of having the potential to account for 64% of true variance. Other analytes required from 2 to 12 occasions to account for 81% of the true variance, and 2 to 5 occasions to account for 64% of true variance. ICCs ranged from 0.33 for estradiol (E2) to 0.88 for SHBG. Percent of true variance accounted for by single measurements ranged from 29% for luteinizing hormone (LH) to 92% for SHBG. Conclusions Experimental designs that take the natural variability of these analytes into account by obtaining measurements on a sufficient number of occasions will be rewarded with increased power and accuracy.


Background
Several active research programs are investigating the risk associated with serum estrogens, gonadotropins and urinary sex hormone metabolites for a variety of diseases including breast cancer [1], endometrial cancer [2], and osteoporosis [3]. The results of the few published studies suggest that the natural temporal variability (true variation over time, not variation due to storage or other factors) of some serum estrogens, gonadotropins and urinary sex hormone metabolites is sufficiently great that a single measurement occasion may be inadequate to ensure a reliable estimate [4][5][6]. Published intraclass correlation coefficients (ICC) vary between 0.06 and 0.62 for estradiol (E2) and between 0.52 and 0.69 for estrone (E1) [4]. Only the percent of free E2 and of SHBG-bound E2 have been found to be sufficiently reliable to account for as much as 50% of the variance in the true mean (ICC > 0.7).
The term reliability can refer either to the consistency of a measuring procedure or to the temporal stability of the target of measurement [7]. The definition of temporal reliability used in this study includes both those dimensions, but emphasizes the latter. While researchers can control error due to insufficient repeated measures by increasing the number of measurement occasions, obtaining measurements is expensive. It is therefore useful to have evidence-based guidelines for estimating the minimum number of occasions required to obtain a given degree of reliability for a particular analyte.
All types of measurement error distort, confound, or attenuate the tests of association that constitute one of the primary products of research [8,9]. Figure 1, though not exhaustive, shows the sources of variance in a measurement and the interrelationships between error and tests of model fit or significance.
The relation of a measurement to the object being measured can be represented as $\sigma_O = \sigma_T + \sigma_E$, where $\sigma_O$ = variance in the observed measurement of the target, $\sigma_T$ = variance in the true value of the target, and $\sigma_E$ = random variance, or error. If the true value of the target is invariant across measurements, i.e., if $\sigma_T = 0$, then $\sigma_O = \sigma_E$ and the observed variance will be purely a function of the unreliability of the measuring instrument. Conversely, if perfectly error-free measurement of the target could be assumed, i.e., if $\sigma_E = 0$, then $\sigma_O = \sigma_T$ and the observed variance would be purely a function of the temporal stability of the target. If $\sigma_E \neq 0$ and $\sigma_T \neq 0$, the observed variance will be a function of both the temporal stability of the target and the unreliability of the measuring instrument.
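The additive decomposition above can be illustrated numerically. The following is a minimal sketch using simulated values, not study data: the variances (4 for the true value, 1 for error), the sample size, and the seed are arbitrary assumptions, and true values and errors are assumed to be independent.

```python
import numpy as np

def simulate_observed_variance(var_true, var_error, n=100_000, seed=0):
    """Simulate observed = true + error and return the three sample variances."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, np.sqrt(var_true), size=n)
    errors = rng.normal(0.0, np.sqrt(var_error), size=n)  # independent of true values
    observed = true_values + errors
    return observed.var(ddof=1), true_values.var(ddof=1), errors.var(ddof=1)

# Hypothetical variances chosen only for illustration
var_o, var_t, var_e = simulate_observed_variance(var_true=4.0, var_error=1.0)

# With independent components, the observed variance is close to the sum
print(round(var_o, 2), round(var_t + var_e, 2))
```

Setting `var_true` to 0 reproduces the first special case (observed variance equals error variance), and setting `var_error` to 0 reproduces the second.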
Measurement error can result from a variety of factors, including true variance not captured by a particular measurement strategy, which may complicate the interpretation of temporal reliability estimates. These other factors include variance due to: fluctuations across cycle phases within each woman's menstrual cycle [10]; duration of sample storage prior to analysis [11]; limitations of the assay; multiple analysis batches [10]; multiple types of assays [12]; and multiple laboratories [10]. Ideally, estimates of as many sources of error as possible should be included when considering the impact of temporal reliability on measurement strategy. The objective of this study was to determine the following for various serum estrogens, gonadotropins, and urinary sex hormone metabolites: the minimum number of repeated measurements required for reliable estimates; the ICCs; and the amount of true variance accounted for by single measurements.

Experimental design
The data for this study come from a randomized double-blind study investigating the effects of a 100 mg/day soy isoflavone regimen on estrogen levels in 34 premenopausal women. A detailed description of the study design and the results of the intervention have been reported previously (Maskarinec et al., 2002) [13]. The Committee on Human Studies at the University of Hawaii approved the study protocol. Written informed consent was obtained from each subject prior to participation. Each of the two study groups consisted of 17 premenopausal women. Four women left the study before the end of the year and another was able to give only four blood draws for health reasons. Eligibility criteria included: an age range of 35-46 years; an average intake of less than 7 servings of soy foods per week; no prior cancer diagnosis (except basal cell skin carcinoma); no use of oral contraceptives or hormone preparations within the past three months; no intention of becoming pregnant within the next year; an intact uterus and ovaries; self-defined regular menstrual periods; and no serious medical condition. Subjects had a mean age of 42.3 years (SD = 2.6) and a mean weight of 65.6 kg (SD = 12.8). Subjects were ethnically diverse: 18 were Caucasian; 6 were Chinese; 5 were Japanese; 5 were Hawaiian.

Sample collection
Subjects were asked to donate 5 urine and blood samples, one at baseline and one after 1, 3, 6, and 12 months of participation. All samples were collected approximately 5 days after ovulation (approximately day 19 in a 28-day cycle). Subjects used ovulation kits (Ovuquick test kits from Quidel, La Jolla, CA) to determine the time of ovulation. This kit detects the mid-cycle rise of LH using morning urine with a sensitivity of 35 mIU/mL of LH, and its predictive validity with respect to ovulation has been estimated as 93% [14]. Although the use of a minimum progesterone value to exclude data from anovulatory cycles from the analyses helped ensure acquisition of mid-luteal phase samples, only 52% of samples were obtained on exactly the 5th day from ovulation. Ninety-one percent were obtained between the 4th and the 6th day from ovulation. Blood samples were drawn at a commercial laboratory in the morning, between 7 and 9 o'clock, to control for circadian rhythm in hormone levels. Serum and urine samples were stored at -80°C after separation and aliquoting.

Serum analysis
Hormone assays were conducted at the Department of Obstetrics and Gynecology, University of Southern California (Los Angeles, CA) in the Reproductive Endocrine Research Laboratory. The analyses for E2, free E2, E1, E1S, progesterone, SHBG, follicle stimulating hormone (FSH), and LH were conducted in 2 batches. Samples for these analytes collected at baseline, month 1 and month 3 were analyzed in batch 1, and 6-month and 12-month samples were analyzed in batch 2 one year later. E2, E1, progesterone, FSH, LH, and SHBG were quantified in serum by specific and sensitive radioimmunoassays (RIAs). Prior to RIA, E1 and E2 were first extracted with ethyl acetate:hexane (2:3) and then purified by Celite column partition chromatography, using ethylene glycol as stationary phase [15]. E1 and E2 were eluted off the column with 15% and 40% toluene in isooctane, respectively. Tritiated (3H) E1 and E2 were used as internal standards to follow procedural losses. FSH and LH levels were determined using an immunoradiometric assay (IRMA). E1S, progesterone and SHBG were measured by direct RIAs using kits obtained from Diagnostic Systems Laboratories, Webster, Texas. Free E2 (E2 not bound to SHBG or albumin) was determined by calculation using a computerized algorithm described previously [16]. The majority of intra-assay CVs for all analytes were below 10% (Table 1), indicating good quality control in the laboratory. They ranged from <0.5% for SHBG to 13.0% in the low concentration range of batch 1 for E1.

Statistical analysis
The SAS statistical software package version 8.2 (SAS Institute Inc., Cary, NC, 1999-2001) was used to perform the statistical analyses. All statistics were computed using logged values when raw values were not normally distributed. To ensure that all measurements in the analysis were from the same time in the menstrual cycle, observations were included only if the concurrent progesterone value was at least 5 ng/mL, the minimum value expected after ovulation has occurred. Because analyses for 8 of the 12 analytes were conducted in two batches, we included consideration of error due to between-batch variance in our analysis of the temporal stability of these analytes. Therefore, estimates of temporal stability for these 8 analytes were calculated for the total number of samples and for the first and second batches separately.
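The inclusion rule and log transformation described above can be sketched as a small data-preparation step. This is an illustration in Python rather than the SAS code used in the study; the record layout, the field names (`subject`, `occasion`, `progesterone`, `e2`), the example values, and the choice of the natural logarithm are all assumptions.

```python
import math

MIN_LUTEAL_PROGESTERONE = 5.0  # ng/mL, the inclusion threshold stated in the text

def select_luteal_observations(records, analyte, log_transform=True):
    """Keep records with progesterone >= 5 ng/mL; return (subject, occasion, value) tuples."""
    selected = []
    for rec in records:
        if rec["progesterone"] < MIN_LUTEAL_PROGESTERONE:
            continue  # sample not confirmed as post-ovulatory; excluded
        value = rec[analyte]
        if log_transform:
            value = math.log(value)  # natural log (assumed; the base is not stated)
        selected.append((rec["subject"], rec["occasion"], value))
    return selected

# Made-up records for illustration only
example = [
    {"subject": 1, "occasion": 0, "progesterone": 12.0, "e2": 150.0},
    {"subject": 1, "occasion": 1, "progesterone": 3.1, "e2": 90.0},
    {"subject": 2, "occasion": 0, "progesterone": 5.0, "e2": 120.0},
]
kept = select_luteal_observations(example, "e2")
print(len(kept))
```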
Two types of estimates of the number of measurement occasions (O) necessary to obtain an adequately reliable estimate were computed. The first, the relative type (O_R), includes the between-subject variance. O_R was computed using the formula proposed by Nelson et al. [19]:
$$O_R = \frac{r^2}{1 - r^2} \cdot \frac{s_W^2}{s_B^2}$$
where $r$ is the correlation between the observed and the true mean analyte values for an individual over a year, $s_W^2$ is the within-subject variance, and $s_B^2$ is the between-subject variance. Setting $r$ to 0.9 results in a calculation of the number of measurement occasions required to obtain an estimate that would account for $0.9^2$, or 81%, of the true variance in the target. Ninety-five percent confidence intervals (95% CI) for O_R were computed using a published method [20].
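As a rough illustration of the relative estimate, the sketch below assumes the commonly cited form of the Nelson et al. formula, O_R = r^2 / (1 - r^2) x (within-subject variance / between-subject variance). The function name and the example variances are hypothetical and are not taken from the study data.

```python
import math

def occasions_relative(r, within_var, between_var):
    """Number of occasions needed so that the observed mean correlates r with the true mean."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    if between_var <= 0.0 or within_var < 0.0:
        raise ValueError("variances must be positive (within_var may be zero)")
    return (r ** 2 / (1.0 - r ** 2)) * (within_var / between_var)

# Hypothetical analyte whose within- and between-subject variances are equal
o_r_81 = occasions_relative(0.9, within_var=1.0, between_var=1.0)  # 81% of true variance
o_r_64 = occasions_relative(0.8, within_var=1.0, between_var=1.0)  # 64% of true variance

# In practice the estimate is rounded up to a whole number of occasions
print(math.ceil(o_r_81), math.ceil(o_r_64))
```

The two calls correspond to the 81% and 64% reliability standards used in this paper (r = 0.9 and r = 0.8).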
The second estimate of the number of measurement occasions necessary to obtain an adequately reliable estimate, the absolute type (O_A), includes only within-subject variance; in its formula, σ_w is the within-subject variance [21]. By adjusting the denominator, this method allows for the desired approximation to the true mean to be specified as a percentage.
Setting the denominator to 0.2 results in a calculation of the number of occasions required to obtain an estimate that is within 20% of the true mean. A SAS macro using Proc Varcomp and Proc Means to produce estimates of O_R, O_A, and related statistics is available from the authors.
ICCs measure the proportion of variance attributable to targets of measurement as a ratio of between-subject variance to total variance [22] and are suitable for comparing variables of the same measurement class [23]. We computed two types of ICCs using the notation developed by Shrout and Fleiss [22]: ICC(2,1) was computed for each analyte using all 5 measurement occasions to estimate the temporal reliability of the analyte; ICC(2,k) was computed between batches to estimate the contribution of between-batch variance to the temporal reliability estimate.
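ICC(2,1) was computed as
$$\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k - 1)\,EMS + k\,(OMS - EMS)/n}$$
where BMS is the between-subjects mean square, EMS is the error mean square, k is the number of observations, OMS is the observations mean square, and n is the number of subjects [22]. ICC(2,k) was computed as
$$\mathrm{ICC}(2,k) = \frac{BMS - EMS}{BMS + (OMS - EMS)/n}$$
We applied the formulas by Shrout and Fleiss [22] to obtain 95% CIs.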
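The two coefficients can be computed from the mean squares of a two-way layout. The sketch below assumes the standard Shrout and Fleiss definitions of ICC(2,1) and ICC(2,k) and a complete subjects-by-occasions table with no missing values; the function name and the small 6 x 4 table are illustrative and are not study data.

```python
import numpy as np

def icc_two_way_random(data):
    """Return (ICC(2,1), ICC(2,k)) for an n-subjects x k-occasions array without missing values."""
    x = np.asarray(data, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    ss_subjects = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_occasions = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_subjects - ss_occasions
    bms = ss_subjects / (n - 1)             # between-subjects mean square
    oms = ss_occasions / (k - 1)            # observations (occasions) mean square
    ems = ss_error / ((n - 1) * (k - 1))    # error mean square
    icc_single = (bms - ems) / (bms + (k - 1) * ems + k * (oms - ems) / n)
    icc_average = (bms - ems) / (bms + (oms - ems) / n)
    return icc_single, icc_average

# Illustrative table: 6 subjects (rows) measured on 4 occasions (columns)
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc21, icc2k = icc_two_way_random(table)
print(round(icc21, 2), round(icc2k, 2))  # 0.29 0.62
```

The averaged coefficient is higher than the single-measurement coefficient for the same table, which is the general pattern when a mean of several values replaces a single value.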
To estimate the percentage of true variance accounted for by a single measurement, we assumed that the best available estimate of the true variance was the total variance for all occasions.
After calculating the Pearson correlation of each occasion with all other occasions, we considered the squared average of these correlations as the estimate of the most likely percent of true variance for which a single occasion could account. We used the formula $\%\sigma_T = 100 \times \left(\sum r_T / o\right)^2$, where $\%\sigma_T$ is the percent of true variance, $r_T$ is the Pearson correlation of each occasion with the total of all other occasions, and $o$ is the number of occasions.
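The calculation described above can be sketched as follows, assuming a complete subjects-by-occasions table and taking "the total of all other occasions" to be the row-wise sum of the remaining columns. The function name and both example tables are hypothetical.

```python
import numpy as np

def percent_true_variance(data):
    """Squared mean of the occasion-versus-rest correlations, expressed as a percentage."""
    x = np.asarray(data, dtype=float)
    n_occasions = x.shape[1]
    correlations = []
    for j in range(n_occasions):
        rest_total = np.delete(x, j, axis=1).sum(axis=1)  # total of all other occasions
        correlations.append(np.corrcoef(x[:, j], rest_total)[0, 1])
    mean_r = sum(correlations) / n_occasions
    return 100.0 * mean_r ** 2

# Perfectly stable ranking across occasions: every occasion tracks the others exactly
stable = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]
# Less stable made-up values
noisy = [[1, 2, 1], [2, 1, 3], [3, 4, 2], [4, 3, 4]]

pct_stable = percent_true_variance(stable)
pct_noisy = percent_true_variance(noisy)
print(round(pct_stable, 1), pct_noisy < pct_stable)
```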

Results
Overall means, number of samples, and means by measurement occasion for all analytes (Table 2) indicate the overall stability of the analytes over one year. Although estrogen and progesterone levels were on average 7% higher and gonadotropins and urinary sex hormone metabolites 10% lower in the intervention than in the control group (data not shown), none of the differences was even close to statistical significance (p values ranged from 0.16 for estrone sulfate to 0.90 for estrone). Because of this homogeneity, results in this study were collapsed across experimental groups. The decreases in E2 and E1 are the result of laboratory drift and were independent of intervention status [13].
The number of measurement occasions required to obtain a reliable estimate differed considerably by analyte. In the case of SHBG (Figure 2), within-subject variance is small relative to between-subject variance. There is little variation within subjects relative to the variation between subjects, resulting in small O_R and O_A estimates (0.48 and 1.78 respectively). The pregnanediol-3-glucuronide (PDG) values (Figure 3) illustrate the case in which within-subject variation is high and the subjects' values overlap one another considerably, resulting in relatively large O_R and O_A estimates (5.17 and 10.27 respectively). Finally, Figure 4 (logged E2) depicts the case in which within-subject variance is small, but so is the variance between subjects. In this case, the small within-subject variance results in a small O_A estimate (0.34), but because the within-subject variance is not small relative to the between-subject variance, the O_R is relatively large (8.26).
Because ICCs include both within- and between-subject variance, ICCs closely followed O_R rather than O_A estimates. ICC(2,1) ranged from 0.30 to 0.88 (for LH and SHBG respectively, Table 3). The ICCs for absolute agreement between the two analysis batches ranged from 0.47 to 0.96 (for E1 and SHBG respectively, Table 4). Estimates of ICCs were generally consistent across batches, with similar estimates based on analysis of all 5 occasions and on each batch separately. The between-batch ICC for E1, however, was less than 0.5, suggesting that the batch 1 ICC may be a better indicator than the ICC based on all samples. The percent of true variance accounted for by a single measurement ranged from 29% to 92% (for LH and SHBG respectively).
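The correspondence between ICCs and the relative estimate can be made explicit with a short derivation. As a sketch, assume the simplest one-way definition, ICC = s_B^2 / (s_B^2 + s_W^2), and the commonly cited Nelson-type form of the relative estimate; both are simplifying assumptions, since ICC(2,1) additionally separates out variance between occasions.

```latex
\[
\mathrm{ICC} = \frac{s_B^2}{s_B^2 + s_W^2}
\quad\Longrightarrow\quad
\frac{s_W^2}{s_B^2} = \frac{1 - \mathrm{ICC}}{\mathrm{ICC}}
\]
\[
O_R = \frac{r^2}{1 - r^2}\cdot\frac{s_W^2}{s_B^2}
    = \frac{r^2}{1 - r^2}\cdot\frac{1 - \mathrm{ICC}}{\mathrm{ICC}}
\]
\[
\text{Example: } \mathrm{ICC} = 0.5,\ r = 0.9
\quad\Longrightarrow\quad
O_R = \frac{0.81}{0.19}\cdot 1 \approx 4.26
\]
```

Under this simplification, a single occasion suffices (O_R of at most 1) exactly when the ICC is at least r^2, that is, at least 0.64 for r = 0.8 and at least 0.81 for r = 0.9.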

Discussion
We have provided estimates of the minimum number of measurement occasions required to ensure adequate reliability for two types of experimental aims. Analyses in epidemiologic studies involve calculations in which between-subject as well as within-subject variance is important. Therefore, O_R will usually be the appropriate index of the minimum number of occasions needed to obtain a reliable estimate. Estimates of O_R based on our sample suggest that only SHBG and E1S had sufficient temporal stability to be adequately reliable with a single measurement when the desired amount of variance to account for was set as low as 64%. A single measurement of any of the other analytes would be unlikely to account for even 50% of the true variance. For cases in which the within-subject variance is the only variance of interest, e.g., when the measured value of an analyte will be compared with a fixed standard, O_A will be the appropriate index. The omission of between-subject variance from the formula for calculating this statistic produces very different results from O_R. Several of the analytes that were adequately reliable with a single measurement or very few measurements when between-subject variance was a factor required higher numbers of measures when only within-subject variance was involved, and vice versa.

Figure 2 Sex hormone-binding globulin values for all participants by measurement occasion
This study confirms previous findings that SHBG may be reliably measured in premenopausal women using a single occasion. It also indicates that E1S may be reliably measured using one sample only. More importantly, our results suggest that none of the other analytes examined meet minimal reliability requirements that would permit confidence in single measures. These results are in agreement with the wide range of ICCs reported in previous studies [4][5][6]. Our conclusions are limited to the collection of samples at midluteal phase, however, and may not generalize to other phases of the menstrual cycle.
The use of ICCs to estimate the agreement between analysis batches differs from their use as an index of temporal reliability. The appropriate type of ICC for this purpose uses a mean of several values rather than single values and is typically higher than that calculated using single values. Though the ICCs between batches were higher than those estimating temporal reliability, they were relatively low, demonstrating the importance of measuring all samples in one batch when possible. As was previously noted [11], error due to time in storage will affect estimates of temporal reliability. Analyzing in multiple batches is one means of decreasing this source of error, but runs the risk of increasing error due to multiple batches. Until better estimates of the impact of storage time on each of these analytes are available, however, it will be difficult to draw conclusions about whether error due to multiple analysis batches or error due to storage time has the more detrimental effect on temporal reliability.

Figure 3 Pregnanediol-3-glucuronide values for all participants by measurement occasion
Several sources of error are effectively beyond researchers' capacity to control. For example, the validity and reliability of the best assay available for measuring a given analyte cannot be increased through improving study design. Other sources of error, however, can be dramatically reduced through the use of appropriate designs. These strategies may include increasing the sample size to reduce the impact of random error, analyzing all samples in one batch, and using a sufficient number of repeated measures to obtain an adequately reliable estimate. It is also possible, though not uncontroversial, to control error statistically by correcting for attenuation using validation data [24].

Figure 4 Logged estradiol values for all participants by measurement occasion
Several improvements, in addition to a larger sample and more repeated measures, would have increased confidence in the results of our study. First, if the effects of storage time on the analytes were known, we could have taken into account the contributions of this source of variance to our temporal reliability estimates and distinguished its impact from that due to assay reliability. Second, obtaining blood and urine samples on day 5 following ovulation was most appropriate for the measurement of progesterone and near-optimal for SHBG, but may not have been the best day to obtain estimates of the other analytes [25]. Third, though our data were drawn from an intervention study in which no results approached significance, a more clearly homogeneous sample would have been preferable. Fourth, variation in menstrual cycle length and variance due to pulsatility of excretion were additional sources of error.
Finally, our estimates were based on targets that changed across measurements, and we could not assume error-free measurements. Consequently, we were not able to precisely distinguish between the contributions of assay reliability and the contributions of each analyte's natural variability to our estimates of temporal reliability. However, despite some limitations, this study provided significant new insights into the variability of sex hormones, gonadotropins, and urinary hormone metabolites in premenopausal women during a one-year period. Our estimates of temporal reliability represent the combined computation of the consistency of a measure across repeated measurements and the temporal fluctuations in the target of measurement.

Conclusions