Overview and quality assurance for the oral health component of the National Health and Nutrition Examination Survey (NHANES), 2011–2014

Background: Following its implementation in the oral health component of the National Health and Nutrition Examination Survey (NHANES) in 2009–2010, a full-mouth periodontal examination was continued during 2011–2014. Additionally, a comprehensive dental caries assessment was re-introduced in 2011 after a 6-year absence from NHANES. This report provides oral health content information and results of dental examiner reliability statistics for key intraoral assessments conducted by dentists during 2011–2014.
Methods: During the 2011–2014 NHANES, 17,463 persons age 1 and older representing the US civilian, non-institutionalized population received an oral health examination. From this group, 387 individuals underwent a repeat examination conducted by the survey reference examiner. A combination of examiner training and calibration, electronic data capture, and ongoing performance evaluation with statistical monitoring was used to ensure conformance with NHANES protocols and data comparability to prior data collection periods.
Results: During 2011–2014, the Kappa statistics for the tooth count assessment ranged from 0.96 to 1.00; for untreated dental caries, Kappa scores were 0.93 to 1.00. The overall Kappa statistics for identifying combined moderate-severe periodontitis using the CDC/AAP case definition were 0.66 and 0.69, with percent agreement of 83 to 85%, during 2011–2014. When evaluating inter-examiner agreement using information collected from 3 periodontal sites for comparability to the NHANES 2003–04 periodontal examination protocols, Kappa scores for combined moderate-severe periodontitis were 0.65 and 0.80 during 2011–2014. For total mean attachment loss and pocket depth across all 6 periodontal sites, the intraclass correlation coefficients (ICCs) ranged from 0.80–0.90 and 0.79–0.86, respectively. Site-specific mean attachment loss ICCs were generally higher for the 4 interproximal measurements compared to the 2 mid-site probing measurements, and this observation was similar in 2009–2010.
Conclusion: During 2011–2014, results overall indicate a high level of data quality and substantial examiner reliability for tooth count and dentition; reliability for periodontal disease, across various assessments, was at least moderate. When comparing the 2011–2014 examiner performance to findings from 2003–2004, comparable concordance between the examiners and the reference examiner exists.


Background
For the past fifty years, a number of health examination surveys of the US population have been conducted by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC). Between 1959 and 1994, these health examinations were periodic. The earlier surveys were known as the National Health Examination Survey (NHES) and were later renamed the National Health and Nutrition Examination Survey (NHANES) when a dietary assessment was added in 1970. Since 1999, NHANES has been a continuous, annual survey. The current NHANES draws a nationally representative sample for each year of data collection, but data are released in two-year periods to protect confidentiality and increase statistical reliability. The primary objectives of NHANES include monitoring the prevalence, awareness, treatment, and control of selected diseases and health conditions in the US and assessing factors that affect morbidity, mortality, and quality of life.
NHANES is the main source of information on oral health (OH) status in the US. This includes data on tooth loss, dental caries, and periodontal disease, among other OH conditions. Since 1999, a few novel OH assessments have been introduced to NHANES. Information has been collected on dental erosion and tooth wear, functional occlusal contacts, and OH-related quality of life [1]. Using surplus sera, antibodies to nearly two dozen bacteria, including established periodontal pathogens, have been quantified using checkerboard immunoblotting [2], and periodontal status was evaluated with full-mouth probing of six periodontal sites around each tooth [3,4]. More recently, information on home water fluoride and plasma fluoride levels for youths was collected, and a unique imaging approach was used to assess dental fluorosis [5,6].
During the past decade, NHANES has made some important contributions to the methodology of OH surveillance. A basic screening exam (BSE) was conducted between 2005 and 2008 and was modeled on a dental caries and sealant screening protocol developed by the American Association of State and Territorial Dental Directors [7,8]. The BSE was designed to be used in epidemiological settings to collect subject-level information when resources are limited and dentists are not available to conduct the screenings. Another important methodological contribution to OH surveillance that has occurred with NHANES has been the exploration of alternative approaches to the typical resource-intensive examination for periodontal diseases. This has contributed to the development of standard case definitions for surveillance of periodontitis [9,10] and the development and evaluation of self-report measures for predicting prevalence of periodontitis [11,12]. These efforts also have contributed to international recognition of the need for basic reporting guidelines for periodontal disease measures in epidemiologic studies [13].
Quality assurance processes for national surveys are critically important for the collection of accurate data given that the information originating from these data is frequently used to inform national health policy discussion and legislation. The implementation and reporting of systematic quality assurance procedures is needed to minimize errors in data collection and improve data reliability and transparency, which facilitate the generation of valid findings and the acceptability of results by the public [14]. NHANES is an important source for nationally representative oral health data. The simultaneous collection of hundreds of other health data points, including physical measurements, diet, laboratory findings, and genetic samples, enables researchers to develop and test hypotheses on a plethora of health-related issues. Equally important, these data are very useful for assessing changes in population health status over time. The aim of this methodological paper is to describe important aspects of the 2011-2014 NHANES oral health examination and to report on the quality assurance (QA) program by examining data quality, including statistical analyses of reliability for the dentition and periodontal assessments.

NHANES 2011-2014 overview
The civilian, non-institutionalized population living in the 50 states and the District of Columbia was the target population for NHANES 2011-2014. The survey used a stratified, multistage probability sampling design to select participants. The 2011-2014 sampling domains were defined by age group, sex, low-income status, and race and Hispanic origin. As part of the sample design, the survey oversamples some population subgroups to facilitate the calculation of more precise estimates for these groups. Oversampling for the 2011-2014 survey included all Hispanics, non-Hispanic Blacks, non-Hispanic Asians, some low-income persons, and non-Hispanic white and other persons 80 years and older. Sample design characteristics for NHANES 2011-2014 are presented in Table 1. Informed consent was obtained from all participants, and all data collection protocols were approved by the NCHS Research Ethics Review Board. Additional information, including protocol numbers, is available at: https://www.cdc.gov/nchs/nhanes/irba98.htm.
As in previous NHANES, participants are interviewed in their homes and then given a complete health examination, which includes the collection of biologic specimens for lab testing, at a mobile examination center (MEC). Following household identification and the administration of a screening questionnaire by field interviewers to determine participant eligibility, informed consent is obtained for the home interview. Trained interviewers conduct the home interview, which has two main components: the household and sample person interviews. Following completion of the home interview, interviewed participants are asked to participate in a health examination in a MEC and undergo a second series of informed consent procedures for the health examination. Additional information on the background and content of NHANES can be found at: http://www.cdc.gov/nchs/about/major/nhanes/datalink.htm.

NHANES 2011-2014 Oral health component
The 2011-2014 OH component comprised information collected during the home interview and the MEC examination, as well as specimens for lab analyses. The home interview OH questions covered topics such as dental visit frequency, perceived OH status, the receipt of preventive health information, oral pain, periodontal disease, and oral cancer. In 2013-2014, a tap water sample was collected from the homes of persons age 19 and younger to test fluoride levels. Although sampled persons age 1 and older were generally eligible for the OH interview and exam in 2011-2014, specific OH assessments were age-based and eligibility varied among study participants (Table 2).
For 2011-2014, the MEC oral health exam was performed by dentists who held a dental license in at least one U.S. jurisdiction and were trained in NHANES survey methods. To assist with data entry, other MEC personnel were trained as dental recorders. NHANES has two full-time MEC teams collecting data concurrently, with a primary dentist assigned to each team. During 2011-2014, three primary dental examiners were assigned to the MEC teams, and they performed nearly 94% of all dental examinations. All dental examiners performing OH examinations in this four-year period were trained and calibrated by Dr. Dye (author), as were the examiners in the 1999-2004 NHANES cycle. All dental examinations took place in a designated room inside the MEC that includes a portable dental chair, a light, compressed air, and digital imaging equipment.
If the sampled person was age 30 or older, a series of health screening questions was administered by the dentist to determine eligibility for the periodontal examination. Participants who reported having a heart transplant, an artificial heart valve, congenital heart disease, or a history of bacterial endocarditis were excluded from periodontal probing. In 2011-2014, approximately 4% of the assessed adults were excluded from the periodontal examination because of medical history concerns.
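The screening logic described above can be sketched in a few lines of Python. This is only an illustration: the `Participant` record and its field names are hypothetical and do not correspond to NHANES variable names, and the requirement of at least one natural permanent tooth is taken from the description of the full-mouth periodontal examination elsewhere in this section.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical record; field names are illustrative, not NHANES variables."""
    age_years: int
    natural_permanent_teeth: int
    heart_transplant: bool = False
    artificial_heart_valve: bool = False
    congenital_heart_disease: bool = False
    bacterial_endocarditis_ever: bool = False


def eligible_for_periodontal_exam(p: Participant) -> bool:
    """Apply the age, dentition, and medical-history screens in order."""
    if p.age_years < 30:
        return False  # periodontal examination limited to adults 30 and older
    if p.natural_permanent_teeth < 1:
        return False  # at least one natural permanent tooth is required
    medically_excluded = (
        p.heart_transplant
        or p.artificial_heart_valve
        or p.congenital_heart_disease
        or p.bacterial_endocarditis_ever
    )
    return not medically_excluded
```

For example, a 45-year-old with 24 natural teeth and no reported conditions would pass all three screens, whereas the same person reporting an artificial heart valve would be excluded from probing.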
For all sample persons age 1 and older, the first OH assessment began with "tooth count" to identify the presence or absence of permanent and/or primary teeth, including retained dental root tips and dental implants. The second assessment was to identify which teeth had been affected by the dental caries process, which could later be used to help create the classic "DMFT" score (decayed, missing, and filled teeth). Following the caries assessment, evaluations for the presence of dental sealants and dental fluorosis were completed for children and adolescents. Additionally, persons age 6-19 undergoing a standard blood draw had a portion of their blood tested for plasma fluoride concentration in 2013-2014.
Adults aged 30 years and older were eligible for a full-mouth periodontal examination (FMPE) if they had at least one natural permanent tooth present. Gingival recession and pocket depth measures were made at six sites per tooth, namely the distal-facial (DF), mid-facial (BF), mesio-facial (MF), distal-lingual (DL), mid-lingual (BL), and mesio-lingual (ML) sites, using a Hu-Friedy periodontal probe color-coded and graduated in 2-mm increments. All four quadrants were examined, and all

Quality assurance
Dental examiners underwent a comprehensive training and calibration period, including periodic monitoring and recalibration, to ensure quality OH data. During the training phase, trainees participated in a comprehensive slide presentation designed to familiarize them with the study protocols. The presentation covered assessment criteria, data recording, infection control procedures, and emergency preparedness guidelines. Following a demonstration examination, a series of standardization sessions was conducted in which the reference examiner and trainee examined the same set of approximately 60-65 volunteers. Trainees were encouraged to ask questions during their examinations, and feedback was provided during each examination round to minimize differences in the application of criteria and coding. Following the standardization sessions, a set of preliminary calibration sessions was performed on an additional 30-35 volunteers, during which the trainees conducted independent replicate examinations without discussion. Data from these calibration sessions were analyzed to assess consistency between each trainee and the reference examiner. Training typically lasted 40 hours and was conducted in the Washington, DC, metropolitan area. A final calibration session for each examiner was conducted at the MEC during normal field operations over a three- to four-day period.
Three main activities were employed to facilitate the ongoing collection of quality OH data on NHANES. First, to reduce data recording errors, all data were entered directly into an electronic data management system, and automated data management utilities checked for out-of-range values at the time of entry. Second, project managers periodically observed staff performing data collection to ensure that protocols were followed. Finally, the reference examiner visited each dental examiner approximately twice a year to replicate approximately 25 to 30 random OH examinations during each MEC visit. Data from these replicate exams were used to produce inter-rater reliability statistics to objectively evaluate examiner agreement. Although dental examiners were aware that inter-rater evaluations were being conducted, they were blinded to each other's observations. Because all inter-rater reliability statistics were calculated after the MEC evaluation visits, the findings were unavailable for real-time use. However, all examiners received general feedback on their performance from the reference examiner immediately following each evaluation visit.
To ascertain examiner agreement compared to the reference examiner, percent agreement, Kappa statistics, and intraclass correlation coefficients (ICCs) were calculated using SAS software (version 9.3, SAS Institute Inc., Cary, NC). For purposes of this report, reliability statistics were analyzed and presented for tooth count, dental caries and sealants, and periodontal status. The calculation of inter-examiner reliability statistics followed procedures similar to those used for previous NHANES oral health reliability reports to facilitate comparisons [1,4,8,15]. Kappa statistics calculated for tooth count assessed four indicators: complete tooth loss, retention of all third molars, having at least one retained root tip, and tooth retention, which was the total number of primary and permanent teeth present. Kappa statistics were calculated for detecting at least one tooth with dental caries experience or only untreated caries in either the primary dentition or the permanent dentition. For the evaluation of periodontal status, a number of different definitions and categories were used, including the CDC Periodontitis Workgroup (CDC/AAP) case definitions, which have been recommended for periodontitis surveillance [9,10]. In addition to calculating Kappa statistics for categorical periodontal status variables, inter-rater reliability for continuous variables was assessed by comparing ICCs generated from subject-level means (mm) for loss of attachment (AL) and pocket depth (PD) using measurements obtained from all six periodontal sites. Reliability statistics were calculated at the person (or subject) level and are unweighted; that is, none of the results presented in this report incorporate the NHANES sample weights. Table 3 shows  Table 4. Examiner A is the same person in both periods (2011-2012 and 2013-2014), whereas Examiner B represents a different examiner for each of the two data collection periods.
Overall, the Kappa statistics for the tooth count assessment ranged from 0.96 to 1.00, with percent agreement ranging from 99 to 100%, for the combined 2011-2014 period. For any dental caries in the primary dentition, Kappa scores were 0.93 and 1.00 and percent agreement was 97 and 100% during 2011-2014. For any untreated dental caries, the Kappa scores were 0.74 and 1.00 and percent agreement was 90 and 100% during the same period. In both 2011-2012 and 2013-2014, Examiner A matched the reference examiner's results for caries experience in primary teeth. For dental caries in the permanent dentition, Kappa scores were 0.93 and 0.96 and percent agreement was 97 and 98% during 2011-2014. For untreated dental caries, the Kappa scores were 0.82 and 0.91 and percent agreement was 94 and 97% during the same period. The results from Examiner A regarding replicating the reference examiner for caries experience in the permanent dentition was 0.    The assessment of inter-examiner reliability for key continuous measures of periodontal status, such as mean AL, mean PD, and mean recession (CJ), is presented in Table 5.

Discussion
Oral health data from NHANES are used for two important purposes: as the principal source of information for surveillance of dental diseases and OH-related issues at the national level, and to explore potential associations between OH status and other conditions. During the 2011-2014 NHANES, the oral health component was modified to make the oral assessments comparable to many procedures used during 1999-2004. Unlike 2005-2010, when the OH examination was administered by a registered health technologist or dental hygienist, beginning in 2011 the OH examination was administered by a dentist. In 2011, the assessment for dental caries at the tooth-surface level and an evaluation for dental fluorosis were re-introduced and are comparable to the procedures used during 1999-2004. In 2013, a digital imaging process was added to explore the utility of assessing dental fluorosis indirectly through images, rather than through direct clinical assessment [5,16]. The full-mouth periodontal examination (FMPE) was continued during 2011-2014, producing six years of continuous FMPE data. This is the fifth paper in a series of methodology reports that cover 16 years of OH data collection since the introduction of the continuous NHANES in 1999 [1,4,8,15]. The information is organized and presented to facilitate comparisons between the various survey periods, for example, when comparing examiner reliability for determining dental caries experience from 2001-2004 to 2011-2014, or using examiner data from the FMPE to calculate partial-mouth periodontal examiner reliability statistics for 2011-2014 to permit comparison to previously reported statistics from 2003-2004. As with prior papers, two important measures of examiner reliability, intraclass correlation coefficients for continuous data and Kappa statistics for categorical data, were examined to compare data reliability across multiple years.
Although percent (proportion) agreement is presented, Kappa statistics are typically preferable as a measure of reliability because they better differentiate between examiners. There is a subtle difference between the concepts of reliability and agreement: reliability is a ratio of variability, whereas agreement is a proportion of identical ratings [17]. Another reason to prefer Kappa over percent agreement is that the Kappa computation corrects the agreement for what would be expected by chance alone [18]. This adjustment usually produces a value lower than the simple percent agreement, and the greater the agreement expected by chance (guessing), the greater the reduction.
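To make the chance correction concrete, the following minimal Python sketch computes percent agreement and an unweighted Cohen's Kappa for two raters. It is an illustration only, not the SAS procedure used for this report; the function names and the ten ratings are hypothetical and are not NHANES data.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Proportion of subjects given identical ratings by both examiners."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's Kappa: observed agreement corrected for chance."""
    n = len(ratings_a)
    p_observed = percent_agreement(ratings_a, ratings_b)
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    # Chance agreement: product of the two examiners' marginal proportions,
    # summed over every category that either examiner used.
    p_expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        # Kappa is undefined when both examiners use one identical category;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical ratings (not NHANES data): 1 = condition present, 0 = absent.
examiner = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
reference = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
agreement = percent_agreement(examiner, reference)
kappa = cohen_kappa(examiner, reference)
```

In this hypothetical example the two raters agree on 8 of 10 subjects (percent agreement 0.80), the agreement expected by chance is 0.58, and Kappa is therefore (0.80 - 0.58) / (1 - 0.58), or about 0.52.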
Guidelines proposed by Landis and Koch for interpreting Kappa scores have been used by many researchers across many different health fields. To summarize Landis and Koch: a Kappa statistic ≤ 0 reflects "poor agreement", a score > 0 but ≤ 0.20 is "slight agreement", 0.21-0.40 is "fair agreement", 0.41-0.60 is "moderate agreement", 0.61-0.80 is "substantial agreement", and > 0.80 is "almost perfect agreement" [19]. To maintain consistency in interpretation with previous NHANES oral health data quality reports, we have used the guidelines suggested by Landis and Koch to evaluate examiner performance. An alternative, more "user-friendly" approach to interpreting Kappa is the use of 0.60 as a threshold, where any Kappa below 0.60 suggests inadequate reliability between an examiner and the reference [20]. Although the OH assessments implemented on the national health surveys are designed to minimize examiner subjectivity, some are inherently more difficult and more subjective than others.
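The two interpretation schemes above can be expressed as a small lookup, sketched here in Python. The function names are hypothetical, and the sketch assumes Kappa values are reported to two decimal places, so that the bands as quoted (for example 0.21-0.40) can be treated as contiguous.

```python
def landis_koch_label(kappa):
    """Map a Kappa value to the Landis and Koch category quoted in the text."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa <= 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def meets_threshold(kappa, threshold=0.60):
    """Simpler rule: a Kappa below the threshold suggests inadequate reliability."""
    return kappa >= threshold
```

Under this sketch, a Kappa of 0.66 is labeled "substantial" and meets the 0.60 threshold, whereas a Kappa of 0.45 is labeled "moderate" and falls below it.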
Because the trainer and reference examiner were the same person, individual examiner agreement with the reference examiner was expected to be high. Examiner agreement was almost perfect for edentulism, tooth
To assess examiner reliability when using continuous measures of periodontal status, ICCs were calculated. The basic formula construct for the computation of an ICC is: reliability = variability between examiners / (variability between examiners + measurement error). The resulting statistic ranges from 0 (completely unreliable) to 1 (perfect reliability) [21]. An important consideration when interpreting ICCs is examiner bias or measurement error. As measurement error approaches the same value as examiner variability, the calculated ICC becomes closer to 0.5. When utilizing ICCs in dental research, it has been suggested that a threshold of 0.75 or greater would represent excellent reliability [22]. For the 2011-2014 periodontal data, the mean attachment loss and pocket depth ICC statistics as measured for all six sites around all teeth were 0.84 and 0.86 for each examiner, indicating excellent overall reliability for AL and PD. The ICC statistics for recession were also excellent (0.90 and 0.93). When evaluating the individual sites, the mid-facial and mid-lingual pocket depth sites were the most challenging for reproducibility among the examiners, specifically for Examiner B (2011-2012) compared with the reference. For Examiner B, the ICC statistics were 0.28 (mid-facial) and 0.55 (mid-lingual). Although the ICCs from these two sites indicate that data quality is less than optimal, information obtained from the mid-tooth sites is not used to derive the CDC/AAP case definitions for periodontitis.
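As an illustration of how a variance-ratio statistic of this kind behaves, the sketch below computes a one-way random-effects ICC, often written ICC(1,1), from subject-level means recorded by two raters. This is an assumption made for illustration: the report does not state which ICC form was computed in SAS. Textbook formulations of this form typically take the numerator variance to be the spread between the examined subjects. The function name and the measurements are hypothetical and are not NHANES data.

```python
def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: list of per-subject lists, each holding one value per rater
    (for example, a subject's mean pocket depth in mm from each examiner).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 subjects and 2 raters per subject")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    # Between-subject mean square: spread of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: disagreement between raters on the same subject.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, subject_means) for x in row
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical subject-level mean pocket depths (mm): [examiner, reference].
close_agreement = [[1.0, 1.2], [2.0, 1.8], [3.0, 3.2], [4.0, 3.8]]
icc_close = icc_oneway(close_agreement)
```

With these hypothetical values the disagreement within subjects is small relative to the spread between subjects, so the ICC is close to 1; larger rater disagreement on the same subjects would pull the statistic down.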
The difference in performance between examiners is most striking for the mid-tooth sites. Across both data collection periods, the ICC statistics for Examiner A were 0.65 and 0.66 for mid-facial PD and 0.70 and 0.74 for mid-lingual PD. Examiner B during 2013-2014 was more comparable with Examiner A. Interestingly, during 2009-2010, the tooth sites presenting the greatest challenge to data reliability for pocket depth measures were again the mid-facial and the mid-lingual periodontal sites. During this period, the examiners were not dentists, but dental hygienists, and the ICC statistics ranged from 0.15 to 0.59 for both periodontal sites. In contrast, for both examiners during 2009-2010, the calculated ICC statistics for overall mean attachment loss for all six sites around all teeth were 0.80 and 0.88, indicating excellent overall reliability for AL. The performance of examiners during 2009-2010 (dental hygienists) and 2011-2014 (dentists) was remarkably similar for overall mean AL, with ICCs ranging from 0.80 to 0.90. There was also considerable consistency when comparing each examiner (both dental hygienists and dentists) with the reference for detecting moderate and severe periodontitis from 2009 to 2014. When summarizing assessments involving periodontal status, the Kappa statistics for the three main examiners during 2011-2012 and 2013-2014 ranged from 0.45 to 1.00 and ICCs ranged from 0.28 to 0.96. Generally, inter-examiner reliability was greater for clinical recession than for pocket depth, and the area of greatest challenge was measuring pocket depth at the mid-facial and mid-lingual sites. Because pocket depth is an essential component for calculating attachment loss, aggregate measures of periodontitis can be affected by increased variability in inter-examiner reliability when examiners are not consistent across all measured sites. Our findings suggest that if these sites are to be measured in future epidemiologic studies of periodontal status, additional measures for training and monitoring should be considered to reduce potential measurement error and to improve examiner reliability.
When comparing 2013-2014 examiner reliability to 2003-2004, examiner performance as measured by Kappa appears to be generally similar. Examiners from 2013-2014 appear to have slightly higher agreement with the reference when detecting untreated caries in primary and permanent teeth, and similar agreement when assessing tooth retention and dental sealants. When using comparable methodology assessing periodontal status at the 3 sites per tooth that characterized the standard partial-mouth periodontal examination administered during 2003-2004, examiners from 2003-2004 and 2013-2014 performed similarly. Overall, examiners from both periods assessed periodontal disease with adequate reliability (Kappa > 0.60), and reliability was excellent for mean loss of attachment (ICC > 0.75). However, reliability was less than optimal for mean probing depth for Examiner A during 2003-2004 (ICC < 0.75).
There are several important considerations for analysts using the NHANES oral health data. Regarding measurement of oral diseases and conditions, data collection methodology has periodically changed between 1999 and 2014. During 1999-2004, dental caries followed a comprehensive, dental surface-level assessment for each tooth, and periodontal assessments were made using a partial-mouth examination. From 2005 to 2008, a Basic Screening Examination (BSE) assessing dental caries and sealants was utilized, and this was administered by health technologists. During 2009-2010, dental hygienists conducted a full-mouth periodontal examination (FMPE) and the BSE. Between 2011 and 2014, the dental examiners were dentists, and they conducted an FMPE and a comprehensive assessment for dental caries using a protocol comparable to 1999-2004. Evaluating examiner reliability represents one element of quality assurance. Another element is assessing the validity of the method used to collect data. Although data collection methods periodically changed over the years, the expectation for accuracy remained high because accepted protocols were utilized. Analysts also should be cautious regarding interpretation of two-year OH estimates calculated from the continuous NHANES. In addition to sampling variations because of the survey design or methodology changes affecting data collection, insufficient sample size for some subgroups could produce statistically unreliable estimates. Researchers should routinely evaluate whether the denominator count is > 30 and the relative standard error is less than 30%. For most OH calculations, using four years of data for analyses will reduce the effect of sampling variation between the two two-year periods and produce more accurate estimates.
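The two screening criteria mentioned above (denominator count greater than 30 and relative standard error below 30%) can be checked with a short helper, sketched here in Python. The function names and example figures are hypothetical. The sketch assumes the analyst has already obtained a standard error from an appropriate design-based variance estimation procedure; it does not compute one.

```python
def relative_standard_error(estimate, standard_error):
    """Relative standard error, expressed as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("RSE is undefined for an estimate of zero")
    return abs(standard_error / estimate) * 100.0


def estimate_is_reliable(estimate, standard_error, denominator_count,
                         min_count=30, max_rse_percent=30.0):
    """True when the denominator exceeds min_count and the RSE is below the cap."""
    rse = relative_standard_error(estimate, standard_error)
    return denominator_count > min_count and rse < max_rse_percent


# Hypothetical subgroup estimate: prevalence 20.0% with a standard error of 2.5.
rse_example = relative_standard_error(20.0, 2.5)
reliable_example = estimate_is_reliable(20.0, 2.5, denominator_count=120)
```

In this hypothetical case the relative standard error is 12.5% and the denominator is 120, so the estimate passes both checks; an estimate of 5.0% with a standard error of 2.0 (RSE of 40%) would not.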

Conclusion
Overall, findings from this report indicate that OH data reliability for NHANES 2011-2014 reflects at least substantial agreement for tooth count and dentition. For periodontal disease, the majority of assessments showed at least moderate agreement, but some assessments fell below what would be considered adequate agreement (Kappa < 0.60). Evaluations of inter-examiner reliability for the FMPE conducted between 2009 and 2014 indicate adequate reliability among the examiners (both dental hygienists and dentists) when compared to the reference for several measurement indicators of periodontal status, including the detection of moderate and severe periodontitis as well as mean AL and PD. The main area of concern is measurements made at the mid-facial site and, to a lesser extent, at the mid-lingual site. Examiner reliability during 2011-2014 was comparable to that during 2003-2004. Finally, the 2011-2014 QA findings presented in this report add to the continuous quality monitoring of the OH examination on NHANES. These inter-examiner findings provide insight into data reliability monitoring in a large-scale ongoing epidemiologic study.