Article

Infrared Thermography for Measuring Elevated Body Temperature: Clinical Accuracy, Calibration, and Evaluation

1
Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, MD 20993, USA
2
Department of Mechanical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
3
University Health Center, University of Maryland, College Park, MD 20742, USA
*
Author to whom correspondence should be addressed.
Sensors 2022, 22(1), 215; https://doi.org/10.3390/s22010215
Submission received: 30 October 2021 / Revised: 6 December 2021 / Accepted: 20 December 2021 / Published: 29 December 2021
(This article belongs to the Special Issue Contactless Sensors for Healthcare)

Abstract

Infrared thermographs (IRTs) implemented according to standardized best practices have shown strong potential for detecting elevated body temperatures (EBT), which may be useful in clinical settings and during infectious disease epidemics. However, optimal IRT calibration methods have not been established, and the clinical performance of these devices relative to the more common non-contact infrared thermometers (NCITs) remains unclear. In addition to confirming the findings of our preliminary analysis of clinical study results, the primary intent of this study was to compare methods for IRT calibration and identify best practices for assessing the performance of IRTs intended to detect EBT. A key secondary aim was to compare IRT clinical accuracy to that of NCITs. We performed a clinical thermographic imaging study of more than 1000 subjects, acquiring temperature data from several facial locations that, along with reference oral temperatures, were used to calibrate two IRT systems based on seven different regression methods. Oral temperatures imputed from facial data were used to evaluate IRT clinical accuracy based on metrics such as clinical bias ($\Delta_{cb}$), repeatability, root-mean-square difference, and sensitivity/specificity. We proposed several calibration approaches designed to account for the non-uniform data density across the temperature range; a constant offset approach tended to show better ability to detect EBT. As in our prior study, inner canthi or full-face maximum temperatures provided the highest clinical accuracy. With an optimal calibration approach, these methods achieved a $\Delta_{cb}$ within ±0.03 °C with standard deviation ($\sigma_{\Delta cb}$) less than 0.3 °C, and sensitivity/specificity between 84% and 94%. Results of forehead-center measurements with NCITs or IRTs indicated reduced performance. An analysis of the complete clinical data set confirms the essential findings of our preliminary evaluation, with minor differences.
Our findings provide novel insights into methods and metrics for the clinical accuracy assessment of IRTs. Furthermore, our results indicate that calibration approaches providing the highest clinical accuracy in the 37–38.5 °C range may be most effective for measuring EBT. While device performance depends on many factors, IRTs can provide superior performance to NCITs.

1. Introduction

Fever is a key symptom of many infectious diseases that have produced epidemics, including Severe Acute Respiratory Syndrome (SARS) in 2003, Influenza A (H1N1) in 2009, Ebola Virus Disease (EVD) in 2014, and Coronavirus (COVID-19) in 2019–present [1,2,3,4,5,6]. While fever screening alone is not an effective method to stop an epidemic, it is likely that for many infectious diseases it can be part of a larger approach to risk management. In several recent epidemics, fever screening has been used in high-traffic areas and at the entrances of high-risk sites, such as public transportation hubs, hospitals, and assisted living facilities, yet there is little evidence that this approach has made a significant impact [7]. This may be due in part to the implementation of ineffective instrumentation and calibration algorithms, as well as a lack of viable, consistently applied standard procedures for deployment and screening.
Body temperature can be measured at different body sites. These measurements can be used to impute temperatures at other body sites that are more meaningful, but less convenient to access. The site where the temperature is acquired is called the measurement site, whereas the site to which the device output temperature refers is called the reference site. For example, a non-contact infrared thermometer (NCIT) might measure skin temperature on the forehead and convert this value to an imputed oral temperature for display. In this case, the forehead-center is the measurement site and the oral cavity (e.g., sublingual) is the reference site. The process of imputing reference site temperature from measurement site temperature is called site conversion. The measurement and reference sites can be the same (same-site measurement) or different (cross-site measurement).
Through autonomic physiological mechanisms, humans can maintain internal temperature (also known as core body temperature) within very narrow limits despite wide fluctuations in ambient air temperature, so as to ensure proper physiological function [8]. Human thermoregulation processes include chemical reactions, perfusion inside the body, and heat transfer with the environment through radiation, conduction, convection, and evaporation. Temperatures at different peripheral body sites can be quite different and have more fluctuation due to factors such as ambient temperature [9,10], exercise [11], metabolic rate [12], circadian rhythm [13,14], age [15], and menstrual cycle [16]. Therefore, it is difficult to accurately define the relation between temperatures at two different body sites with a mathematical model due to the complexity of human thermoregulation mechanisms. Thus, the accuracy of output temperature from a cross-site measurement is often lower than that from a same-site measurement, since imputing the reference site temperature from the measurement site temperature will increase cumulative error.
NCITs [17,18] and infrared thermographs (IRTs, also known as thermal cameras) [19] represent the primary device types currently used in practice for fever screening during epidemics. IRTs and NCITs use similar principles for temperature measurement. Although NCITs are highly portable, inexpensive, and have been widely used for fever screening during epidemics [20], their accuracy has been called into question, particularly relative to IRTs [21,22]. This may be due to a range of factors including the common use of forehead measurement locations, which tend to be more susceptible to fluctuations due to environmental factors like ambient temperature and airflow [23]. The effectiveness of prior IRT-based approaches to reduce the spread of disease has also been mixed. While some human subject studies demonstrated that IRTs can estimate body temperature with moderately high accuracy [21,24,25,26], others indicated that IRTs are not effective for fever screening [27,28,29]. In many situations, it may not be practical to implement all of the required controls necessary to ensure a high degree of thermal screening performance. Low IRT effectiveness may also be attributable in part to the use of IRTs with insufficient performance specifications, improper deployment practices [30,31], and/or a lack of febrile subjects in clinical studies.
Laboratory accuracy [32] is a key performance characteristic of IRTs. International standard IEC 80601-2-59:2017 provides recommendations for laboratory accuracy evaluation of fever-screening IRTs [30]. However, clinical accuracy determined from a clinical study is much more relevant since it incorporates real-world variability due to the device, subjects and environment, as well as the temperature conversion step between measurement and reference sites. Currently, there are no consensus methods to evaluate the clinical accuracy of IRTs. A technical report, ISO/TR 13154:2017 [31], describes best practices for IRT deployment, implementation and operation, yet evaluation of IRT clinical accuracy is not covered. Two international standards which address methods to evaluate the clinical accuracy of thermometers, namely ASTM E1965-98:2016 [33] and ISO 80601-2-56:2017 [34], provide relevant insights, yet they have not been adapted for use in IRT performance testing.
During clinical studies, temperatures should be measured both with the IRT on the face and with a clinical thermometer of established clinical accuracy at the reference site. While the literature indicates that a number of internal tissue sites, including the pulmonary artery [35], esophagus, urinary bladder, and rectum [36], are suitable for estimating core temperature, they are impractical for large-scale clinical fever screening studies. Tympanic membrane and oral cavity thermometry are often used; however, the former approach has shown poor performance in some studies because of dirt/cerumen, inaccurate placement, and lack of skill of the measurer [36,37,38]. Oral thermometry provides a well-correlated surrogate location for core temperature and is not very susceptible to confounding factors [36,39,40].
In our recent prior article [41], we provided an initial analysis of our clinical study data, focusing on the 596 subjects measured within the room temperature range of 20–24 °C. In the current work, we have analyzed the entire dataset of more than 1000 subjects measured within the room temperature range of 20–29 °C. The primary intent of this study was to compare methods for IRT calibration based on clinical data and identify best practices for assessing the clinical performance of IRTs intended to detect elevated body temperatures (EBT). A key secondary aim was to compare IRT clinical accuracy to that of NCITs. Specifically, we (a) acquired IRT and reference temperature data in febrile and non-febrile subjects using methods that closely adhered to international standards, (b) analyzed the relationship between reference temperature and facial temperatures at different locations, (c) evaluated the impact of different training/calibration techniques on clinical accuracy, (d) compared different metrics as clinical accuracy indicators, and (e) compared results to similar data from NCITs.

2. Methods

Over the course of 18 months, from November 2016 to May 2018, we conducted a clinical study at the Health Center of the University of Maryland (UMD) at College Park according to the guidelines of the Declaration of Helsinki. The study was approved by both FDA and UMD Institutional Review Boards under FDA IRB study #16-011R and written informed consent was obtained from all subjects.

2.1. Experimental Setup and Temperature Measurement Procedure

The primary devices used included an oral thermometer (SureTemp Plus 690, Welch Allyn, San Diego, CA, USA) with established clinical accuracy, a webcam (C920, Logitech, Lausanne, Switzerland), two IRTs (IRT-1: 320 × 240 pixels, A325sc, FLIR Systems Inc., Nashua, NH, USA; IRT-2: 640 × 512 pixels, 8640 P-series, Infrared Cameras Inc., Beaumont, TX, USA), a blackbody (SR-33, CI Systems Inc., Carrollton, TX, USA) as the external temperature reference source (ETRS) for temperature drift compensation, and six models of NCITs. The laboratory accuracy of both IRT systems satisfied the IEC 80601-2-59:2017 standard requirements [30] in terms of stability, drift, minimum resolvable temperature difference, and radiometric temperature laboratory accuracy, as shown in our previous study [32]. An IRT system (also known as a screening thermograph) is composed of an IRT and an ETRS [30,32]. For brevity, we call an IRT system an IRT in this paper.
The study lasted for 18 months and covered all four seasons. Together with inefficient air conditioning in summer, this explains the wide ambient temperature range of 20–29 °C. To minimize the influence of outside temperature, each subject was preconditioned by waiting for at least 15 min in the draft-free study area inside the building before starting the measurements. For each subject, four rounds of measurements were performed within ~15 min. During each round, temperatures were measured with two different IRTs, six models of NCITs, and a contact oral thermometer.
The IRTs used skin emissivity and ambient temperature as input parameters to calculate skin temperature automatically. Publications have suggested that the emissivity values of the anterior surface of the eyeball and skin are 0.975 [42] and 0.98 [43,44], respectively. Therefore, skin emissivity of 0.98 was used as an IRT input parameter, which is also recommended by the IEC 80601-2-59:2017 standard [30]. The ambient temperature was also measured with a weather tracker prior to each measurement as an IRT input parameter. We did not perform any other laboratory calibration/correction except for the temperature compensation with an ETRS (see Section 2.3.1 in our previous publication [41] for details; the ETRS emissivity value of 0.98 was used in our algorithm as suggested by the manufacturer).
Temperature measured with the contact oral thermometer was used as the reference ($T_{ref}$). NCIT measurements performed in this study are addressed in greater depth elsewhere [45]. Additional information about the study methods (e.g., device setup, environmental control, measurement procedure) can be found in our published paper [41]. Ideally, the ambient temperature should be 20–24 °C and relative humidity 10–50%, based on the ISO/TR 13154 document [31]. In our study, however, ambient temperature was between 20 and 29 °C, and relative humidity was between 10% and 62% (Figure 1). While beyond the recommended ranges, these conditions more realistically emulate real-world fever screening settings.

2.2. Subject Demographics

Data were acquired and analyzed from a total of 1020 subjects for IRT-1 and 1010 subjects for IRT-2. Demographic information for study subjects is summarized in Table 1. Overall, about 11% of these subjects exhibited reference temperature above 37.5 °C.

2.3. Facial Region Delineation and Temperature Measurement

We identified facial key-points in IRT images by matching landmarks on visible light images to thermal images with an image registration approach [46] as well as manual labeling. Based on the identified facial key-points, different regions/points on thermal images were defined and the temperatures at these regions were obtained from thermal images (Figure 2). Since IRTs exhibit varying degrees of instability and drift [32], all IRT-measured temperatures were compensated with a blackbody (ETRS) in the system. Details about the definitions of these temperatures and temperature compensation with an ETRS can be found in Section 2.2 and Section 2.3.1, respectively, in our previous publication [41].
For brevity, we restricted our analysis to four main facial temperatures ($T_{skin}$): $T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, and $T_{max}$. Inner canthi are considered to be optimal locations for non-contact temperature measurement [30]. Perfused by the internal carotid artery, they are typically the warmest regions on the face and have high stability and strong correlation with internal body temperatures [19,47,48]. However, there is no consensus about how canthi temperature should be read (e.g., how to identify the location, the size of the region to use, the number of pixels, averaging vs. maximum value). Among all the temperatures obtained from the inner canthi region, our initial study demonstrated that $T_{CEmax}$, the maximum temperature of the extended canthus region (see Figure 2), has the best correlation with the reference oral temperature $T_{ref}$ and the highest sensitivity (Se) and specificity (Sp) values for fever screening [41]. Therefore, we chose $T_{CEmax}$ for further study in this paper. Our previous work also demonstrated that the whole-face maximum temperature ($T_{max}$) is easy to localize/calculate and has comparable performance to $T_{CEmax}$, especially considering that for 59.5% of subjects, $T_{max}$ and $T_{CEmax}$ have the same location. Please see reference [41] for the distribution of thermal maxima in full-face images. Since many NCITs measure temperature from the forehead-center location with a small sensor, $T_{FC}$ measured with an IRT was used as a surrogate for NCITs. Other NCITs use a sensor array to detect temperature in a larger forehead region; $T_{FCmax}$ was used as a surrogate for such devices since a similar region is detected.

2.4. Clinical Data

Data from 1115 subjects were originally collected. Of these, 6 subjects had incomplete records. The data for 56 subjects were also removed because the difference between the two oral temperature readings was greater than 0.5 °C, or only one oral temperature reading was recorded. The large difference might have resulted from an operator error (e.g., the oral thermometer moved) or from the subject having recently smoked or ingested cold or hot food or drink [49]. Of the remaining 1053 subjects, we further excluded 33 subjects for IRT-1 and 43 subjects for IRT-2 whose images had degraded quality due to motion artifacts. This left data from 1020 subjects measured with IRT-1 and 1010 subjects measured with IRT-2.
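The exclusion steps above can be checked with a few lines of arithmetic. The variable names in this sketch are illustrative only; the counts are the ones quoted in the text.

```python
# Subject counts quoted in the text above.
collected = 1115
incomplete_records = 6
oral_reading_exclusions = 56
motion_exclusions = {"IRT-1": 33, "IRT-2": 43}

# Subjects left after removing incomplete and unreliable oral records.
remaining = collected - incomplete_records - oral_reading_exclusions

# Per-IRT totals after removing images degraded by motion artifacts.
final_counts = {irt: remaining - n for irt, n in motion_exclusions.items()}

print(remaining)
print(final_counts)
```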
The data for each IRT were separated into two groups: Group 1, with ambient temperature ranging from 20 to 24 °C, and Group 2, from 24 to 29 °C (Table 2). The temperature ranges are different because the clinical study lasted a long time at two different locations (a small room and a hallway), resulting in large ambient temperature variation. Group 1 data were first analyzed in our prior work [41], since ISO/TR 13154:2017 [31] recommends an ambient temperature range of 20–24 °C. We analyzed Group 2 data with the same methodology as the Group 1 data in terms of the correlation coefficients and the area under the curve (AUC) values for different receiver operating characteristic (ROC, described further in Section 2.6.2) curves. The results show that both groups have similar performance in terms of correlation coefficients (Table 3) and AUC values (Table 4). In this study, we evaluate IRT clinical accuracy with more metrics than in our previous analysis, which requires a larger amount of data for calibration and testing. Therefore, both Group 1 and Group 2 data were used in the current paper.

2.5. Regression Methods for Imputing Oral Temperature

Many IRTs convert measured skin temperature ($T_{skin}$) to an imputed corresponding temperature at a reference body site [34], often sublingual oral temperature ($T_{oral}$), which is called cross-site measurement in this paper. In this study, we evaluated the clinical accuracy of two IRTs based on a cross-site measurement approach. Data acquired for each subject include thermal images, NCIT readings (analyzed in [45]), and reference sublingual temperature ($T_{ref}$). Thermal images were used to extract $T_{skin}$ at different regions of interest ($T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, and $T_{max}$). The conversion from $T_{skin}$ to $T_{oral}$ required the use of a calibration curve, so subjects for each IRT were randomly separated into training and testing sets. The training set (60% of the subjects, 612 and 606 for IRT-1 and IRT-2, respectively) was used to establish the relationship between different $T_{skin}$ and $T_{ref}$. The $T_{skin}$ values of the testing set (the remaining 40% of subjects, 408 and 404 for IRT-1 and IRT-2, respectively) were converted to $T_{oral}$ values based on the calibration curve, then compared with $T_{ref}$ to evaluate clinical accuracy.
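As a minimal sketch of the random 60/40 separation and the subsequent site conversion, the following uses hypothetical helper names (`split_subjects`, `impute_oral`) and an arbitrary random seed; the randomization procedure actually used in the study is not specified here, and the linear form of the calibration curve is an assumption.

```python
import random

def split_subjects(subject_ids, train_fraction=0.6, seed=0):
    """Shuffle subject IDs and split them into training and testing lists."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # the seed is an illustrative choice
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

def impute_oral(t_skin, slope, intercept):
    """Apply a linear calibration curve to one skin temperature (deg C)."""
    return slope * t_skin + intercept

# 1020 subjects, as for IRT-1 in the text.
train_ids, test_ids = split_subjects(range(1020))
print(len(train_ids), len(test_ids))
```

With 1020 subjects this reproduces the 612/408 split quoted for IRT-1; the same call with 1010 subjects gives 606/404.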
The relationship between $T_{skin}$ and $T_{ref}$ can be determined with different regression methods. In our previous study [41], we observed that $T_{skin}$ and $T_{ref}$ appear to be related by a constant offset or a linear relation. Therefore, constant offset and ordinary linear regression methods are applied here. Quadratic or higher order polynomial regressions are also considered. Since $T_{ref}$ values likely contain significant error, Deming regression may also be appropriate [50].
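The three basic calibration fits can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the Deming fit assumes an error-variance ratio `delta` of 1 (the study's value is not given here), and the data are synthetic.

```python
import math
import random
from statistics import fmean

def _moments(x, y):
    """Return the means and (population) second moments of x and y."""
    mx, my = fmean(x), fmean(y)
    sxx = fmean((a - mx) ** 2 for a in x)
    syy = fmean((b - my) ** 2 for b in y)
    sxy = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy

def fit_constant_offset(t_skin, t_ref):
    """T_oral = T_skin + c, with c the mean training difference."""
    return 1.0, fmean(r - s for s, r in zip(t_skin, t_ref))

def fit_ordinary_linear(t_skin, t_ref):
    """Ordinary least squares of T_ref on T_skin."""
    mx, my, sxx, _, sxy = _moments(t_skin, t_ref)
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_deming(t_skin, t_ref, delta=1.0):
    """Deming regression; delta is the assumed ratio of the error
    variance in T_ref to the error variance in T_skin."""
    mx, my, sxx, syy, sxy = _moments(t_skin, t_ref)
    d = syy - delta * sxx
    slope = (d + math.sqrt(d * d + 4 * delta * sxy * sxy)) / (2 * sxy)
    return slope, my - slope * mx

# Synthetic training data: skin about 1.5 deg C below oral, plus noise.
rng = random.Random(1)
t_ref = [rng.gauss(37.0, 0.5) for _ in range(300)]
t_skin = [r - 1.5 + rng.gauss(0.0, 0.3) for r in t_ref]

for name, fit in [("constant offset", fit_constant_offset),
                  ("ordinary linear", fit_ordinary_linear),
                  ("Deming", fit_deming)]:
    slope, intercept = fit(t_skin, t_ref)
    print(f"{name}: T_oral = {slope:.3f} * T_skin + {intercept:.3f}")
```

Because noise in the independent variable attenuates the ordinary least-squares slope, the Deming slope is at least as steep as the ordinary linear one whenever the two temperatures are positively correlated.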
Since the distribution of $T_{ref}$ values is not uniform across the temperature range (see the kernel density curves in Section 3.1), with significantly less data at low and high temperatures, three additional regression approaches were considered. Weighted linear regression is a technique that adjusts the influence of individual data points based on a predefined criterion [50]. Common weighting methods are often based on variance or coefficient of variation (CV). For example, a constant CV least-squares regression gives each point a weight inversely proportional to the square of the values on the x-axis [50]. We implemented a weighted regression method with the weight being inversely related to the kernel density of the independent variable, i.e., greater weight was applied to a temperature range with fewer data points. A second approach, called a binning method here, involved dividing the training data into small intervals (“bins”) and averaging the data in each interval into one value for regression. A third approach used to mitigate the uneven data distribution was segmented linear regression, also known as piecewise regression. In this method, training data were separated into several segments and linear regression was applied to each. The equations for each segment were forced to agree at the edges to ensure continuity.
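The density-weighted and binning ideas can be illustrated with a short sketch. Several details here are assumptions rather than the study's choices: the Gaussian kernel and its 0.2 °C bandwidth, the exact inverse-density weight, the 0.2 °C bin width, and all function names. Segmented regression is omitted for brevity.

```python
import math
import random
from statistics import fmean

def weighted_linear_fit(x, y, w):
    """Weighted least squares line y = slope * x + intercept."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, my - slope * mx

def kernel_density(x, bandwidth):
    """Gaussian kernel density estimate evaluated at each sample."""
    norm = len(x) * bandwidth * math.sqrt(2.0 * math.pi)
    return [sum(math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) for xj in x) / norm
            for xi in x]

def fit_density_weighted(t_skin, t_ref, bandwidth=0.2):
    """Give sparse temperature ranges more weight (weight = 1 / density)."""
    density = kernel_density(t_skin, bandwidth)
    return weighted_linear_fit(t_skin, t_ref, [1.0 / d for d in density])

def fit_binned(t_skin, t_ref, bin_width=0.2):
    """Average the data within each T_skin bin, then fit the bin means."""
    bins = {}
    for s, r in zip(t_skin, t_ref):
        bins.setdefault(math.floor(s / bin_width), []).append((s, r))
    xs = [fmean(s for s, _ in pts) for pts in bins.values()]
    ys = [fmean(r for _, r in pts) for pts in bins.values()]
    return weighted_linear_fit(xs, ys, [1.0] * len(xs))

# Synthetic, unevenly distributed training data (most subjects afebrile).
rng = random.Random(2)
t_ref = [rng.gauss(37.0, 0.4) for _ in range(180)] + [rng.gauss(38.5, 0.4) for _ in range(20)]
t_skin = [r - 1.5 + rng.gauss(0.0, 0.3) for r in t_ref]
print(fit_density_weighted(t_skin, t_ref))
print(fit_binned(t_skin, t_ref))
```

Both variants reduce the dominance of the densely populated afebrile range: the first by reweighting individual points, the second by letting each bin count once regardless of how many subjects it contains.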

2.6. Clinical Accuracy Assessment

The clinical accuracy of IRTs can be evaluated in two ways. One way is to see whether IRTs can accurately measure body temperature in a specific temperature range, called temperature measurement accuracy in this paper. The other way is to see whether IRTs can screen out subjects with EBT from those without EBT, called diagnostic performance in this paper.

2.6.1. Metrics for Temperature Measurement Accuracy

We evaluated the temperature measurement accuracy of IRTs using several different approaches. Since there is no standard that covers clinical study data analysis for IRTs, standards for thermometers were used to inform our methodology. The standards ISO 80601-2-56:2017 [34] and ASTM E1965-98:2016 [33] implement three key metrics: clinical bias ($\Delta_{cb}$), standard deviation (SD) of $\Delta_{cb}$ ($\sigma_{\Delta cb}$), and clinical repeatability ($\sigma_r$). $\Delta_{cb}$ is the mean difference between $T_{oral}$ and $T_{ref}$ values for all subjects in the testing set. It shows the systematic error of the devices under test. Measurement precision was evaluated using $\sigma_{\Delta cb}$, which is based on the SD of differences between $T_{oral}$ and $T_{ref}$. A value equal to $2 \times \sigma_{\Delta cb}$ is often called the limit of agreement ($L_A$), as it shows the magnitude of potential disagreement between outputs of two devices when used on the same human subject. Difference plots are used to illustrate $\Delta_{cb}$ and $\sigma_{\Delta cb}$.
Root-mean-square (RMS) difference ($A_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(T_{oral} - T_{ref})^2}$, where $n$ is the number of subjects) between $T_{oral}$ and $T_{ref}$ is another metric used to assess clinical measurement accuracy in medical devices [51]. While $A_{rms}$ will not indicate the direction of error (e.g., overestimate or underestimate) or the error distribution, it does quantify the cumulative magnitude of error. We implement it here to provide a single accuracy metric that combines the impact of bias and precision, as well as to ensure that positive and negative local bias values do not cancel out to give an erroneous impression of strong performance, as can occur with $\Delta_{cb}$.
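The metrics defined above can be computed from paired imputed and reference temperatures as in the sketch below. The function name and the example values are hypothetical, and the use of the sample SD (dividing by n − 1) is an assumption; the standards cited should be consulted for the exact definition.

```python
import math
from statistics import fmean, stdev

def accuracy_metrics(t_oral, t_ref):
    """Clinical bias, its SD, limit of agreement, and RMS difference."""
    diffs = [o - r for o, r in zip(t_oral, t_ref)]
    bias = fmean(diffs)                           # clinical bias
    sd = stdev(diffs)                             # sample SD of the differences (assumed n - 1)
    a_rms = math.sqrt(fmean(d * d for d in diffs))
    return {"bias": bias, "sd": sd, "limit_of_agreement": 2.0 * sd, "a_rms": a_rms}

# Made-up example values in deg C, for illustration only.
t_oral = [36.9, 37.2, 37.6, 38.4]
t_ref = [37.0, 37.1, 37.8, 38.3]
metrics = accuracy_metrics(t_oral, t_ref)
print(metrics)
```

With the sample SD, the three quantities are linked by $A_{rms}^2 = \Delta_{cb}^2 + \frac{n-1}{n}\sigma_{\Delta cb}^2$, which is why $A_{rms}$ reflects both bias and precision.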
Regression analysis [50] can also provide useful insight into the quality of temperature measurements. We generated scatter plots of $T_{oral}$ against $T_{ref}$ and fit linear trendlines to the data; these trendlines were then compared with the ideal (i.e., $T_{oral} = T_{ref}$). Pearson correlation coefficients ($r$ values) were also obtained to quantify the degree of linear correlation between $T_{oral}$ and $T_{ref}$.

2.6.2. Metrics for Diagnostic Performance

In addition to methods focused on temperature measurement accuracy, we also implemented diagnostic performance assessment techniques to evaluate fever screening effectiveness for each IRT. These analyses involved calculation of sensitivity (true positive rate, Se = TP/P, where TP and P represent true positive and condition positive, respectively) and specificity (true negative rate, Sp = TN/N, where TN and N represent true negative and condition negative, respectively). The focus of this approach is to determine whether febrile subjects can be detected given specific reference temperature thresholds ($T_{thresh}$). The value for $T_{thresh}$ was set to 37.5 °C to define P ($T_{ref} > T_{thresh}$) and N ($T_{ref} < T_{thresh}$) for fever screening [2,27]. We also defined a cutoff temperature ($T_{cut}$) to determine positive or negative results based on $T_{oral}$. Based on the P, N, predicted P ($T_{oral} > T_{cut}$), and predicted N ($T_{oral} < T_{cut}$) for all subjects, TP ($T_{oral} > T_{cut}$ and $T_{ref} > T_{thresh}$) and TN ($T_{oral} < T_{cut}$ and $T_{ref} < T_{thresh}$) were obtained to calculate Se and Sp. At each $T_{cut}$, a pair of Se/Sp values was determined. An ROC curve for each facial temperature location was generated from 1000 $T_{cut}$ values equally spaced between 30 °C and 40 °C. The area under the ROC curve (AUC), an effective and combined measure of Se and Sp, was calculated to provide an aggregate measure of performance, where a maximum AUC of 1 indicates perfect diagnostic performance in differentiating diseased from non-diseased subjects [52,53]. The value of $\sqrt{(1 - Se)^2 + (1 - Sp)^2}$, denoted $d_{SeSp}$, indicates the distance between the coordinate points (1 − Sp, Se) and (0, 1), the perfect 1 − Sp and Se values [52]. The smaller the $d_{SeSp}$ value, the better the performance. The value of $d_{SeSp}$ at $T_{cut} = T_{thresh} = 37.5$ °C was used to evaluate the fever screening performance.
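The Se/Sp, ROC, AUC, and $d_{SeSp}$ calculations described above can be sketched as follows. The function names and example temperatures are hypothetical, and the trapezoidal rule for the AUC is an assumption, since the integration method is not stated in the text.

```python
import math

def se_sp(t_oral, t_ref, t_cut, t_thresh=37.5):
    """Sensitivity and specificity at one cutoff temperature."""
    p = sum(r > t_thresh for r in t_ref)   # condition positive
    n = sum(r < t_thresh for r in t_ref)   # condition negative
    tp = sum(o > t_cut and r > t_thresh for o, r in zip(t_oral, t_ref))
    tn = sum(o < t_cut and r < t_thresh for o, r in zip(t_oral, t_ref))
    return tp / p, tn / n

def roc_auc(t_oral, t_ref, t_thresh=37.5, n_cuts=1000, low=30.0, high=40.0):
    """AUC from equally spaced cutoffs, integrated with the trapezoidal rule."""
    cuts = [low + (high - low) * k / (n_cuts - 1) for k in range(n_cuts)]
    points = []
    for cut in cuts:
        se, sp = se_sp(t_oral, t_ref, cut, t_thresh)
        points.append((1.0 - sp, se))      # (false positive rate, Se)
    points.sort()
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def d_se_sp(se, sp):
    """Distance from (1 - Sp, Se) to the ideal ROC corner (0, 1)."""
    return math.hypot(1.0 - se, 1.0 - sp)

# Made-up example values in deg C, for illustration only.
t_ref = [36.5, 36.8, 37.0, 37.2, 37.8, 38.0, 38.6, 39.0]
t_oral = [36.6, 36.9, 37.6, 37.1, 37.4, 38.1, 38.5, 38.8]
se, sp = se_sp(t_oral, t_ref, t_cut=37.5)
auc = roc_auc(t_oral, t_ref)
print(se, sp, d_se_sp(se, sp), auc)
```

The sketch assumes that the data contain both febrile and non-febrile subjects; otherwise Se or Sp is undefined.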

3. Results

3.1. Regression Methods for Calibration

As mentioned in Section 2.5, the training data (for 612 and 606 subjects with IRT-1 and IRT-2, respectively) were used to determine the relationship between different $T_{skin}$ ($T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, or $T_{max}$) and $T_{ref}$ with different regression methods (constant offset, ordinary linear, quadratic, and Deming). We also implemented weighted linear, binning, and segmented linear regression methods due to the nonuniform distribution of temperatures. While the quadratic method usually showed regression curves (Figure 3) nearly identical to those of the segmented linear regression method, it led to nonmonotonic regression curves in some cases. Therefore, of these two, only the segmented linear regression method is discussed in this paper.
Figure 4 shows regression curves based on the training data. The segmented linear regression curve is omitted from this figure for simplicity. We used different $T_{skin}$ as independent variables (x-axis) and $T_{ref}$ as the dependent variable (y-axis) in all the regression methods. In Section 4.1, we will briefly discuss the methods of using $T_{ref}$ as the independent variable.
The results in Figure 4 indicate that lines for constant offset, ordinary linear, and Deming regression methods exhibit a common point of concurrency in each graph, near $T_{ref} \approx 37$ °C, $T_{FC} \approx 34.5$ °C, $T_{FCmax} \approx 35$ °C, $T_{CEmax} \approx 35.5$ °C, and $T_{max} \approx 35.7$ °C for both IRT-1 and IRT-2. That these lines intersect near a single point is likely because the least squares approach minimizes the sum of squared residuals, which means each data point contributes equally to the sum. Therefore, a temperature interval with more data will have a larger impact on the fitting equation. The location of each point of concurrency is related to the mean temperature offset between the reference value and facial measurements, which was discussed previously [41]. Figure 5 shows the kernel density curves of $T_{ref}$, $T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, and $T_{max}$ for IRT-1 and IRT-2. The curves for both IRTs are very similar, with the peak density for each site matching the corresponding points of concurrency. The Pearson correlation coefficients between $T_{ref}$ and $T_{FC}$/$T_{FCmax}$/$T_{CEmax}$/$T_{max}$ for IRT-1 are 0.53, 0.60, 0.79, and 0.82, respectively. These numbers for IRT-2 are 0.52, 0.57, 0.80, and 0.82.
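A complementary way to see why these three lines share a point is that, in their standard closed forms, each fit passes through the centroid of the training data (the mean facial temperature and the mean reference temperature). The sketch below checks this on synthetic data; the data, the variable names, and the Deming variance ratio of 1 are assumptions for illustration.

```python
import math
import random
from statistics import fmean

# Synthetic training data (deg C), for illustration only.
rng = random.Random(0)
t_ref = [rng.gauss(37.0, 0.5) for _ in range(200)]
t_skin = [r - 1.5 + rng.gauss(0.0, 0.3) for r in t_ref]

mx, my = fmean(t_skin), fmean(t_ref)
sxx = fmean((s - mx) ** 2 for s in t_skin)
syy = fmean((r - my) ** 2 for r in t_ref)
sxy = fmean((s - mx) * (r - my) for s, r in zip(t_skin, t_ref))

# Closed-form (slope, intercept) pairs for the three fits.
offset = fmean(r - s for s, r in zip(t_skin, t_ref))
ols_slope = sxy / sxx
d = syy - sxx  # Deming with an assumed error-variance ratio of 1
deming_slope = (d + math.sqrt(d * d + 4.0 * sxy * sxy)) / (2.0 * sxy)
lines = {
    "constant offset": (1.0, offset),
    "ordinary linear": (ols_slope, my - ols_slope * mx),
    "Deming": (deming_slope, my - deming_slope * mx),
}

# Evaluate every line at the mean facial temperature.
at_centroid = {name: s * mx + c for name, (s, c) in lines.items()}
print(my, at_centroid)
```

Each line returns the mean reference temperature when evaluated at the mean facial temperature, so the lines cross at the centroid, which lies close to the density peaks because most subjects are afebrile.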

3.2. Temperature Measurement Accuracy—Quantitative Analysis

The testing data (for 408 and 404 subjects with IRT-1 and IRT-2, respectively) were used to evaluate temperature measurement accuracy. The calibration curves based on different regression methods were applied to impute $T_{oral}$ from different $T_{skin}$ values ($T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, or $T_{max}$). By comparing the final imputed $T_{oral}$ with $T_{ref}$, temperature measurement accuracy could be evaluated in different ways, as described in Section 2.6.
To calculate clinical bias ($\Delta_{cb}$), clinical bias SD ($\sigma_{\Delta cb}$), and root-mean-square difference ($A_{rms}$), we separated the testing data into three intervals based on $T_{ref}$: $T_{ref}$ < 37 °C, 37 °C ≤ $T_{ref}$ ≤ 38.5 °C, and $T_{ref}$ > 38.5 °C. Since the diagnostic threshold ($T_{thresh}$, the $T_{ref}$ to define condition positive/negative) for fever screening is usually between 37.5 and 38 °C [41], the interval of 37.0–38.5 °C is particularly important. Results for $\Delta_{cb}$, $\sigma_{\Delta cb}$, and $A_{rms}$ were calculated for the entire testing set and each of the three intervals. As described in our previous study (Figure 2 in [41]), we acquired thermal images of each subject in four rounds. During each round of imaging, each IRT acquired three consecutive frames (acquisition time ~0.1 s) that were averaged to reduce noise and form a single thermal image. All analysis in this article was based on the averaged thermal images from the first round of measurements, except for the clinical repeatability ($\sigma_r$) analysis. To calculate $\sigma_r$, the SD of three $T_{oral}$ temperatures based on the averaged thermal images from each of the first three rounds of measurements was calculated for each subject and then pooled based on the ISO 80601-2-56 standard [34].
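The interval stratification and the pooled repeatability can be sketched as below. This assumes that pooling means taking the root mean square of the per-subject SDs, which is the usual pooled SD when every subject has the same number of replicates; the standard itself should be consulted for the exact procedure. Function names and example readings are hypothetical.

```python
import math
from statistics import fmean, variance

def interval_label(t_ref):
    """Assign a reference temperature (deg C) to one of the three intervals."""
    if t_ref < 37.0:
        return "<37"
    if t_ref <= 38.5:
        return "37-38.5"
    return ">38.5"

def pooled_repeatability(readings_per_subject):
    """Pooled SD over subjects, each with the same number of replicates."""
    return math.sqrt(fmean(variance(r) for r in readings_per_subject))

# Made-up imputed oral temperatures (deg C) from three rounds per subject.
rounds = [
    [37.0, 37.1, 37.2],
    [36.5, 36.5, 36.8],
]
sigma_r = pooled_repeatability(rounds)
print(sigma_r)
print([interval_label(t) for t in (36.4, 37.0, 38.5, 38.6)])
```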
Table 5 and Table 6 display key metrics ($\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, and $\sigma_r$) for $T_{CEmax}$- and $T_{max}$-based $T_{oral}$ for IRT-1 and IRT-2, respectively. In these results, the minimum $\Delta_{cb}$, $\sigma_{\Delta cb}$, and $A_{rms}$ values for all subjects and for subjects with $T_{ref}$ < 37 °C generally come from the segmented linear regression method for both IRTs. The smallest $\Delta_{cb}$ values over the range 37 °C ≤ $T_{ref}$ ≤ 38.5 °C are within ±0.1 °C for both IRTs, coming from the constant offset, weighted linear, and binning methods. The related $\sigma_{\Delta cb}$ and $A_{rms}$ values over this range are less than 0.4 °C. The average $\sigma_r$ for both IRTs and all regression methods is 0.14 °C, with minimum and maximum values of 0.07 °C and 0.23 °C. No single regression method achieves the best values for all the metrics and both IRTs. Later, we will demonstrate that temperature measurement accuracy over the range 37 °C ≤ $T_{ref}$ ≤ 38.5 °C is more closely related to diagnostic performance.

3.3. Temperature Measurement Accuracy—Graphical Analysis

Results that characterize variations in IRT temperature measurement accuracy are displayed graphically to elucidate variations across the covered temperature range and the presence of exceptional values or outliers. Scatter and difference plots provide useful tools for these types of analyses.

3.3.1. Scatter Plots

A scatter plot provides a direct qualitative illustration of the clinical accuracy and the underlying variability of the relationship between T o r a l and T r e f . In the plots, we used T r e f as the x-axis and T o r a l imputed from different T s k i n values as the y-axis. Figure 6 shows example scatter plots of T o r a l imputed from T m a x based on the constant offset, weighted linear, binning, and segmented linear regression methods versus T r e f for IRT-1, since these methods show at least one of the best performance metrics in Table 5 and Table 6. Plots for T o r a l imputed from other T s k i n , based on other regression methods, and for IRT-2 are not presented here due to space limitations.
Results in Figure 6 indicate that the segmented method produced the best fit (largest R² value), whereas the binning method produced the trend line that was closest to the ideal T o r a l = T r e f line. Given the highly non-uniform distribution of data, small differences in the slopes of the trend lines do not reflect overall accuracy differences. Two vertical lines at T r e f = 37 °C and 38.5 °C separate the data into three temperature intervals for comparison with Table 5. Data above the ideal line produce a positive Δ c b , and data below it a negative one. A wide data distribution in the vertical direction correlates with a large σ Δ c b . For example, the points in Figure 6c are the most dispersed in the vertical direction although the trend line is close to the ideal line, and the points in Figure 6d are the least dispersed. This indicates that σ Δ c b for the binning method is the largest and σ Δ c b for the segmented linear method is the smallest among the four regression methods, as shown in Table 5. Therefore, the trend line slope and intercept, the data point variability, and the coefficient of determination should be considered together when reading a scatter plot. A direct qualitative view of the clinical accuracy through a scatter plot should be supported by quantitative values of other metrics, such as Δ c b , σ Δ c b , A r m s , σ r , and Se/Sp/ d S e S p .

3.3.2. Difference Plots

A difference plot directly shows the distribution of all the data that are used to calculate Δ c b and σ Δ c b . It can also be used to identify proportional bias. The vertical axis of the plot is the difference between T o r a l and T r e f . The horizontal axis is the average of T o r a l and T r e f . About 95% of the difference values will fall in the range of Δ c b ± 2 σ Δ c b if the values are normally distributed [34]. The difference plots for T o r a l calculated from T m a x based on the constant offset, weighted linear, binning, and segmented linear regression methods for IRT-1 are displayed in Figure 7 as examples. The first impression from Figure 7 is that some plots have an apparent trend (proportional bias), which is also seen in the corresponding scatter plots in Section 3.3.1 and Appendix A. For example, T o r a l and T r e f show strong correlation in Figure 6d, yet T o r a l values tend to be higher than T r e f at lower temperatures and lower than T r e f at higher temperatures. A corresponding trend of proportional bias is seen in Figure 7d. On the other hand, a slight trend might still exist even if two sets of data have a high degree of agreement [54]. For the T m a x -based T o r a l , the segmented linear regression method provides the smallest Δ c b and σ Δ c b , which agrees with Table 5.
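As a rough sketch of the quantities behind such a difference plot, the snippet below computes the plot coordinates, the Δ c b ± 2 σ Δ c b range, and a least-squares slope of the differences against the pair means as a simple indicator of proportional bias. The function name and the use of a fitted slope as the bias indicator are illustrative assumptions, not taken from the study.

```python
from statistics import mean, stdev

def difference_plot_data(t_oral, t_ref):
    """Return coordinates, bias, the +/- 2 SD range, and a trend slope."""
    x = [(o + r) / 2 for o, r in zip(t_oral, t_ref)]  # horizontal axis: pair mean
    y = [o - r for o, r in zip(t_oral, t_ref)]        # vertical axis: difference
    bias, sd = mean(y), stdev(y)
    limits = (bias - 2 * sd, bias + 2 * sd)
    # Least-squares slope of difference vs. mean; a clearly nonzero slope
    # suggests proportional bias.
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - bias) for xi, yi in zip(x, y)) / sxx
    return x, y, bias, limits, slope
```

A negative slope corresponds to the pattern described for Figure 7d, where imputed values sit above the reference at low temperatures and below it at high temperatures.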

3.4. Diagnostic Performance

Variations in the ability of IRT systems to detect febrile subjects were analyzed using the Se/Sp approach based on clinically relevant thresholds. The ROC curves based on T o r a l imputed from each T s k i n under different regression methods were generated (not shown in this paper to reduce space), from which the Se/Sp values for T c u t = T t h r e s h = 37.5 °C were derived and the d S e S p values were calculated. Table 7 shows the Se/Sp and d S e S p values for T C E m a x - and T m a x -based T o r a l with different regression methods. Compared with Table 5 and Table 6, we can see a strong relationship between Δ c b / σ Δ c b / A r m s values in the range of 37 °C ≤ T r e f ≤ 38.5 °C and Se/Sp—the minimum values of Δ c b / σ Δ c b / A r m s are correlated to the minimum values of d S e S p (i.e., the largest Se/Sp combination). The smallest Δ c b / σ Δ c b / A r m s values over the range 37 °C ≤ T r e f ≤ 38.5 °C (Table 5 and Table 6), as well as optimum Se/Sp combinations for T o r a l (Table 7) come from the constant offset, weighted linear, and binning methods. On the other hand, the temperature measurement metrics over the full temperature range are not related to the d S e S p values. Therefore, if an IRT is designed for fever screening, the clinical accuracy in the range of 37–38.5 °C (oral cavity as the reference site) is more important than in other ranges. An IRT with the smallest Δ c b / σ Δ c b / A r m s values within the whole temperature range does not necessarily mean it has the best Se/Sp for fever screening. For example, the Se/Sp values based on the segmented regression method are the worst for T C E m a x - and T m a x -based T o r a l due to the large Δ c b values in the range of 37.0 °C ≤ T r e f ≤ 38.5 °C, although the values of Δ c b , σ Δ c b and A r m s based on this method across the full temperature range are the best.
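A minimal sketch of the Se/Sp calculation at a fixed cutoff is given below. It assumes that d S e S p is the distance from the ROC point (1 − Sp, Se) to the ideal corner (0, 1), which is consistent with the Se/Sp and d S e S p values reported for the NCITs in Section 3.5.1, and that a subject is condition positive when T r e f ≥ T t h r e s h and test positive when T o r a l ≥ T c u t (whether the comparisons are inclusive is an assumption). Function names are hypothetical.

```python
import math

def d_se_sp(se, sp):
    """Distance from the ROC point (1 - Sp, Se) to the ideal corner (0, 1)."""
    return math.hypot(1.0 - se, 1.0 - sp)

def screening_performance(t_oral, t_ref, t_cut=37.5, t_thresh=37.5):
    """Return (Se, Sp, d_SeSp) for paired imputed and reference temperatures."""
    tp = fn = tn = fp = 0
    for oral, ref in zip(t_oral, t_ref):
        febrile = ref >= t_thresh   # condition positive (assumed inclusive)
        flagged = oral >= t_cut     # test positive (assumed inclusive)
        if febrile and flagged:
            tp += 1
        elif febrile:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return se, sp, d_se_sp(se, sp)
```

For instance, `d_se_sp(0.70, 0.85)` rounds to 0.34, matching the value quoted for the sixth NCIT model in Section 3.5.1.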
To further analyze this issue, we defined the optimal cutoff temperature ( T o p . c u t ) as the T c u t that minimizes d S e S p (lengths of green line segments in Figure 8) [52], as obtained from the ROC curve. We also defined the predicted optimal cutoff temperature ( T p . o p . c u t ) as the T c u t imputed based on T t h r e s h and Δ c b in the temperature range of 37.0–38.5 °C, T p . o p . c u t = T t h r e s h + Δ c b . For brevity, we only show the ROC curves based on T o r a l imputed from T m a x and regression methods of constant offset, weighted linear, and segmented linear for IRT-1 in Figure 8. The Se/Sp values for T c u t equal to T o p . c u t , T p . o p . c u t , and T t h r e s h are labeled together in each graph. From Figure 8, the T o p . c u t and T p . o p . c u t values are rather close, with a difference of less than 0.1 °C, except for the segmented linear graph, where the difference is 0.16 °C. The average difference between T o p . c u t and T p . o p . c u t is as small as 0.08 °C. The results indicate that the fever screening performance of an IRT can be optimized by adjusting the T c u t value based on Δ c b in the range of 37 °C ≤ T r e f ≤ 38.5 °C. Figure 8c also illustrates the poor Se values for the segmented linear regression method (Table 7), which result from the large Δ c b in the range of 37 °C ≤ T r e f ≤ 38.5 °C (Table 5).
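The two cutoff definitions can be illustrated with a short sketch. The optimal cutoff is found here by scanning every distinct imputed temperature as a candidate T c u t , which is one simple way to walk the empirical ROC curve and is not necessarily how the curves in Figure 8 were generated. Function names and the inclusive comparisons are assumptions.

```python
import math

def d_se_sp_at(t_oral, t_ref, t_cut, t_thresh):
    """d_SeSp for one cutoff, with T_ref >= t_thresh as condition positive."""
    pos = [o for o, r in zip(t_oral, t_ref) if r >= t_thresh]
    neg = [o for o, r in zip(t_oral, t_ref) if r < t_thresh]
    se = sum(o >= t_cut for o in pos) / len(pos)
    sp = sum(o < t_cut for o in neg) / len(neg)
    return math.hypot(1 - se, 1 - sp)

def optimal_cutoff(t_oral, t_ref, t_thresh=37.5):
    """Return the candidate cutoff with the smallest d_SeSp."""
    # Each distinct imputed temperature is tried as a candidate cutoff.
    candidates = sorted(set(t_oral))
    return min(candidates, key=lambda c: d_se_sp_at(t_oral, t_ref, c, t_thresh))

def predicted_cutoff(t_thresh, bias_37_to_38_5):
    """Predicted optimal cutoff: threshold shifted by the mid-range bias."""
    return t_thresh + bias_37_to_38_5
```

With a uniform positive bias in the imputed temperatures, the scanned optimum moves above T t h r e s h by roughly that bias, which is the behavior the prediction T p . o p . c u t = T t h r e s h + Δ c b relies on.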

3.5. Clinical Accuracy—IRTs Versus NCITs

There have been inconsistent conclusions regarding the clinical accuracy of IRTs versus NCITs. A document from the Centers for Disease Control and Prevention indicates that IRTs are not as accurate as NCITs and may be more difficult to use effectively [55]. However, several scientific studies have reached different conclusions [21,22]. Further discussion of this topic is needed. As described in our previous article [41], the temperature of each subject was measured with two IRTs and six NCITs. A full analysis of the NCIT data is presented elsewhere [45]. Therefore, it is potentially useful to directly compare the clinical data collected by these two different IRTs and six models of NCITs. In addition, IRTs can measure temperature from different facial locations. The measurements from the forehead can be a surrogate for NCIT measurements and thus be used to indirectly compare NCIT and IRT performance.

3.5.1. Direct Performance Comparison

During our clinical study, two different IRTs and six models of NCITs were used to collect temperature data from each subject. The laboratory and clinical accuracy of these six models of NCITs has been analyzed in references [56] and [45] respectively. Laboratory results indicate that five of the six NCIT models did not meet the laboratory acceptance criterion of ±0.3 °C recommended by the ASTM E1965-98:2016 standard [33]. The algorithms used by these NCITs to convert temperature from the measurement site to the reference site (i.e., regression methods for imputing T o r a l from T s k i n ) are unknown.
Clinical NCIT results (Table 2 in [45]) show that mean Δ c b   ± σ Δ c b values for the six models (A, B, C, D, E, F) over the full temperature range were −0.26 ± 0.46 °C, −0.23 ± 0.42 °C, 0.15 ± 0.41 °C, −0.32 ± 0.58 °C, −0.88 ± 0.54 °C, and 0.22 ± 0.46 °C. Depending upon the NCIT model, 48–88% of the temperature measurements were beyond the labeled accuracy, which aligns well with the results from another study [57]. On the other hand, the worst/best Δ c b   ± σ Δ c b values for T m a x -based T o r a l across the full temperature range were −0.09 ± 0.41 °C/−0.03 ± 0.29 °C for IRT-1 and 0.19 ± 0.32 °C/0.01 ± 0.27 °C for IRT-2 (Table 5 and Table 6). These results indicate that the two IRTs have similar accuracy, and both have better bias and precision than the six models of NCITs, even with the worst regression method.
NCIT results (Figure 4 in [45]) also showed that for a T t h r e s h of 37.5 °C, the Se/Sp values for the six models were 0.11/1.00, 0.35/0.99, 0.58/0.97, 0.40/0.98, 0.03/1.00, and 0.70/0.85 respectively, with the d S e S p values being 0.89, 0.65, 0.42, 0.60, 0.97, and 0.34, respectively. On the other hand, the Se/Sp values were 0.89/0.87 and 0.88/0.88 for T C E m a x - and T m a x -based T o r a l measurements by IRT-1 calibrated with the weighted linear regression method, with the related d S e S p values being 0.18 and 0.17, respectively (Table 7). A comparison of these data indicates that IRTs can be more effective than NCITs at screening subjects with EBT.

3.5.2. Indirect Comparison Based on Imaging Results

Given the similarities in physical working mechanism and facial location, IRT data for T o r a l calculated from T F C and T F C m a x (Table A1 and Table A2 for IRT-1 and IRT-2, provided in Appendix A for brevity) may provide a useful surrogate for NCIT measurements. These results were compared with IRT data for T o r a l calculated from T C E m a x and T m a x (Table 5 and Table 6 for IRT-1 and IRT-2). From Table A1 and Table 5, the optimal Δ c b and σ Δ c b values across the full T r e f range for T C E m a x - and T m a x -based T o r a l have minimal differences from the values for T F C - and T F C m a x -based T o r a l . However, these values in the T r e f range of 37–38.5 °C are 0.22 ± 0.35 °C and 0.18 ± 0.34 °C for T F C - and T F C m a x -based T o r a l versus 0.05 ± 0.30 °C and 0.08 ± 0.29 °C for T C E m a x - and T m a x -based T o r a l respectively. Multiple comparisons were performed between the four sets of Δ c b values (noted as A, B, C and D) for T F C -, T F C m a x -, T C E m a x - and T m a x -based T o r a l data using the Tukey Honest Significant Difference method. The results indicate that the forehead measurement site typically used by NCITs tends to provide poorer accuracy than a full-face approach or one that targets the inner canthus (p-values < 0.05 between A/B and C/D). On the other hand, there is no significant difference between A and B or C and D (p-values > 0.05), indicating the full-face and inner canthus approaches have similar optimal Δ c b and σ Δ c b values.
Comparisons of diagnostic performance for EBT detection between these measurement approaches can also be made from data in Table 7, Table A1 and Table A2. The optimal Se/Sp values identified for IRT-1 are 0.67/0.82 or 0.74/0.72 for T F C -based T o r a l , 0.67/0.87 or 0.72/0.78 for T F C m a x -based T o r a l (Table A1), versus 0.89/0.87 for T C E m a x -based T o r a l , and 0.88/0.89 for T m a x -based T o r a l (Table 7). The results for IRT-2 in Table 7 and Table A2 are similar. The optimal d S e S p values identified for both IRTs are between 0.31 and 0.38 for T F C - and T F C m a x -based T o r a l , which are close to the best d S e S p value for the six models of NCITs.
Corresponding scatter plots, difference plots, and ROC curves based on T o r a l calculated from T F C are provided (Figure A1, Figure A2 and Figure A3 in Appendix A) for IRT-1 to mirror the results in Figure 6, Figure 7 and Figure 8 for T o r a l calculated from T m a x . The ROC curves for T F C are significantly lower than the curves for T m a x , which agrees with the Se/Sp values in Table 7 and Table A1 and indicates the potentially low Se/Sp values of NCITs. The scatter plots of T F C -based T o r a l versus T r e f (Figure A1) are more dispersed and their trend lines are further from the ideal line than the graphs for T m a x , indicating larger Δ c b and σ Δ c b for T F C -based T o r a l . Comparisons of difference plots for T F C - and T m a x -based T o r a l lead to the same conclusion.

4. Discussion

Through an extensive clinical study of over 1000 subjects, we have evaluated the clinical accuracy of two IRTs under controlled conditions for temperature measurement. The clinical accuracy of the IRTs has been quantitatively evaluated with different metrics including Δ c b , σ Δ c b , A r m s , σ r , and Se/Sp/ d S e S p . Dividing the data into training and testing sets, we have studied the impact of calibration approaches and methods for establishing diagnostic cutoff temperatures, and elucidated differences in performance between IRTs and NCITs. The results are displayed with scatter plots, difference plots and ROC curves. Overall, these findings provide unique and valuable insights into both the optimization and assessment of IRT-based devices for temperature estimation and fever detection.

4.1. Effects of Regression Methods on the Clinical Accuracy

Our analysis of regression approaches indicated no clear optimal method that can improve all clinical accuracy metrics. A specific regression method tended to provide the best clinical accuracy in terms of a specific metric. When the full range of temperatures in our data was considered, the segmented linear regression provided the smallest A r m s values, the least scatter (and the highest R² value) in Figure 6, and the narrowest difference distribution range in Figure 7. However, when we restricted the temperature range to the diagnostic zone (37 °C ≤ T r e f ≤ 38.5 °C), the constant offset, weighted linear, and binning methods provided the highest Se/Sp and the smallest bias.
To apply different regression methods to find the relation between T s k i n and T r e f , we used T s k i n and T r e f as independent and dependent variables, respectively. In theory, the independent variable should be the one that is more accurate, in our case, T r e f . If T r e f and T s k i n are instead used as the independent and dependent variables, respectively, the fitted function is T s k i n = f( T r e f ), and it must be inverted during evaluation ( T o r a l = f⁻¹( T s k i n )) to convert T s k i n to T o r a l . The inverse operation might introduce extra errors. We applied the inverses of these regression equations to the testing data and calculated the same clinical accuracy metrics as in Table 5, Table 6 and Table 7 (results not included here for brevity). We did not find any improvement in clinical accuracy in terms of these metrics.
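The two regression directions can be contrasted with a small sketch based on ordinary least squares only; the other regression methods in this study are not reproduced here, and the function names are hypothetical. The forward model fits T r e f on T s k i n directly, while the inverse model fits T s k i n on T r e f and then inverts the fitted line.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def forward_model(t_skin, t_ref):
    """T_skin is the independent variable; returns T_oral = g(T_skin)."""
    slope, intercept = fit_line(t_skin, t_ref)
    return lambda skin: slope * skin + intercept

def inverse_model(t_skin, t_ref):
    """T_ref is the independent variable; the fit T_skin = f(T_ref) is inverted."""
    slope, intercept = fit_line(t_ref, t_skin)
    return lambda skin: (skin - intercept) / slope
```

For noise-free linear data the two models coincide. With scatter, the inverted line is steeper than the forward line (their slopes differ by a factor of R²), so the two approaches impute different oral temperatures away from the mean.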

4.2. Metrics and Requirements for Evaluating Clinical Accuracy

Table 5, Table 6 and Table 7 show different clinical accuracy metrics for IRT-1 and IRT-2, including Δ c b , σ Δ c b , A r m s , σ r , and Se/Sp/ d S e S p . While Δ c b and σ Δ c b are recommended in international thermometer standards, they do not necessarily represent the optimal metrics for all applications. One limitation of Δ c b as a performance metric is that it is a mean value that reflects only the systematic bias: large positive and negative local biases may cancel out, producing a small Δ c b value, as if the local biases were small. Therefore, Δ c b and σ Δ c b should always be evaluated together. The metric A r m s is the root-mean-square difference between measured values ( T o r a l ) and reference values ( T r e f ) [51]. Being a single accuracy metric that combines the impact of Δ c b and σ Δ c b , it helps ensure that positive and negative local bias values do not cancel out to give an erroneous impression of strong performance, as can occur with Δ c b . However, A r m s does not indicate whether errors are mainly positive or negative and does not distinguish systematic and random errors. Another metric that was not discussed in this article, mean absolute error (MAE), is similar to A r m s and might also be considered.
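The way A r m s combines Δ c b and σ Δ c b can be written out explicitly. The identity below assumes that σ Δ c b is the sample SD (n − 1 denominator) of the differences d_i = T_oral,i − T_ref,i over n paired measurements; the symbols d_i and n are introduced here for illustration only.

$$
A_{rms}^{2} = \frac{1}{n}\sum_{i=1}^{n} d_i^{2}
            = \Delta_{cb}^{2} + \frac{n-1}{n}\,\sigma_{\Delta cb}^{2}
            \approx \Delta_{cb}^{2} + \sigma_{\Delta cb}^{2} \quad (n \gg 1)
$$

As a check against Table A1, the weighted linear method for T F C over all T r e f has Δ c b = 0.47 °C and σ Δ c b = 0.45 °C, giving √(0.47² + 0.45²) ≈ 0.65 °C, which matches the tabulated A r m s .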
The values of Δ c b , σ Δ c b and A r m s for different temperature ranges might have different significance. If an IRT is designed for fever screening, then values of these metrics within the reference temperature range of 37–38.5 °C are more important than those based on the full temperature range, since they most directly impact diagnostic ability. For such a device, Se/Sp values for common T t h r e s h values (e.g., 37.5 °C or 38 °C) might be stronger performance metrics than Δ c b and σ Δ c b . The area under the ROC curve (AUC) is commonly quoted [41] and may be a better metric for overall performance since it is an aggregate measure of diagnostic capability. The higher the AUC, the greater the potential of an IRT to distinguish subjects with and without EBT. To achieve the full potential of the IRT, the optimal cutoff temperature that gives the smallest d S e S p can be predicted based on T t h r e s h and Δ c b in the temperature range of 37.0–38.5 °C, T p . o p . c u t = T t h r e s h + Δ c b . In practice, users can also raise T c u t to increase Sp at the cost of Se, or lower it to increase Se at the cost of Sp.
Relatively little consensus has been achieved in the establishment of minimum performance requirements for IRTs. Currently, we are only aware of one consensus requirement for IRT laboratory accuracy. The IEC 80601-2-59:2017 standard [30] requires that laboratory error of IRTs be below 0.5 °C in the T s k i n range of 34–39 °C [32]. Performance requirements in thermometer standards may also be adapted for use with IRTs: ISO 80601-2-56:2017 for clinical thermometers [34], ASTM E1112-00:2011 for electronic thermometers [58], and ASTM E1965-98:2016 for infrared thermometers [33]. The maximum permissible errors defined in these standards are listed in Table 8.
None of the aforementioned standards includes clinical accuracy requirements for IRTs or thermometers. The ISO 80601-2-56:2017 standard provides a clinical example where Δ c b   ± σ Δ c b is 0.07 ± 0.22 °C. The text indicates that the Δ c b value is acceptable and the σ Δ c b value could be considered by some to be clinically acceptable, although it is relatively high. The ASTM E1965-98:2016 standard also provides an example of clinical accuracy evaluation results for an infrared thermometer, with Δ c b   ± σ Δ c b values of −0.25 ± 0.35 °C, −0.16 ± 0.18 °C, and 0.11 ± 0.21 °C for age groups of infants, children, and adults, respectively. The standard indicates that the thermometer under test may not be sufficiently accurate for use on infants since errors in temperature measurements may be clinically significant. Nevertheless, these examples do not define clinical accuracy requirements. Based on our study, an IRT can provide a good fever screening performance ( d S e S p   ≤ 0.2) if σ r   ≤ 0.2 °C and its temperature measurement accuracy satisfies these requirements within the temperature range of 37.0–38.5 °C with oral cavity as the reference body site: −0.1 °C ≤ Δ c b ≤ 0.1 °C, σ Δ c b ≤ 0.4 °C, A r m s ≤ 0.4 °C. For our IRTs, these requirements are met for the T C E m a x - and T m a x -based T o r a l data imputed with the weighted linear (for IRT-1 and IRT-2) and constant offset (for IRT-2 only) methods.
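The accuracy conditions proposed above can be expressed as a simple check. This is only an illustrative encoding of the thresholds stated in the text; the function name and argument names are hypothetical, and all inputs except σ r are assumed to be computed within 37.0–38.5 °C with the oral cavity as the reference site.

```python
def meets_proposed_requirements(bias, sd, a_rms, sigma_r):
    """Check mid-range accuracy metrics (deg C) against the proposed limits."""
    return (
        abs(bias) <= 0.1      # -0.1 <= clinical bias <= 0.1
        and sd <= 0.4         # SD of the clinical bias
        and a_rms <= 0.4      # root-mean-square difference
        and sigma_r <= 0.2    # clinical repeatability
    )
```

For example, the weighted linear T F C results for IRT-1 in Table A1 (Δ c b = 0.22, σ Δ c b = 0.35, A r m s = 0.41, σ r = 0.08) fail this check, in line with the weaker screening performance reported for forehead-based measurements.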

4.3. Difference Plot Methods

In Section 3.3.2, we used the mean of T o r a l and T r e f as the horizontal axis of the difference plots, based on the Bland–Altman approach. In theory, the horizontal axis of the plot is determined based on the best estimate of the true values [50]. While we believe T r e f is more accurate than T o r a l , T r e f also presents error with the SD of two measurements being ~0.1 °C. Moreover, there is no consensus in the literature as to the optimal approach for thermographic data analysis. Bland and Altman argued that the difference against the reference measurements will show a relationship between them when none exists [54]. Therefore, they recommended that the mean value be used on the horizontal axis. However, researchers still often use reference values alone as the horizontal axis [50,59,60], believing reference values are the best estimate of the true values. We redrew the difference plots of Figure 7 with T r e f as the horizontal axis, as shown in Figure 9. From the figure, we can see that the trends in Figure 9 are different from the trends in Figure 7. Negative correlation can be seen in Figure 9 as Bland and Altman predicted [54]. However, a significant advantage of one approach over the other is not clearly apparent.
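The negative correlation noted by Bland and Altman follows from a short calculation under a simple error model. Assume, for illustration only, that T_oral = T + e_o and T_ref = T + e_r, where T is the true temperature and e_o, e_r are measurement errors that are independent of T and of each other (these symbols are not used elsewhere in this article). Then:

$$
\operatorname{Cov}\!\left(T_{oral} - T_{ref},\; T_{ref}\right) = -\operatorname{Var}(e_r) < 0
$$

$$
\operatorname{Cov}\!\left(T_{oral} - T_{ref},\; \tfrac{1}{2}\left(T_{oral} + T_{ref}\right)\right) = \tfrac{1}{2}\left[\operatorname{Var}(e_o) - \operatorname{Var}(e_r)\right]
$$

Plotting the differences against T_ref therefore builds in a negative trend whenever the reference has any error, whereas plotting against the mean shows a trend only to the extent that the two error variances differ.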

4.4. Performance Comparison of IRTs and NCITs

IRTs and NCITs represent the primary device types currently used in practice for real-time measurement of EBT during epidemics [17,18,19,29]. They both use passive remote sensing technologies that detect mid- and/or long-wave IR radiation and convert measurements to temperature based on the Stefan–Boltzmann law [61]. NCITs estimate temperature at a reference body site (usually oral) based on radiation from a small region of skin (e.g., forehead) [33], whereas IRTs provide a 2D temperature distribution of the face and may target a specific region (e.g., inner canthi) [30]. FDA has cleared NCITs to independently measure human body temperature, yet no IRT has been cleared for a similar purpose. Current IRTs on the US market are only authorized for emergency use [62]. In several scientific studies, the accuracy of NCITs has been called into question, particularly relative to IRTs [21,22]. Our study provides another angle from which to compare IRTs with NCITs.
Both indirect and direct comparisons of IRTs with NCITs indicate that when designed for optimal performance, the clinical accuracy of IRTs will likely be greater than that of NCITs. The two IRTs have similar accuracy, and both have better bias and precision than the six models of NCITs, even with the worst regression method. One reason for this may be the use of the forehead as the NCIT measurement location. The skin temperature at this location tends to be sensitive to environmental factors such as ambient temperature and airflow, which may degrade correlation with core/oral temperature [23]. The IRTs implemented in the current study also use higher performance electronic components than the typical portable NCIT, and thus are much more expensive. Of course, in order for an IRT to achieve a high degree of clinical accuracy it will need to meet laboratory accuracy requirements [32], have an effective algorithm to convert the measured skin temperature to the temperature at a reference body site (e.g., oral cavity), and be deployed and operated according to established best practices.
In summary, from both temperature measurement accuracy and diagnostic performance standpoints, approaches based on forehead measurements, as with most NCITs, are likely to be inferior to those involving the full face or inner canthus measurements recommended for IRTs.

4.5. Study Challenges and Limitations

While our clinical study provided important insights, it is worth noting some of the key challenges we faced and the limitations to our findings. For example, the distribution of reference temperatures acquired is clearly uneven. Most subjects had oral temperatures of 37.0 ± 0.5 °C and the number of subjects with an EBT was limited. While the temperature distribution across a typical population would likely be somewhat Gaussian, an optimal data set would provide a more uniform distribution of temperatures across the normal through febrile range. However, it was difficult to recruit febrile subjects, which is a common problem for clinical fever screening studies [25]. Our study was initially designed to have a large population (~1000 subjects) in order to accrue a statistically significant sample of febrile subjects, despite a relatively low prevalence. As a result, we were able to obtain a greater number of data sets from febrile subjects than most clinical studies.
Perhaps the most significant caveat to our results is the limited age range of the study population. Overall, 95% of subjects were under 30 years of age. Research on the effect of age on IRT accuracy is limited, yet one paper has shown that the best correlation of IRT temperatures with core temperature is seen in children (aged 3–18 years) [63]. While our study did not include subjects below 18 years old, about half were in the 18–21 range. Therefore, the results in this paper might not represent the accuracy for all age groups. A clinical study for system validation should cover all age groups, dependent on the device application. Since the two sets of data for training and testing were based on the same pool of data and random selection was used to determine the two sets, the performance estimates may be biased (upwards) and not generalizable in the target population [64]. As such, it is likely that our study may represent a best-case scenario.
The subject's circadian rhythm might also affect fever screening performance. For example, different studies have shown that core body temperature in the morning may be 0.3–0.9 °C lower than in the afternoon [13,14,65]. We did not consider circadian rhythm in our analysis, yet additional study of this variable and the need for methods to mitigate its impact in infectious disease screening is warranted [66]. In the future, we intend to provide additional retrospective analysis of our data to assess this potential confounding factor.
To minimize the influence of outside temperature, a 15-min acclimation period was implemented prior to the start of measurements. However, oral temperature might still be affected by smoking or ingestion of cold or hot food or beverage during this time [67]. To mitigate this potential confounder, we excluded data sets for which the difference between the two oral temperature readings was greater than 0.5 °C as well as those where only one oral temperature reading was recorded. These exclusions amounted to 56 subjects. Such checks on data quality are useful for ensuring the validity of clinical IRT data [49].
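The exclusion rule described above can be sketched as a simple filter. The data layout (a list of oral readings per subject) and the function name are hypothetical and chosen only for illustration.

```python
def keep_subject(oral_readings, max_diff=0.5):
    """Keep a subject only if two oral readings exist and agree within max_diff."""
    if len(oral_readings) < 2:
        return False  # only one oral reading was recorded
    # Exclude when the two readings differ by more than max_diff (deg C).
    return abs(oral_readings[0] - oral_readings[1]) <= max_diff
```

Applying `keep_subject` to each record before splitting the data into training and testing sets would remove the kinds of records excluded in this study.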

5. Conclusions

Overall, our large-scale clinical study has generated unique and highly valuable quantitative information on fever-screening IRT performance and helped to identify potential best practices for the calibration and evaluation of IRT clinical accuracy. Current findings on IRT diagnostic performance were generally consistent with our prior analysis of results from 500 subjects, indicating IRTs have a strong potential for achieving high sensitivity and specificity in the detection of EBT. Algorithms used to impute oral cavity temperature based on skin temperature are critical for accurate clinical measurement. A simple offset approach may be effective in many situations, but when calibration data sets involve a high proportion of normal-range temperatures, then methods that account for this uneven distribution have key advantages. While metrics recommended in standards provide useful insights into IRT performance, implementing additional approaches like Arms to assess temperature measurement accuracy and Se/Sp for clinical diagnostic accuracy may be beneficial. Moreover, temperature measurement accuracy within a temperature window near the diagnostic threshold for fever may be more important for evaluating fever screening IRTs than accuracy within a full temperature range.
Direct and indirect comparisons of our custom IRT systems with commercial NCITs showed that the IRT systems were more accurate and provided greater diagnostic efficacy. Our results indicate that this is due at least partly to the fact that IRTs measure temperature from a more thermally stable facial location provided by a large number of pixels (e.g., 320 × 240 pixels). The superior capability of IRTs may enable the detection of lower grade and/or earlier stage fevers. Compared with NCITs, IRTs might be a better choice for fever screening in high-traffic areas or higher-risk locations where the higher cost could be justified by greater effectiveness. Furthermore, an IRT operator is not required to be in physical proximity to the subject (e.g., the distance between subject and IRTs was 0.6–0.8 m in this study). Indeed, the operator could even be in a different area or room, or a completely automated approach could be implemented, thus reducing the risk of infection. Another advantage of IRTs is their ability to provide temperature data from a range of facial locations, such as the inner canthi for fever detection [41]. Spatial variations in facial temperature can also be related to certain diseases (e.g., skin inflammatory conditions, breast cancer, systemic inflammatory diseases, septic shock, and the healing potential of wounds) [68]. Finally, it should be noted that additional study of our clinical results will be needed to elucidate additional confounding factors.

Author Contributions

Conceptualization and funding acquisition, Q.W., T.J.P. and J.P.C.; methodology, Q.W., T.J.P., J.P.C., D.M., P.G. and Y.Z.; software, Q.W., P.G. and Y.Z.; investigation, Q.W., D.M. and P.G.; data curation, Q.W. and P.G.; formal analysis, Q.W. and Y.Z.; resources, supervision, and project administration, Q.W. and D.M.; writing—original draft preparation, Q.W.; writing—review and editing, Q.W., T.J.P., Y.Z., P.G., J.P.C. and D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the U.S. Food and Drug Administration’s Medical Countermeasures Initiative (MCMi) Regulatory Science Program (Fund# 16ECDRH407).

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by both FDA and UMD Institutional Review Boards under FDA IRB study #16-011R.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Acknowledgments

This project was supported in part by an appointment to the Research Participation Program at the Center for Devices and Radiological Health, U.S. Food and Drug Administration, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and FDA. The authors gratefully acknowledge the University Health Center of the University of Maryland at College Park for their outstanding collaboration with the research team during the clinical study; Feiming Chen for his valuable advice on statistical analysis; Stacey Sullivan, Jean Rinaldi, Prasanna Hariharan, and Oleg Vesnovsky for helpful discussions regarding the comparison between IRT and NCIT devices.

Conflicts of Interest

The authors declare no conflict of interest.

Disclaimer

The mention of commercial products, their sources, or their use in connection with material reported herein is not to be construed as either an actual or implied endorsement of such products by the Department of Health and Human Services. This article reflects the views of the authors and should not be construed to represent FDA’s views or policies. The authors declare that they have no competing interests.

Appendix A. Additional Data for T o r a l Based on Forehead Temperatures

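Table A1. Clinical accuracy of IRT-1 for $T_{oral}$ based on $T_{FC}$ and $T_{FCmax}$: $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, $\sigma_r$, Se/Sp, and $d_{SeSp}$.

| $T_{ref}$ range | Metric | $T_{FC}$: Offset | $T_{FC}$: Ordinary | $T_{FC}$: Weighted | $T_{FC}$: Segmented | $T_{FC}$: Deming | $T_{FC}$: Binning | $T_{FCmax}$: Offset | $T_{FCmax}$: Ordinary | $T_{FCmax}$: Weighted | $T_{FCmax}$: Segmented | $T_{FCmax}$: Deming | $T_{FCmax}$: Binning |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All $T_{ref}$ | $\Delta_{cb}$ | −0.02 | −0.02 | 0.47 | −0.04 | −0.02 | 0.10 | −0.03 | −0.02 | 0.41 | −0.04 | −0.03 | 0.07 |
| All $T_{ref}$ | $\sigma_{\Delta cb}$ | 0.67 | 0.45 | 0.45 | 0.39 | 0.51 | 0.75 | 0.55 | 0.43 | 0.44 | 0.37 | 0.48 | 0.66 |
| All $T_{ref}$ | $A_{rms}$ | 0.66 | 0.45 | 0.65 | 0.39 | 0.51 | 0.75 | 0.55 | 0.43 | 0.60 | 0.37 | 0.48 | 0.66 |
| $T_{ref}$ < 37 °C | $\Delta_{cb}$ | 0.12 | 0.21 | 0.70 | 0.18 | 0.17 | 0.21 | 0.10 | 0.19 | 0.61 | 0.16 | 0.14 | 0.17 |
| $T_{ref}$ < 37 °C | $\sigma_{\Delta cb}$ | 0.63 | 0.26 | 0.28 | 0.22 | 0.42 | 0.72 | 0.52 | 0.27 | 0.31 | 0.21 | 0.41 | 0.64 |
| $T_{ref}$ < 37 °C | $A_{rms}$ | 0.64 | 0.33 | 0.75 | 0.28 | 0.45 | 0.75 | 0.53 | 0.33 | 0.68 | 0.27 | 0.43 | 0.66 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\Delta_{cb}$ | −0.18 | −0.28 | 0.22 | −0.31 | −0.24 | −0.04 | −0.19 | −0.27 | 0.18 | −0.29 | −0.22 | −0.05 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\sigma_{\Delta cb}$ | 0.66 | 0.34 | 0.35 | 0.30 | 0.46 | 0.75 | 0.53 | 0.32 | 0.34 | 0.31 | 0.42 | 0.65 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $A_{rms}$ | 0.68 | 0.44 | 0.41 | 0.43 | 0.51 | 0.75 | 0.56 | 0.42 | 0.39 | 0.42 | 0.48 | 0.65 |
| $T_{ref}$ > 38.5 °C | $\Delta_{cb}$ | −0.87 | −1.71 | −1.15 | −1.10 | −1.32 | −0.56 | −0.96 | −1.61 | −1.06 | −0.93 | −1.23 | −0.57 |
| $T_{ref}$ > 38.5 °C | $\sigma_{\Delta cb}$ | 0.46 | 0.38 | 0.36 | 0.74 | 0.32 | 0.56 | 0.43 | 0.34 | 0.32 | 0.64 | 0.33 | 0.58 |
| $T_{ref}$ > 38.5 °C | $A_{rms}$ | 0.97 | 1.75 | 1.20 | 1.30 | 1.35 | 0.77 | 1.04 | 1.64 | 1.10 | 1.11 | 1.27 | 0.79 |
| | $\sigma_r$ | 0.20 | 0.07 | 0.08 | 0.08 | 0.13 | 0.23 | 0.18 | 0.08 | 0.10 | 0.08 | 0.14 | 0.22 |
| | Se | 0.67 | 0.14 | 0.88 | 0.35 | 0.58 | 0.74 | 0.67 | 0.33 | 0.86 | 0.42 | 0.58 | 0.72 |
| | Sp | 0.82 | 1.00 | 0.48 | 0.99 | 0.92 | 0.72 | 0.87 | 1.00 | 0.62 | 0.99 | 0.94 | 0.78 |
| | $d_{SeSp}$ | 0.37 | 0.86 | 0.53 | 0.65 | 0.43 | 0.38 | 0.35 | 0.67 | 0.41 | 0.58 | 0.42 | 0.36 |

Note: The bold font shows the best results (i.e., minimum values of $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, $\sigma_r$, and $d_{SeSp}$). The green font indicates correlation between $\Delta_{cb}$ in the temperature range of 37.0–38.5 °C and $d_{SeSp}$.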
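Table A2. Clinical accuracy of IRT-2 for $T_{oral}$ based on $T_{FC}$ and $T_{FCmax}$: $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, $\sigma_r$, Se/Sp, and $d_{SeSp}$.

| $T_{ref}$ range | Metric | $T_{FC}$: Offset | $T_{FC}$: Ordinary | $T_{FC}$: Weighted | $T_{FC}$: Segmented | $T_{FC}$: Deming | $T_{FC}$: Binning | $T_{FCmax}$: Offset | $T_{FCmax}$: Ordinary | $T_{FCmax}$: Weighted | $T_{FCmax}$: Segmented | $T_{FCmax}$: Deming | $T_{FCmax}$: Binning |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All $T_{ref}$ | $\Delta_{cb}$ | 0.14 | 0.07 | 0.58 | 0.04 | 0.10 | 0.19 | 0.06 | 0.05 | 0.56 | 0.02 | 0.05 | 0.07 |
| All $T_{ref}$ | $\sigma_{\Delta cb}$ | 0.70 | 0.43 | 0.43 | 0.40 | 0.48 | 0.73 | 0.57 | 0.40 | 0.40 | 0.36 | 0.44 | 0.63 |
| All $T_{ref}$ | $A_{rms}$ | 0.71 | 0.43 | 0.72 | 0.40 | 0.49 | 0.76 | 0.58 | 0.40 | 0.69 | 0.36 | 0.44 | 0.63 |
| $T_{ref}$ < 37 °C | $\Delta_{cb}$ | 0.23 | 0.27 | 0.77 | 0.22 | 0.25 | 0.26 | 0.14 | 0.24 | 0.75 | 0.18 | 0.19 | 0.13 |
| $T_{ref}$ < 37 °C | $\sigma_{\Delta cb}$ | 0.65 | 0.25 | 0.27 | 0.23 | 0.39 | 0.68 | 0.53 | 0.24 | 0.24 | 0.22 | 0.36 | 0.59 |
| $T_{ref}$ < 37 °C | $A_{rms}$ | 0.69 | 0.37 | 0.82 | 0.31 | 0.46 | 0.73 | 0.55 | 0.34 | 0.79 | 0.28 | 0.40 | 0.60 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\Delta_{cb}$ | 0.01 | −0.21 | 0.31 | −0.24 | −0.12 | 0.07 | −0.06 | −0.21 | 0.31 | −0.24 | −0.14 | −0.03 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\sigma_{\Delta cb}$ | 0.76 | 0.37 | 0.38 | 0.37 | 0.49 | 0.80 | 0.63 | 0.35 | 0.35 | 0.35 | 0.45 | 0.69 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $A_{rms}$ | 0.76 | 0.42 | 0.49 | 0.44 | 0.50 | 0.80 | 0.63 | 0.41 | 0.47 | 0.43 | 0.47 | 0.69 |
| $T_{ref}$ > 38.5 °C | $\Delta_{cb}$ | 0.06 | −1.33 | −0.75 | −0.23 | −0.79 | 0.21 | −0.04 | −1.22 | −0.71 | −0.34 | −0.69 | 0.17 |
| $T_{ref}$ > 38.5 °C | $\sigma_{\Delta cb}$ | 0.71 | 0.19 | 0.20 | 1.24 | 0.34 | 0.76 | 0.51 | 0.15 | 0.15 | 0.82 | 0.26 | 0.59 |
| $T_{ref}$ > 38.5 °C | $A_{rms}$ | 0.67 | 1.35 | 0.77 | 1.18 | 0.85 | 0.74 | 0.48 | 1.23 | 0.72 | 0.84 | 0.73 | 0.58 |
| | $\sigma_r$ | 0.22 | 0.06 | 0.07 | 0.08 | 0.13 | 0.23 | 0.20 | 0.07 | 0.07 | 0.08 | 0.13 | 0.22 |
| | Se | 0.70 | 0.26 | 0.86 | 0.33 | 0.60 | 0.74 | 0.74 | 0.35 | 0.88 | 0.37 | 0.60 | 0.74 |
| | Sp | 0.75 | 1.00 | 0.35 | 0.99 | 0.90 | 0.73 | 0.84 | 0.99 | 0.40 | 0.99 | 0.93 | 0.82 |
| | $d_{SeSp}$ | 0.39 | 0.74 | 0.67 | 0.67 | 0.41 | 0.37 | 0.30 | 0.65 | 0.61 | 0.63 | 0.40 | 0.31 |

Note: The bold font shows the best results (i.e., minimum values of $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, $\sigma_r$, and $d_{SeSp}$). The green font indicates correlation between $\Delta_{cb}$ in the temperature range of 37.0–38.5 °C and $d_{SeSp}$.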
Figure A1. Scatter plots of $T_{oral}$ imputed from $T_{FC}$ based on different regression methods versus $T_{ref}$ for IRT-1. (Dashed lines: trend lines of $T_{oral}$ against $T_{ref}$; solid lines: ideal trend lines of $T_{oral} = T_{ref}$.)
Figure A2. The temperature difference between $T_{FC}$-based $T_{oral}$ and $T_{ref}$ versus their average for IRT-1 in the entire temperature range. (Solid lines: lines of zero difference; dashed lines: lines of difference equal to $\Delta_{cb} + 2\sigma_{\Delta cb}$, $\Delta_{cb}$, and $\Delta_{cb} - 2\sigma_{\Delta cb}$, respectively.)
Figure A3. The ROC curves based on $T_{oral}$ imputed from $T_{FC}$ and the constant offset, weighted linear, and segmented linear regression methods for IRT-1. The triangle, circle, and square markers on the curves show the Se/Sp values when $T_{cut}$ equals $T_{op.cut}$, $T_{p.op.cut}$, and $T_{thresh}$, respectively.

References

  1. Chiu, W.; Lin, P.; Chiou, H.; Lee, W.; Lee, C.; Yang, Y.; Lee, H.; Hsieh, M.; Hu, C.; Ho, Y. Infrared thermography to mass-screen suspected SARS patients with fever. Asia-Pac. J. Public Health 2005, 17, 26–28.
  2. Nishiura, H.; Kamiya, K. Fever screening during the influenza (H1N1-2009) pandemic at Narita International Airport, Japan. BMC Infect. Dis. 2011, 11, 111.
  3. Shi, H.; Han, X.; Jiang, N.; Cao, Y.; Alwalid, O.; Gu, J.; Fan, Y.; Zheng, C. Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: A descriptive study. Lancet Infect. Dis. 2020, 20, 425–434.
  4. Yang, X.; Yu, Y.; Xu, J.; Shu, H.; Liu, H.; Wu, Y.; Zhang, L.; Yu, Z.; Fang, M.; Yu, T. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: A single-centered, retrospective, observational study. Lancet Respir. Med. 2020, 8, 475–481.
  5. Goeijenbier, M.; Van Kampen, J.; Reusken, C.; Koopmans, M.; Van Gorp, E. Ebola virus disease: A review on epidemiology, symptoms, treatment and pathogenesis. Neth. J. Med. 2014, 72, 442–448.
  6. Huang, C.; Wang, Y.; Li, X.; Ren, L.; Zhao, J.; Hu, Y.; Zhang, L.; Fan, G.; Xu, J.; Gu, X. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395, 497–506.
  7. Schuchat, A.; Covid, C.; Team, R. Public health response to the initiation and spread of pandemic COVID-19 in the United States, February 24–April 21, 2020. Morb. Mortal. Weekly Rep. 2020, 69, 551.
  8. Widmaier, E.P.; Raff, H.; Strang, K.T. Regulation of Organic Metabolism and Energy Balance-Section B: Regulation of Total-Body Energy Balance and Temperature. In Vander’s Human Physiology; McGraw-Hill: New York, NY, USA, 2008; pp. 583–596.
  9. Lu, S.-H.; Dai, Y.-T. Normal body temperature and the effects of age, sex, ambient temperature and body mass index on normal oral temperature: A prospective, comparative study. Int. J. Nurs. Stud. 2009, 46, 661–668.
  10. Kessel, L.; Johnson, L.; Arvidsson, H.; Larsen, M. The relationship between body and ambient temperature and corneal temperature. Investig. Ophthalmol. Vis. Sci. 2010, 51, 6593–6597.
  11. Reilly, T.; Brooks, G. Exercise and the circadian variation in body temperature measures. Int. J. Sports Med. 1986, 7, 358–362.
  12. Landsberg, L.; Young, J.B.; Leonard, W.R.; Linsenmeier, R.A.; Turek, F.W. Is obesity associated with lower body temperatures? Core temperature: A forgotten variable in energy balance. Metabolism 2009, 58, 871–876.
  13. Bailey, S.L.; Heitkemper, M.M. Circadian rhythmicity of cortisol and body temperature: Morningness-eveningness effects. Chronobiol. Int. 2001, 18, 249–261.
  14. Conroy, D.A.; Spielman, A.J.; Scott, R.Q. Daily rhythm of cerebral blood flow velocity. J. Circadian Rhythm. 2005, 3, 3.
  15. Blatteis, C.M. Age-dependent changes in temperature regulation–A mini review. Gerontology 2012, 58, 289–295.
  16. Moghissi, K.S.; Syner, F.N.; Evans, T.N. A composite picture of the menstrual cycle. Am. J. Obstet. Gynecol. 1972, 114, 405–418.
  17. Chiappini, E.; Sollai, S.; Longhi, R.; Morandini, L.; Laghi, A.; Osio, C.E.; Persiani, M.; Lonati, S.; Picchi, R.; Bonsignori, F. Performance of non-contact infrared thermometer for detecting febrile children in hospital and ambulatory settings. J. Clin. Nurs. 2011, 20, 1311–1318.
  18. Teran, C.; Torrez-Llanos, J.; Teran-Miranda, T.; Balderrama, C.; Shah, N.; Villarroel, P. Clinical accuracy of a non-contact infrared skin thermometer in paediatric practice. Child Care Health Dev. 2011, 38, 471–476.
  19. Ng, E.Y.K.; Acharya, R.U. Remote-sensing infrared thermography. IEEE Eng. Med. Biol. Mag. 2009, 28, 76–83.
  20. Bitar, D.; Goubar, A.; Desenclos, J. International travels and fever screening during epidemics: A literature review on the effectiveness and potential use of non-contact infrared thermometers. Eurosurveillance 2009, 14, 19115.
  21. Selent, M.U.; Molinari, N.M.; Baxter, A.; Nguyen, A.V.; Siegelson, H.; Brown, C.M.; Plummer, A.; Higgins, A.; Podolsky, S.; Spandorfer, P.; et al. Mass screening for fever in children: A comparison of 3 infrared thermal detection systems. Pediatr. Emerg. Care 2013, 29, 305–313.
  22. Tay, M.; Low, Y.; Zhao, X.; Cook, A.; Lee, V. Comparison of Infrared Thermal Detection Systems for mass fever screening in a tropical healthcare setting. Public Health 2015, 129, 1471–1478.
  23. Liu, C.-C.; Chang, R.-E.; Chang, W.-C. Limitations of forehead infrared body temperature detection for fever screening for severe acute respiratory syndrome. Infect. Control Hosp. Epidemiol. 2004, 25, 1109–1111.
  24. Nguyen, A.V.; Cohen, N.J.; Lipman, H.; Brown, C.M.; Molinari, N.A.; Jackson, W.L.; Kirking, H.; Szymanowski, P.; Wilson, T.W.; Salhi, B.A.; et al. Comparison of 3 infrared thermal detection systems and self-report for mass fever screening. Emerg. Infect. Dis. 2010, 16, 1710–1717.
  25. Chan, L.; Lo, J.L.; Kumana, C.R.; Cheung, B.M. Utility of infrared thermography for screening febrile subjects. Hong Kong Med. J. 2013, 19, 109–115.
  26. Hewlett, A.L.; Kalil, A.C.; Strum, R.A.; Zeger, W.G.; Smith, P.W. Evaluation of an infrared thermal detection system for fever recognition during the H1N1 influenza pandemic. Infect. Control Hosp. Epidemiol. 2011, 32, 504–506.
  27. Priest, P.C.; Duncan, A.R.; Jennings, L.C.; Baker, M.G. Thermal Image Scanning for Influenza Border Screening: Results of an Airport Screening Study. PLoS ONE 2011, 6, e14490.
  28. Cho, K.S.; Yoon, J. Fever screening and detection of febrile arrivals at an international airport in Korea: Association among self-reported fever, infrared thermal camera scanning, and tympanic temperature. Epidemiol. Health 2014, 36, e2014004.
  29. Mouchtouri, V.A.; Christoforidou, E.P.; Lemos, C.M.; Fanos, M.; Rexroth, U.; Grote, U.; Belfroid, E.; Swaan, C.; Hadjichristodoulou, C. Exit and entry screening practices for infectious diseases among travelers at points of entry: Looking for evidence on public health impact. Int. J. Env. Res. Public Health 2019, 16, 4638.
  30. IEC & ISO. IEC 80601-2-59: Medical Electrical Equipment-Part 2-59: Particular Requirements for the Basic Safety and Essential Performance of Screening Thermographs for Human Febrile Temperature Screening; International Electrotechnical Commission, International Organization for Standardization: Geneva, Switzerland, 2017.
  31. ISO. ISO/TR 13154: Medical Electrical Equipment—Deployment, Implementation and Operational Guidelines for Identifying Febrile Humans Using a Screening Thermograph; International Organization for Standardization: Geneva, Switzerland, 2017.
  32. Ghassemi, P.; Pfefer, T.J.; Casamento, J.P.; Simpson, R.; Wang, Q. Best practices for standardized performance testing of infrared thermographs intended for fever screening. PLoS ONE 2018, 13, e0203302.
  33. ASTM. ASTM E1965-98: Standard Specification for Infrared Thermometers for Intermittent Determination of Patient Temperature; ASTM Committee E20 on Temperature Measurement: West Conshohocken, PA, USA, 2016; p. 19428.
  34. ISO. ISO 80601-2-56: Medical Electrical Equipment-Part 2-56: Particular Requirements for Basic Safety and Essential Performance of Clinical Thermometers for Body Temperature Measurement; International Organization for Standardization: Geneva, Switzerland, 2017.
  35. Brengelmann, G. Dilemma of body temperature measurement. In Man in Stressful Environments: Thermal and Work Physiology; Shiraki, K., Yousef, M., Eds.; Charles C. Thomas: Springfield, IL, USA, 1987; pp. 5–22.
  36. Moran, D.S.; Mendal, L. Core temperature measurement. Sports Med. 2002, 32, 879–885.
  37. Yetman, R.J.; Coody, D.K.; West, M.S.; Montgomery, D.; Brown, M. Comparison of temperature measurements by an aural infrared thermometer with measurements by traditional rectal and axillary techniques. J. Pediatr. 1993, 122, 769–773.
  38. Doezema, D.; Lunt, M.; Tandberg, D. Cerumen occlusion lowers infrared tympanic membrane temperature measurement. Acad. Emerg. Med. 1995, 2, 17–19.
  39. Mairiaux, P.; Sagot, J.; Candas, V. Oral temperature as an index of core temperature during heat transients. Eur. J. Appl. Physiol. Occup. Physiol. 1983, 50, 331–341.
  40. Geneva, I.I.; Cuzzo, B.; Fazili, T.; Javaid, W. Normal body temperature: A systematic review. Open Forum Infect. Dis. 2019, 6, ofz032.
  41. Zhou, Y.; Ghassemi, P.; Chen, M.; McBride, D.; Casamento, J.P.; Pfefer, T.J.; Wang, Q. Clinical evaluation of fever-screening thermography: Impact of consensus guidelines and facial measurement location. J. Biomed. Opt. 2020, 25, 097002.
  42. Purslow, C. Clinical Implications for Thermography in the Eye World: A short History of Clinical Ocular Thermography. In Image Modeling of the Human Eye; Acharya, U.R., Ng, Y.K.E., Suri, J.S., Eds.; Artech House: New York, NY, USA, 2008; pp. 301–315.
  43. Steketee, J. Spectral emissivity of skin and pericardium. Phys. Med. Biol. 1973, 18, 686.
  44. Tkáčová, M.; Živčák, J.; Foffová, P. A Reference for Human Eye Surface Temperature Measurements in Diagnostic Process of Ophthalmologic Diseases. In Proceedings of the Measurement 2011, Smolenice, Slovakia, 27–30 April 2011; pp. 406–409.
  45. Sullivan, S.J.L.; Rinaldi, J.E.; Hariharan, P.; Casamento, J.P.; Baek, S.; Seay, N.; Vesnovsky, O.; Topoleski, L.D.T. Clinical Evaluation of Non-Contact Infrared Thermometers. Res. Sq. 2021, 11, 22079.
  46. Chenna, Y.N.D.; Ghassemi, P.; Pfefer, T.J.; Casamento, J.; Wang, Q. Free-form deformation approach for registration of visible and infrared facial images in fever screening. Sensors 2018, 18, 125.
  47. Ng, D.K.; Chan, C.-H.; Chow, P.-Y.; Kwok, K.-L. Infrared ear thermometry. Br. J. Gen. Pract. 2004, 54, 869.
  48. Mercer, J.B.; Ring, E.F.J. Fever screening and infrared thermal imaging: Concerns and guidelines. Thermol. Int. 2009, 19, 67–69.
  49. Del Bene, V.E. Temperature. In Clinical Methods: The History, Physical, and Laboratory Examinations; Walker, H.K., Hall, W.D.H., Hurst, J.W., Eds.; Butterworth Publishers, a Division of Reed Publishing: Boston, MA, USA, 1990; pp. 990–993.
  50. Clinical and Laboratory Standards Institute. EP09c: Measurement Procedure Comparison and Bias Estimation Using Patient Samples; Clinical and Laboratory Standards Institute: Wayne, PA, USA, 2018.
  51. ISO. ISO 80601-2-61: Medical Electrical Equipment—Part 2-61: Particular Requirements for Basic Safety and Essential Performance of Pulse Oximeter Equipment; International Organization for Standardization: Geneva, Switzerland, 2017.
  52. Kumar, R.; Indrayan, A. Receiver operating characteristic (ROC) curve for medical researchers. Indian Pediatr. 2011, 48, 277–287.
  53. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
  54. Bland, J.M.; Altman, D.G. Comparing methods of measurement: Why plotting difference against standard method is misleading. Lancet 1995, 346, 1085–1087.
  55. Centers for Disease Control and Prevention. Non-Contact Temperature Measurement Devices: Considerations for Use in Port of Entry Screening Activities; Centers for Disease Control and Prevention: Atlanta, GA, USA, 2014.
  56. Sullivan, S.J.; Seay, N.; Zhu, L.; Rinaldi, J.E.; Hariharan, P.; Vesnovsky, O.; Topoleski, L.T. Performance characterization of non-contact infrared thermometers (NCITs) for forehead temperature measurement. Med. Eng. Phys. 2021, 93, 93–99.
  57. Fletcher, T.; Whittam, A.; Simpson, R.; Machin, G. Comparison of non-contact infrared skin thermometers. J. Med. Eng. Technol. 2018, 42, 65–71.
  58. ASTM. ASTM E1112-00: Standard Specification for Electronic Thermometer for Intermittent Determination of Patient Temperature; ASTM Committee F04 on Medical and Surgical Materials and Devices: West Conshohocken, PA, USA, 2011; p. 19428.
  59. Giavarina, D. Understanding Bland Altman analysis. Biochem. Med. 2015, 25, 141–151.
  60. Krouwer, J.S. Why Bland–Altman plots should use X, not (Y + X)/2 when X is a reference method. Stat. Med. 2008, 27, 778–780.
  61. Usamentiaga, R.; Venegas, P.; Guerediaga, J.; Vega, L.; Molleda, J.; Bulnes, F.G. Infrared thermography for temperature measurement and non-destructive testing. Sensors 2014, 14, 12305–12348.
  62. FDA. Enforcement Policy for Telethermographic Systems During the Coronavirus Disease 2019 (COVID-19) Public Health Emergency; FDA: Silver Spring, MD, USA, 2020. Available online: https://www.fda.gov/media/137079/download (accessed on 29 October 2021).
  63. Cheung, B.; Chan, L.; Lauder, I.; Kumana, C. Detection of body temperature with infrared thermography: Accuracy in detection of fever. Hong Kong Med. J. 2012, 18 (Suppl. 3), 31–34.
  64. Simpson, R.; Machin, G.; McEvoy, H.; Rusby, R. Traceability and calibration in temperature measurement: A clinical necessity. J. Med. Eng. Technol. 2006, 30, 212–217.
  65. Charles, A.C.; Janet, C.Z.; Joseph, M.R.; Martin, C.M.-E.; Elliot, D.W. Timing of REM sleep is coupled to the circadian rhythm of body temperature in man. Sleep 1980, 2, 329–346.
  66. Harding, C.; Pompei, F.; Bordonaro, S.F.; McGillicuddy, D.C.; Burmistrov, D.; Sanchez, L.D. Fevers Are Rarest in the Morning: Could We Be Missing Infectious Disease Cases by Screening for Fever Then? medRxiv 2020.
  67. Denoble, A.E.; Hall, N.; Pieper, C.F.; Kraus, V.B. Patellar skin surface temperature by thermography reflects knee osteoarthritis severity. Clin. Med. Insights. Arthritis Musculoskelet. Disord. 2010, 3, 69.
  68. Martinez-Jimenez, M.A.; Loza-Gonzalez, V.M.; Kolosovas-Machuca, E.S.; Yanes-Lane, M.E.; Ramirez-GarciaLuna, A.S.; Ramirez-GarciaLuna, J.L. Diagnostic accuracy of infrared thermal imaging for detecting COVID-19 infection in minimally symptomatic patients. Eur. J. Clin. Investig. 2020, 51, e13474.
Figure 1. Ambient temperature and relative humidity histogram during the clinical study. (The range between the two vertical lines indicates the ideal ambient temperature/humidity based on ISO/TR 13154:2017.)
Figure 2. Delineated facial regions and critical points on thermal images [41].
Figure 3. Examples of quadratic and segmented regression methods with $T_{max}$ and $T_{ref}$ as independent and dependent variables, respectively, for IRT-1 and IRT-2.
Figure 4. Different linear regression methods with $T_{skin}$ ($T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, $T_{max}$) as independent variables and $T_{ref}$ as the dependent variable for IRT-1 and IRT-2.
Figure 5. Kernel density curves to estimate the probability density functions of $T_{ref}$, $T_{FC}$, $T_{FCmax}$, $T_{CEmax}$, and $T_{max}$.
Figure 6. Scatter plots of $T_{oral}$ imputed from $T_{max}$ based on different regression methods versus $T_{ref}$ for IRT-1. (Dashed lines: trend lines of $T_{oral}$ versus $T_{ref}$; solid lines: ideal trend lines of $T_{oral} = T_{ref}$.)
Figure 7. The temperature difference between $T_{max}$-based $T_{oral}$ and $T_{ref}$ versus their average for IRT-1 in the entire temperature range. (Solid lines: lines of zero difference; dashed lines: lines of difference equal to $\Delta_{cb} + 2\sigma_{\Delta cb}$, $\Delta_{cb}$, and $\Delta_{cb} - 2\sigma_{\Delta cb}$, respectively.)
Figure 8. The ROC curves based on $T_{oral}$ imputed from $T_{max}$ and the constant offset, weighted linear, and segmented linear regression methods for IRT-1. The triangle, circle, and square markers on the curves show the Se/Sp values when $T_{cut}$ equals $T_{op.cut}$, $T_{p.op.cut}$, and $T_{thresh}$, respectively.
Figure 9. The temperature difference between $T_{max}$-based $T_{oral}$ and $T_{ref}$ versus $T_{ref}$ for IRT-1 in the entire temperature range. (Solid lines: lines of zero difference; dashed lines: lines of difference equal to $\Delta_{cb} + 2\sigma_{\Delta cb}$, $\Delta_{cb}$, and $\Delta_{cb} - 2\sigma_{\Delta cb}$, respectively.)
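Table 1. Demographics of study subjects.

| Category | Group | IRT-1 subjects | IRT-1 % | IRT-2 subjects | IRT-2 % |
|---|---|---|---|---|---|
| | Female | 606 | 59.41 | 601 | 59.50 |
| | Male | 414 | 40.59 | 409 | 40.50 |
| Age | 18–20 | 534 | 52.35 | 527 | 52.18 |
| Age | 21–30 | 432 | 42.35 | 429 | 42.48 |
| Age | 31–40 | 31 | 3.04 | 31 | 3.07 |
| Age | 41–50 | 9 | 0.88 | 9 | 0.89 |
| Age | 51–60 | 11 | 1.08 | 11 | 1.09 |
| Age | >60 | 3 | 0.29 | 3 | 0.30 |
| Ethnicity | White | 506 | 49.61 | 500 | 49.50 |
| Ethnicity | Black/African-American | 143 | 14.02 | 143 | 14.16 |
| Ethnicity | Hispanic/Latino | 57 | 5.59 | 55 | 5.45 |
| Ethnicity | Asian | 260 | 25.49 | 258 | 25.54 |
| Ethnicity | Multiracial | 50 | 4.90 | 50 | 4.95 |
| Ethnicity | American Indian | 4 | 0.39 | 4 | 0.40 |
| $T_{ref}$ > 37.5 °C | | 111 | 10.88 | 111 | 10.99 |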
Table 2. Study subject grouping by ambient temperature.

| Group | Ambient temperature (°C) | Relative humidity | Subject # for IRT-1 | Subject # for IRT-2 |
|---|---|---|---|---|
| Group 1 [41] | 20–24 | 10–62% (7.5% subject data in the 50–62% range) | 544 | 540 |
| Group 2 | 24–29 | 10–62% (9.9% subject data in the 50–62% range) | 476 | 470 |
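Table 3. Pearson correlation coefficients (r values) between facial temperatures and $T_{ref}$.

| Region | Temperature | Group 1 [41], IRT-1 | Group 1 [41], IRT-2 | Group 2, IRT-1 | Group 2, IRT-2 |
|---|---|---|---|---|---|
| Forehead | $T_{FC}$ | 0.46 | 0.46 | 0.50 | 0.50 |
| Forehead | $T_{FT}$ | 0.41 | 0.39 | 0.37 | 0.37 |
| Forehead | $T_{FB}$ | 0.49 | 0.49 | 0.52 | 0.53 |
| Forehead | $T_{FL}$ | 0.47 | 0.46 | 0.46 | 0.46 |
| Forehead | $T_{FR}$ | 0.43 | 0.41 | 0.43 | 0.42 |
| Forehead | $T_{FCmax}$ | 0.55 | 0.54 | 0.56 | 0.57 |
| Forehead | $T_{FEmax}$ | 0.63 | 0.62 | 0.60 | 0.61 |
| Inner canthi | $\bar{T}_{CL}$ | 0.60 | 0.53 | 0.62 | 0.63 |
| Inner canthi | $\bar{T}_{CR}$ | 0.58 | 0.51 | 0.61 | 0.56 |
| Inner canthi | $\bar{T}_{C}$ | 0.63 | 0.56 | 0.65 | 0.62 |
| Inner canthi | $T_{Cmax1}$ | 0.65 | 0.59 | 0.66 | 0.65 |
| Inner canthi | $T_{CLmax}$ | 0.70 | 0.70 | 0.74 | 0.73 |
| Inner canthi | $T_{CRmax}$ | 0.71 | 0.69 | 0.75 | 0.72 |
| Inner canthi | $T_{Cmax2}$ | 0.73 | 0.73 | 0.77 | 0.76 |
| Inner canthi | $T_{CEmax}$ | 0.75 | 0.76 | 0.79 | 0.80 |
| Mouth | $T_{Mmax}$ | 0.60 | 0.60 | 0.69 | 0.69 |
| Face | $T_{max}$ | 0.78 | 0.79 | 0.81 | 0.82 |

Note: Definitions of these facial temperatures can be found in Figure 2 and our previous paper [41]. The bold font shows the best results (the highest r).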
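Table 4. AUC values for ROC curves based on different facial temperatures.

| Region | Temperature | Group 1 [41], IRT-1 | Group 1 [41], IRT-2 | Group 2, IRT-1 | Group 2, IRT-2 |
|---|---|---|---|---|---|
| Forehead | $T_{FC}$ | 0.82 | 0.82 | 0.82 | 0.82 |
| Forehead | $T_{FT}$ | 0.79 | 0.79 | 0.76 | 0.76 |
| Forehead | $T_{FB}$ | 0.82 | 0.82 | 0.82 | 0.82 |
| Forehead | $T_{FL}$ | 0.80 | 0.79 | 0.80 | 0.78 |
| Forehead | $T_{FR}$ | 0.81 | 0.79 | 0.78 | 0.79 |
| Forehead | $T_{FCmax}$ | 0.84 | 0.84 | 0.85 | 0.84 |
| Forehead | $T_{FEmax}$ | 0.86 | 0.87 | 0.87 | 0.85 |
| Inner canthi | $\bar{T}_{CL}$ | 0.88 | 0.91 | 0.93 | 0.94 |
| Inner canthi | $\bar{T}_{CR}$ | 0.87 | 0.87 | 0.91 | 0.88 |
| Inner canthi | $\bar{T}_{C}$ | 0.88 | 0.90 | 0.93 | 0.92 |
| Inner canthi | $T_{Cmax1}$ | 0.88 | 0.92 | 0.93 | 0.94 |
| Inner canthi | $T_{CLmax}$ | 0.94 | 0.95 | 0.97 | 0.96 |
| Inner canthi | $T_{CRmax}$ | 0.93 | 0.93 | 0.96 | 0.94 |
| Inner canthi | $T_{Cmax2}$ | 0.94 | 0.94 | 0.97 | 0.97 |
| Inner canthi | $T_{CEmax}$ | 0.95 | 0.95 | 0.97 | 0.97 |
| Mouth | $T_{Mmax}$ | 0.89 | 0.88 | 0.91 | 0.90 |
| Face | $T_{max}$ | 0.95 | 0.97 | 0.97 | 0.97 |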
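As a rough illustration of how an AUC like those in Table 4 can be obtained from raw data, the sketch below uses the rank interpretation of the area under the ROC curve described by Hanley and McNeil [53]: the AUC equals the probability that a randomly chosen febrile subject has a higher facial temperature than a randomly chosen non-febrile subject. The function name `auc_rank`, the sample temperatures, and the 37.5 °C reference threshold are assumptions made for this example only; the threshold actually used for Table 4 is defined in the main text, not here.

```python
def auc_rank(pos_scores, neg_scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical (T_ref, facial temperature) pairs in °C, not study data.
subjects = [(36.6, 35.1), (36.9, 35.4), (37.1, 35.3),
            (37.8, 35.9), (38.4, 36.5), (37.6, 35.3)]
FEBRILE_THRESHOLD = 37.5  # assumed reference threshold for illustration

febrile = [t_face for t_ref, t_face in subjects if t_ref > FEBRILE_THRESHOLD]
afebrile = [t_face for t_ref, t_face in subjects if t_ref <= FEBRILE_THRESHOLD]
auc = auc_rank(febrile, afebrile)
```

The pairwise loop is quadratic in the number of subjects, which is fine for a study of about 1000 people; a rank-based implementation would be preferable for much larger data sets.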
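Table 5. Clinical accuracy of $T_{oral}$ measurements for IRT-1 based on $T_{CEmax}$ and $T_{max}$: $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, and $\sigma_r$ (unit: °C).

| $T_{ref}$ range | Metric | $T_{CEmax}$: Offset | $T_{CEmax}$: Ordinary | $T_{CEmax}$: Deming | $T_{CEmax}$: Weighted | $T_{CEmax}$: Binning | $T_{CEmax}$: Segmented | $T_{max}$: Offset | $T_{max}$: Ordinary | $T_{max}$: Deming | $T_{max}$: Weighted | $T_{max}$: Binning | $T_{max}$: Segmented |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All $T_{ref}$ | $\Delta_{cb}$ | −0.03 | −0.03 | −0.03 | 0.21 | −0.13 | −0.03 | −0.02 | −0.02 | −0.02 | 0.22 | −0.09 | −0.03 |
| All $T_{ref}$ | $\sigma_{\Delta cb}$ | 0.40 | 0.35 | 0.37 | 0.35 | 0.47 | 0.30 | 0.35 | 0.33 | 0.34 | 0.33 | 0.41 | 0.29 |
| All $T_{ref}$ | $A_{rms}$ | 0.40 | 0.35 | 0.37 | 0.41 | 0.49 | 0.30 | 0.35 | 0.33 | 0.34 | 0.39 | 0.42 | 0.29 |
| $T_{ref}$ < 37 °C | $\Delta_{cb}$ | 0.05 | 0.11 | 0.07 | 0.34 | −0.10 | 0.10 | 0.05 | 0.10 | 0.07 | 0.34 | −0.06 | 0.10 |
| $T_{ref}$ < 37 °C | $\sigma_{\Delta cb}$ | 0.37 | 0.29 | 0.34 | 0.29 | 0.45 | 0.22 | 0.33 | 0.27 | 0.32 | 0.27 | 0.40 | 0.21 |
| $T_{ref}$ < 37 °C | $A_{rms}$ | 0.38 | 0.30 | 0.35 | 0.45 | 0.46 | 0.24 | 0.34 | 0.29 | 0.32 | 0.44 | 0.41 | 0.23 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\Delta_{cb}$ | −0.14 | −0.19 | −0.16 | 0.05 | −0.19 | −0.21 | −0.12 | −0.17 | −0.13 | 0.08 | −0.14 | −0.20 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\sigma_{\Delta cb}$ | 0.40 | 0.30 | 0.36 | 0.30 | 0.50 | 0.30 | 0.35 | 0.28 | 0.33 | 0.29 | 0.43 | 0.28 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $A_{rms}$ | 0.42 | 0.35 | 0.39 | 0.31 | 0.53 | 0.37 | 0.37 | 0.33 | 0.35 | 0.30 | 0.45 | 0.35 |
| $T_{ref}$ > 38.5 °C | $\Delta_{cb}$ | −0.42 | −0.91 | −0.58 | −0.62 | −0.12 | −0.39 | −0.49 | −0.87 | −0.58 | −0.61 | −0.18 | −0.39 |
| $T_{ref}$ > 38.5 °C | $\sigma_{\Delta cb}$ | 0.26 | 0.24 | 0.24 | 0.23 | 0.36 | 0.35 | 0.23 | 0.22 | 0.22 | 0.22 | 0.31 | 0.36 |
| $T_{ref}$ > 38.5 °C | $A_{rms}$ | 0.48 | 0.93 | 0.62 | 0.65 | 0.36 | 0.51 | 0.53 | 0.90 | 0.62 | 0.65 | 0.34 | 0.52 |
| | $\sigma_r$ | 0.11 | 0.08 | 0.10 | 0.09 | 0.14 | 0.07 | 0.18 | 0.14 | 0.17 | 0.14 | 0.22 | 0.13 |

Note: The bold font shows the best results (i.e., minimum values of $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, and $\sigma_r$).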
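Table 6. Clinical accuracy of $T_{oral}$ measurements for IRT-2 based on $T_{CEmax}$ and $T_{max}$: $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, and $\sigma_r$ (unit: °C).

| $T_{ref}$ range | Metric | $T_{CEmax}$: Offset | $T_{CEmax}$: Ordinary | $T_{CEmax}$: Deming | $T_{CEmax}$: Weighted | $T_{CEmax}$: Binning | $T_{CEmax}$: Segmented | $T_{max}$: Offset | $T_{max}$: Ordinary | $T_{max}$: Deming | $T_{max}$: Weighted | $T_{max}$: Binning | $T_{max}$: Segmented |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All $T_{ref}$ | $\Delta_{cb}$ | 0.02 | 0.03 | 0.03 | 0.25 | −0.03 | 0.02 | 0.01 | 0.02 | 0.02 | 0.19 | −0.04 | 0.01 |
| All $T_{ref}$ | $\sigma_{\Delta cb}$ | 0.42 | 0.32 | 0.35 | 0.33 | 0.42 | 0.29 | 0.38 | 0.31 | 0.32 | 0.32 | 0.39 | 0.27 |
| All $T_{ref}$ | $A_{rms}$ | 0.42 | 0.32 | 0.35 | 0.41 | 0.42 | 0.29 | 0.38 | 0.31 | 0.32 | 0.37 | 0.39 | 0.27 |
| $T_{ref}$ < 37 °C | $\Delta_{cb}$ | 0.06 | 0.15 | 0.11 | 0.35 | 0.00 | 0.15 | 0.05 | 0.14 | 0.10 | 0.27 | −0.01 | 0.14 |
| $T_{ref}$ < 37 °C | $\sigma_{\Delta cb}$ | 0.44 | 0.29 | 0.35 | 0.32 | 0.44 | 0.23 | 0.39 | 0.27 | 0.32 | 0.32 | 0.40 | 0.22 |
| $T_{ref}$ < 37 °C | $A_{rms}$ | 0.44 | 0.33 | 0.37 | 0.47 | 0.44 | 0.27 | 0.40 | 0.30 | 0.34 | 0.42 | 0.40 | 0.26 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\Delta_{cb}$ | −0.05 | −0.14 | −0.10 | 0.10 | −0.10 | −0.20 | −0.05 | −0.14 | −0.10 | 0.06 | −0.10 | −0.19 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $\sigma_{\Delta cb}$ | 0.38 | 0.26 | 0.30 | 0.28 | 0.38 | 0.23 | 0.35 | 0.25 | 0.28 | 0.28 | 0.35 | 0.22 |
| 37 °C ≤ $T_{ref}$ ≤ 38.5 °C | $A_{rms}$ | 0.38 | 0.29 | 0.31 | 0.29 | 0.40 | 0.30 | 0.35 | 0.28 | 0.30 | 0.28 | 0.37 | 0.29 |
| $T_{ref}$ > 38.5 °C | $\Delta_{cb}$ | 0.25 | −0.58 | −0.25 | −0.17 | 0.21 | −0.09 | 0.14 | −0.57 | −0.28 | −0.13 | 0.11 | −0.19 |
| $T_{ref}$ > 38.5 °C | $\sigma_{\Delta cb}$ | 0.39 | 0.22 | 0.28 | 0.25 | 0.39 | 0.47 | 0.36 | 0.21 | 0.27 | 0.26 | 0.37 | 0.38 |
| $T_{ref}$ > 38.5 °C | $A_{rms}$ | 0.44 | 0.62 | 0.36 | 0.29 | 0.42 | 0.45 | 0.36 | 0.61 | 0.38 | 0.28 | 0.36 | 0.41 |
| | $\sigma_r$ | 0.15 | 0.09 | 0.11 | 0.10 | 0.15 | 0.07 | 0.22 | 0.15 | 0.18 | 0.18 | 0.23 | 0.12 |

Note: The bold font shows the best results (i.e., minimum values of $\Delta_{cb}$, $\sigma_{\Delta cb}$, $A_{rms}$, and $\sigma_r$).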
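The three accuracy metrics in Tables 5 and 6 are linked: if the clinical bias is taken as the mean of the paired differences $T_{oral} - T_{ref}$, $\sigma_{\Delta cb}$ as their standard deviation, and $A_{rms}$ as their root-mean-square value, then $A_{rms}^2 = \Delta_{cb}^2 + \sigma_{\Delta cb}^2$ (exactly for the population standard deviation, approximately otherwise). The Python sketch below illustrates this under those assumed definitions; the formal definitions are given in the main text, and the function name and sample temperatures are hypothetical. Repeatability ($\sigma_r$) is not covered.

```python
import math


def clinical_accuracy(t_oral, t_ref):
    """Return (bias, sd, a_rms) of the paired differences t_oral - t_ref."""
    diffs = [est - ref for est, ref in zip(t_oral, t_ref)]
    n = len(diffs)
    bias = sum(diffs) / n                                     # assumed Δcb
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / n)   # population SD
    a_rms = math.sqrt(sum(d * d for d in diffs) / n)          # RMS difference
    return bias, sd, a_rms


# Hypothetical paired readings in °C, not study data.
t_ref = [36.7, 37.4, 38.0, 38.6]
t_oral = [36.8, 37.2, 37.9, 38.3]
bias, sd, a_rms = clinical_accuracy(t_oral, t_ref)

# The decomposition holds up to floating-point error.
identity_gap = abs(a_rms ** 2 - (bias ** 2 + sd ** 2))

# Consistency check against one tabulated entry (Table 6, weighted regression,
# T_CEmax, all T_ref): bias 0.25 and SD 0.33 give roughly the listed 0.41.
table6_check = math.hypot(0.25, 0.33)
```

This relationship explains why the weighted regression columns show $A_{rms}$ noticeably larger than $\sigma_{\Delta cb}$ over the full range: the spread is similar to the other methods, but the larger bias adds to the root-mean-square difference.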
Table 7. Diagnostic accuracy of IRT-1 and IRT-2 based on $T_{oral}$ imputed from $T_{CEmax}$ and $T_{max}$: Se/Sp and $d_{SeSp}$.

| Device | Metric | $T_{CEmax}$: Offset | $T_{CEmax}$: Ordinary | $T_{CEmax}$: Deming | $T_{CEmax}$: Weighted | $T_{CEmax}$: Binning | $T_{CEmax}$: Segmented | $T_{max}$: Offset | $T_{max}$: Ordinary | $T_{max}$: Deming | $T_{max}$: Weighted | $T_{max}$: Binning | $T_{max}$: Segmented |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IRT-1 | Se | 0.73 | 0.61 | 0.73 | 0.89 | 0.73 | 0.61 | 0.74 | 0.60 | 0.71 | 0.88 | 0.76 | 0.55 |
| IRT-1 | Sp | 0.94 | 0.97 | 0.95 | 0.87 | 0.94 | 0.97 | 0.94 | 0.98 | 0.95 | 0.89 | 0.93 | 0.99 |
| IRT-1 | $d_{SeSp}$ | 0.28 | 0.39 | 0.28 | 0.18 | 0.28 | 0.39 | 0.27 | 0.41 | 0.29 | 0.17 | 0.25 | 0.45 |
| IRT-2 | Se | 0.84 | 0.68 | 0.75 | 0.86 | 0.84 | 0.66 | 0.81 | 0.67 | 0.77 | 0.84 | 0.79 | 0.58 |
| IRT-2 | Sp | 0.91 | 0.98 | 0.96 | 0.85 | 0.94 | 0.99 | 0.94 | 0.99 | 0.97 | 0.89 | 0.95 | 0.99 |
| IRT-2 | $d_{SeSp}$ | 0.18 | 0.32 | 0.25 | 0.20 | 0.17 | 0.34 | 0.20 | 0.33 | 0.23 | 0.20 | 0.22 | 0.42 |

Note: The bold font shows the best results ($d_{SeSp}$ ≤ 0.20).
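The $d_{SeSp}$ values in Table 7 are consistent with the Euclidean distance from the operating point (Se, Sp) to the ideal point (1, 1); the formal definition is given in the main text, so the formula below should be read as an assumption that happens to reproduce the tabulated numbers to within rounding. The Python sketch also shows one plausible way to obtain Se and Sp from paired readings. The function names, the sample temperatures, the 37.5 °C thresholds, and the use of "greater than or equal to" at the cut-off are all hypothetical choices for this example.

```python
import math


def se_sp(t_oral, t_ref, t_cut, t_thresh):
    """Sensitivity and specificity of flagging t_oral >= t_cut,
    with febrile status defined as t_ref >= t_thresh (assumed convention)."""
    tp = fn = tn = fp = 0
    for est, ref in zip(t_oral, t_ref):
        febrile = ref >= t_thresh
        flagged = est >= t_cut
        if febrile and flagged:
            tp += 1
        elif febrile:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def d_se_sp(se, sp):
    """Distance from (Se, Sp) to the ideal point (1, 1); smaller is better."""
    return math.hypot(1.0 - se, 1.0 - sp)


# Hypothetical paired readings in °C, not study data.
t_ref = [36.8, 37.2, 37.6, 38.1, 38.4]
t_oral = [36.9, 37.6, 37.4, 38.0, 38.5]
se, sp = se_sp(t_oral, t_ref, t_cut=37.5, t_thresh=37.5)
distance = d_se_sp(se, sp)

# Check against Table 7 (IRT-1, offset, T_CEmax): Se 0.73 and Sp 0.94
# give about 0.277, which rounds to the listed 0.28.
table7_check = d_se_sp(0.73, 0.94)
```

A distance-based summary penalizes a method that trades almost all sensitivity for specificity, which is why the ordinary and segmented regressions score poorly in Table 7 despite specificities near 0.99.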
Table 8. Maximum permissible errors defined in different standards.

| Standard | Devices (required minimum display range) | Maximum permissible errors, in specific temperature ranges | Accuracy type (laboratory/clinical) | Note |
|---|---|---|---|---|
| IEC 80601-2-59: 2017 [30] | IRTs (none) | ±0.5 °C, 34.0–39.0 °C. | Laboratory | Errors from all the test devices are combined. |
| ISO 80601-2-56: 2017 [34] | Clinical thermometers (34.0–43.0 °C) | ±0.3 °C, within the rated output range; ±0.4 °C, within the rated extended output range. | Laboratory | This standard is under revision for improvement. |
| ASTM E1112-00: 2011 [58] | Electronic thermometers (35.5–41.0 °C) | ±0.3 °C, < 35.8 °C; ±0.2 °C, 35.8–37.0 °C; ±0.1 °C, 37.0–39.0 °C; ±0.2 °C, 39.0–41.0 °C; ±0.3 °C, > 41.0 °C. | Not clear | |
| ASTM E1965-98: 2016 [33] | IR thermometers (ear canal: 34.4–42.2 °C; skin: 22.0–40.0 °C) | For ear canal IR thermometers: ±0.3 °C, < 36.0 °C; ±0.2 °C, 36.0–39.0 °C; ±0.3 °C, > 39.0 °C. For skin IR thermometers: ±0.3 °C, over the display range. | Laboratory | |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Wang, Q.; Zhou, Y.; Ghassemi, P.; McBride, D.; Casamento, J.P.; Pfefer, T.J. Infrared Thermography for Measuring Elevated Body Temperature: Clinical Accuracy, Calibration, and Evaluation. Sensors 2022, 22, 215. https://doi.org/10.3390/s22010215
