The Predictive Value of the NICE “Red Traffic Lights” in Acutely Ill Children

Objective: Early recognition and treatment of febrile children with serious infections (SI) improves prognosis; however, early detection can be difficult. We aimed to validate the predictive rule-in value of the National Institute for Health and Clinical Excellence (NICE) most severe alarming signs or symptoms to identify SI in children.
Design, Setting and Participants: The 16 most severe (“red”) features of the NICE traffic light system were validated in seven different primary care and emergency department settings, including 6,260 children presenting with acute illness.
Main Outcome Measures: We focussed on the individual predictive value of single red features for SI and their combinations. Results were presented as positive likelihood ratios, sensitivities and specificities. We categorised red features as “general” or “disease-specific”. Changes in pre-test probability versus post-test probability for SI were visualised in Fagan nomograms.
Results: Almost all red features had rule-in value for SI, but only four individual red features substantially raised the probability of SI in more than one dataset: “does not wake/stay awake”, “reduced skin turgor”, “non-blanching rash”, and “focal neurological signs”. The presence of ≥3 red features improved prediction of SI but still lacked strong rule-in value as likelihood ratios were below 5.
Conclusions: The rule-in value of the most severe alarming signs or symptoms of the NICE traffic light system for identifying children with SI was limited, even when multiple red features were present. Our study highlights the importance of assessing the predictive value of alarming signs in clinical guidelines prior to widespread implementation in routine practice.


Introduction
Fever is one of the most common symptoms among children presenting to ambulatory care. [1-3] The majority of children presenting with an acute illness to ambulatory care will have self-limiting viral infections, with only a small proportion having a serious infection (SI). [1,4-6] Early recognition and treatment of children with SI are related to better prognosis; [7,8] however, identification of SI at first presentation can be difficult.
The National Institute for Health and Clinical Excellence (NICE) 2013 guideline for the management of children with feverish illness provides comprehensive guidance on the assessment, investigation and management of children presenting at different settings, including primary care and pediatric specialty settings. [6,9] One of the key elements of the guideline is a "traffic light" system for the diagnostic assessment of children under five years of age presenting with a feverish illness. This evidence and consensus-based system includes clinical features identified from existing scoring systems for acutely ill children, [10-13] and disease-specific signs and symptoms. Children with the most alarming (or "red") features are considered at higher risk of SI; subsequent management for these children includes invasive investigations, treatment, and hospital admission.
As one of the few evidence-based guidelines for children with fever [14,15] and the only one covering both primary and secondary care, the NICE febrile child guideline has been implemented in many settings, not only in the United Kingdom but also in other countries. Recently, two studies reported low specificities for the approach that any abnormal amber or red feature would indicate possible SI. [16,17] This could be due to the inclusion of amber features, whose association with SI may be weaker.
In this study we aimed to determine the predictive ("rule-in") value of the red features of the NICE traffic light system, both for the individual red features and for their combinations, for identifying children with SI in various acute pediatric settings in Europe.

Identification of datasets
We used data on seven independent cohorts [4,18-23] collected by collaborators of the European Research Network on recognising serious InfEctions (ERNIE) group. [24] Data were prospectively collected at first contact using standardised (site-specific) documentation of patient characteristics, except for Monteny et al [19] where data were collected using structured clinical proformas separate from the consultation. All datasets were cohort studies of children in various age ranges (0-16 years), presenting to ambulatory care settings (i.e. general or family practice, pediatric outpatient clinic, pediatric assessment unit or emergency department) with an acute illness or infection.

Ethical approval
This research conforms to the Helsinki Declaration and to local legislation. The original study authors have all agreed to share their data, and had obtained ethical approval from their local research ethics committees for the initial data collection, prior to this study.

Processing of included datasets
Key characteristics of each dataset are shown in table 1. We selected children under the age of five years with an acute illness based on general symptoms [4,21,22] or specifically on the presence of fever [18-20,23], as this is the target group of the NICE guideline (table 1).
The NICE traffic light system includes 16 red features, which are categorised into 5 main domains: Colour (1 red feature), Activity (4 red features), Respiratory (3 red features), Hydration (1 red feature), and Other (7 red features). [6,9] When study variables were not entirely identical to the red features in the NICE febrile child guideline, we identified proxies where possible. Identification and handling of variables has been described earlier; [17] a full list of all approximations is given in table S1. When a red feature was not recorded in the dataset and no suitable proxy was identified, this item was excluded from that specific dataset. Table S2 outlines the unrecorded and missing data from each dataset separately.
Missing values were not imputed because the necessary missing-at-random assumption was likely to be incorrect. We considered red features that were "not documented" in individual patient records as "absent", provided that the red feature or its proxy was recorded in that particular dataset. [17] The translation, recoding and data-checking were performed by two authors (EK, JV) and the results of each step were discussed with all primary study authors. [17]

Outcome measures
Serious infections (SI) were defined as sepsis (including bacteremia), meningitis, pneumonia, osteomyelitis, cellulitis, and complicated urinary tract infections. [25] The diagnosis of SI was not based on clinical diagnosis alone; reference standard test criteria were used to determine final diagnoses of SI. Detailed descriptions of these reference standard test criteria are available in the original study papers. [4,18-23] Assessment of the diagnoses to ensure comparability of outcomes was discussed with the lead investigator of each study, as described earlier. [17]

Statistical analysis
The individual red features were analysed in every dataset separately. Additionally, results were categorised as "general" red features (items 1-7 and 9-10) and "disease-specific" red features (items 8 and 11-16).
We assessed the rule-in value for SI for each red feature separately by calculating positive likelihood ratios (LR+). Red features were considered to have rule-in value if they raised the probability of illness with a positive likelihood ratio of more than 5.0. [25] The univariable association between each individual red feature and the presence of SI was tested by Chi-square analysis. Likelihood ratios, sensitivity and specificity were calculated for the presence of ≥1, ≥2 and ≥3 red traffic lights (RTLs). The sensitivity and specificity for "general" and "disease-specific" red features were plotted in receiver operating characteristic (ROC) space.
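As an illustration of the accuracy measures described above, the following minimal Python sketch derives sensitivity, specificity and likelihood ratios for the presence of ≥k red features from per-child records. It is not the SPSS procedure used in this study: the function names, the record layout (number of red features, SI yes/no) and the counts in the usage example are hypothetical.

```python
import math
from typing import Iterable, NamedTuple, Tuple


class Accuracy(NamedTuple):
    sensitivity: float
    specificity: float
    lr_pos: float
    lr_neg: float


def _ratio(numerator: float, denominator: float) -> float:
    """Division that returns inf (or nan for 0/0) instead of raising."""
    if denominator:
        return numerator / denominator
    return math.inf if numerator else math.nan


def accuracy_from_counts(tp: int, fp: int, fn: int, tn: int) -> Accuracy:
    """Accuracy measures from a 2x2 table.

    Assumes at least one child with SI and one child without SI.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    false_positive_rate = fp / (fp + tn)   # 1 - specificity
    false_negative_rate = fn / (tp + fn)   # 1 - sensitivity
    lr_pos = _ratio(sensitivity, false_positive_rate)
    lr_neg = _ratio(false_negative_rate, specificity)
    return Accuracy(sensitivity, specificity, lr_pos, lr_neg)


def accuracy_at_threshold(records: Iterable[Tuple[int, bool]], k: int) -> Accuracy:
    """Treat 'at least k red features' as a positive test."""
    tp = fp = fn = tn = 0
    for n_red, has_si in records:
        test_positive = n_red >= k
        if test_positive and has_si:
            tp += 1
        elif test_positive:
            fp += 1
        elif has_si:
            fn += 1
        else:
            tn += 1
    return accuracy_from_counts(tp, fp, fn, tn)


# Hypothetical cohort: 100 children with SI, 900 without.
records = (
    [(3, True)] * 30 + [(1, True)] * 70
    + [(3, False)] * 50 + [(0, False)] * 850
)
for k in (1, 2, 3):
    print(k, accuracy_at_threshold(records, k))
```

In this invented cohort, requiring ≥3 red features gives an LR+ just above the 5.0 threshold at the cost of a sensitivity of 30%, which mirrors the trade-off between rule-in value and missed cases examined in this study.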
The incremental diagnostic value of additional red features (up to four or more) compared with one red feature was evaluated by logistic regression analyses with forward selection (Wald test, p-value <0.05).
We visualised the change in pre-test probability versus post-test probability for SI in a Fagan nomogram. [26] No overall pooled likelihood ratios were calculated because of the substantial clinical heterogeneity between datasets (differences in setting, inclusion criteria, immunisation schedules and definition of serious infection). [17] All analyses were done with SPSS software (version 20.0, SPSS Inc, Chicago).
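The calculation that a Fagan nomogram represents graphically can be sketched in a few lines of Python: the pre-test probability is converted to odds, multiplied by the likelihood ratio, and converted back to a probability. The function name is hypothetical, and the inputs in the usage example are chosen for illustration only (the 5.0 rule-in threshold, a 9% pre-test probability, and the lowest prevalence of 1.2% reported for the included datasets); they are not results of this study.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Post-test probability via odds, as read off a Fagan nomogram."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre_test_probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Illustrative inputs only: the same LR+ of 5.0 applied at two prevalences.
print(round(post_test_probability(0.09, 5.0), 3))    # about 0.33
print(round(post_test_probability(0.012, 5.0), 3))   # about 0.06
```

The same likelihood ratio therefore yields very different post-test probabilities in low and high prevalence settings, which is why the nomograms are drawn per dataset.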

Included datasets
We selected 6,260 children under five years of age from seven pre-existing datasets of diagnostic studies in children with an acute illness (n = 6,260/10,812, 58%; table 1). Children were included based on fever, [19,20,23] acute illness, [4,18] acute infection, [21] and referral for meningeal signs. [22] Children with various severities of co-morbidity were excluded in five studies [4,19-23], one study excluded children if the acute episode was caused by an exacerbation of a chronic condition [4] and one study excluded children who required immediate resuscitation [18] (table 1). All studies included sepsis, meningitis, pneumonia and complicated urinary tract infections in their outcome definition. Osteomyelitis and cellulitis were explicitly mentioned in five and three datasets, respectively.
The median age of the selected children ranged from 0.8 years to 1.9 years. The prevalence of SI ranged from 1.2% to 4.1% in two datasets from general practice [4,19] and from 9.3% to 40.2% in five datasets from emergency departments and a pediatric assessment unit [18,20-23].

Red traffic lights included in the datasets
Data on all red features included in domains "Colour" and "Hydration" were available in all datasets. The red features "no response to social cues" and "weak, high-pitched or continuous cry" of domain "Activity" were not recorded in two [20,23] and one [18] datasets, respectively. Other red features in this domain were available in all datasets. Red features related to the "Respiratory" domain were not recorded in four ("grunting") [4,21-23], one ("tachypnoea") [22], and two ("chest indrawing") [22,23] datasets, respectively. "Disease-specific" red features (items 8 and 11-16) were recorded less frequently in all datasets, particularly in low prevalence settings (range of missing values 0-50%; see table S2). Table 2 shows positive and negative likelihood ratios of the 16 individual red features for each dataset separately. All red features with high rule-in value (LR+ >5) are highlighted in bold.

Performance of individual red traffic lights
Four of the 16 red features did not achieve high rule-in value (LR+ <5), including two red features that were either not available in the datasets or did not reach significance (p<0.05) when present.

Performance of multiple red traffic lights
The association between SI and the number of positive red features, with the performance measures of positive likelihood ratios, sensitivity and specificity, is shown in table 3. We measured the maximum predictive value of multiple red features by logistic regression analysis and the slope of the ROC-curve. We noted a significant increase of rule-in value with the number of positive red features in most datasets (range LR+ 2.1-10.0 when ≥3 red features were present), with the exception of Monteny et al. [19] (p-value <0.05). This was also observed in the increased values of specificity when more red features were present. The presence of 4 or more red features did not add discriminative value compared with up to 3 red features. The proportion of children having ≥3 red features ranged from 2% to 50% and did not differ between low and high prevalence settings. "General" red features were almost entirely responsible for the total ROC-area (table 3). We did not test disease-specific red features on disease-specific outcome measures due to the small numbers of these events. In figure 1 we visualised the change in pre-test to post-test probability for SI when three or more (general or disease-specific) red features were present in a Fagan nomogram. [27] For example, the 9% pre-test probability of having an SI for a child in the Brent et al dataset increases to 28% (95% CI 17-42%) post-test probability when having three or more red features, but decreases only to 7% (95% CI 6-9%) if fewer than three red features were present.
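The worked example above can be cross-checked by inverting the nomogram calculation: dividing the post-test odds by the pre-test odds gives the likelihood ratio implied by a pair of probabilities. The sketch below is a back-of-the-envelope check under the assumption that the rounded percentages quoted in the text are used as inputs; the function name is hypothetical and the resulting values are approximations, not the likelihood ratios reported in table 3.

```python
def implied_likelihood_ratio(pre_test_probability: float, post_test_probability: float) -> float:
    """Likelihood ratio implied by a pre-test/post-test probability pair."""
    for p in (pre_test_probability, post_test_probability):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = post_test_probability / (1.0 - post_test_probability)
    return post_test_odds / pre_test_odds


# Rounded percentages from the Brent et al example in the text.
lr_three_or_more = implied_likelihood_ratio(0.09, 0.28)   # roughly 3.9
lr_fewer_than_three = implied_likelihood_ratio(0.09, 0.07)  # roughly 0.76
print(round(lr_three_or_more, 2), round(lr_fewer_than_three, 2))
```

An implied positive likelihood ratio of roughly 3.9 lies below the 5.0 threshold used in this study for rule-in value, and an implied negative likelihood ratio of roughly 0.76 indicates that having fewer than three red features lowers the probability of SI only slightly.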

Main findings
This is the first study to broadly validate the diagnostic performance of the individual red features of the NICE febrile child guideline, and of their combinations, in acutely ill children in various settings in Europe. Although we observed rule-in value for almost all individual red features in at least one dataset, only four red features raised the probability of SI with a positive likelihood ratio of more than 5.0 in more than one setting: "does not wake or if roused does not stay awake", "reduced skin turgor", "non-blanching rash", and "focal neurological signs". Children with more than one red feature had an increased risk of SI; however, more than three red features did not further increase disease probability.

Comparison with other studies
To our knowledge there are three previous studies that estimated the predictive value of any amber or red feature for the detection of SI, but they did not evaluate the individual features of the NICE traffic light system separately. De et al. [16] found that the NICE traffic light system failed to identify a substantial proportion of children with serious bacterial infections. Combining the amber and red feature categories resulted in a sensitivity of 85.8% and specificity of 28.5% for the detection of any serious bacterial infections. Within the original data of Thompson et al. the diagnostic value of vital signs and the NICE traffic light system for identifying children with SI was assessed in a pediatric assessment unit. [21] They stated that the presence of one or more amber and red features was 85% sensitive, but only 29% specific in identifying serious or intermediate infections. [21] However, this original study was performed in children up to 16 years of age in contrast to this present study limited to children up to 5 years of age. Finally, a previous study assessing the diagnostic value of any abnormal amber or red feature (not considering combinations) of the NICE traffic light system to rule out SI had sensitivity of 97-100% in low and intermediate prevalence settings and 87-99% in high prevalence settings. [17] The results of all three validation studies suggest possible clinical value for ruling out SI using both amber and red features, but at the expense of a large group of children testing false positive. However, up to 15% of children with a serious infection will be missed. Alternatively, the presence of any amber or red feature does not allow ruling in SI considering the very low specificity. In low prevalence settings, alarming signs are preferably highly sensitive to correctly rule out SI in order to limit incorrect referral. [24] In high prevalence settings specificity is more important because a high rate of false positive children could result in high admission rates and unnecessary investigations. [24] Unfortunately there was too much heterogeneity in our datasets to stratify according to prevalence.

Clinical and research implications
With decreasing incidence of SI, clinicians may increasingly rely on alarming symptoms described in (inter)national clinical guidelines. Broad validation could support the wider adoption of the NICE guideline in various settings in Europe and other high-income countries. Although the traffic light system of the NICE febrile child guideline is mostly based on systematic literature reviews and consensus, only four red features achieved high rule-in value in more than one dataset and none of them across all settings. Moreover, in at least as many datasets these four red features did not achieve high rule-in value, which hampers strong conclusions. The rule-in value of several other red features was not confirmed in multiple settings either, questioning their inclusion in this setting-independent traffic light system.
Our observations of varying rule-in values of red features in the seven datasets did not support the development of one prediction model including the most important red features. However, we consistently observed an association between 3 or more red features and SI, but combinations of red features will never be able to definitely rule in an SI without uncertainty. This could be due to dilution of their accuracy by the inclusion of non-specific red features or because of the interaction between different red features.
The relatively lower recording of ''disease-specific'' features hampered our analyses, in particular in low prevalence settings. This may in part have been caused by the fact that it is more difficult to identify proxies for such features, in contrast to more general features.
The main findings in our study correspond with the limited performance of the Yale Observation Scale, on which the NICE traffic light system is partly based. [17,25] In the revised 2013 guideline [9] two red features were deleted from the previous 2007 protocol [6] or transferred to amber features: "Age 3-6m & temperature ≥39°C" and "bile-stained vomiting". This is supported by our findings for the former, for which we found no rule-in value; for the latter, only one dataset was available, which nevertheless showed high rule-in value. Next, as disease-specific red features are strongly related to specific but rare diseases, their positive documentation rate is already expected to be low. Although these disease-specific red features may be relevant for one specific outcome, it is difficult to evaluate these in the general population of fever with a broad differential diagnosis. However, achieving complete certainty with clinical features is not the goal here. Rather, red features should lift the probability of SI over a certain decision threshold: either to refer, request additional testing or start empiric treatment. As we do not know at what specific risk thresholds we (intuitively) undertake action, clinical interpretation of post-test probabilities as expressed in Fagan nomograms (figure 1) remains difficult. As diagnostic assessment is a dynamic process and may be influenced by evolution of symptoms over time, repeated assessment of deviating red features, in particular in those with only one or two features, may improve the evaluation of SI.
Finally, the NICE traffic light system could also be improved by taking more recent evidence into account, such as on peripheral circulation, parental concern [25] or urine analysis [16].
Table 3. Likelihood ratios and ROC-areas of combinations of multiple red traffic lights.

Strengths and limitations
We assessed the NICE red traffic lights in 6,260 children from seven existing datasets with various pediatric populations and settings including two low prevalence primary care settings, which are usually underrepresented in diagnostic studies in this area. [24] In addition, we validated the red features separately to identify their individual predictive value.
Despite the large amount of data, not all red features had been recorded in all datasets, necessitating the use of proxy variables. [17] Furthermore, differences in population characteristics (table 1), such as age distribution or prevalence of specific diagnoses within the group of SI, prevented the calculation of overall diagnostic performance measures.
Furthermore, by treating missing red features as absent, and because red features are likely to be documented more completely in ill children, we may have overestimated our likelihood ratios by increasing the contrast between children with and without SI.
However, the variability in variables and case-mix reflects clinical practice and therefore strengthens the generalisability of our results.

Conclusion
Our results support rule-in value of several individual red features from the NICE febrile child guideline in specific settings, although not consistently. However, most features had little rule-in value across multiple settings. The NICE red traffic lights, even when three or more features are present, seem to have limited value for ruling in serious infections. Our results underline the importance of widely validating the predictive value of individual red features and their combinations in clinical guidelines, prior to widespread dissemination and adoption.