Impact of detecting potentially serious incidental findings during multi-modal imaging

Background: There are limited data on the impact of feedback of incidental findings (IFs) from research imaging. We evaluated the impact of UK Biobank’s protocol for handling potentially serious IFs in a multi-modal imaging study of 100,000 participants (radiographer ‘flagging’ with radiologist confirmation of potentially serious IFs) compared with systematic radiologist review of all images. Methods: Brain, cardiac and body magnetic resonance, and dual-energy x-ray absorptiometry scans from the first 1000 imaged UK Biobank participants were independently assessed for potentially serious IFs using both protocols. We surveyed participants with potentially serious IFs and their GPs up to six months after imaging to determine subsequent clinical assessments, final diagnoses, and emotional, financial and work or activity impacts. Results: Compared with systematic radiologist review, radiographer flagging resulted in substantially fewer participants with potentially serious IFs (18/1000 [1.8%] versus 179/1000 [17.9%]) and a higher proportion with serious final diagnoses (5/18 [27.8%] versus 21/179 [11.7%]). Radiographer flagging missed 16/21 serious final diagnoses (i.e., false negatives), while systematic radiologist review generated large numbers of non-serious final diagnoses (158/179) (i.e., false positives). Almost all (90%) participants had further clinical assessment (including invasive procedures in similar numbers with serious and non-serious final diagnoses [11 and 12 respectively]), with additional impact on emotional wellbeing (16.9%), finances (8.9%), and work or activities (5.6%). Conclusions: Compared with systematic radiologist review, radiographer flagging missed some serious diagnoses, but avoided adverse impacts for many participants with non-serious diagnoses. 
While systematic radiologist review may benefit some participants, UK Biobank’s responsibility to avoid both unnecessary harm to larger numbers of participants and burdening of publicly-funded health services suggests that radiographer flagging is a justifiable approach in the UK Biobank imaging study. The potential scale of non-serious final diagnoses raises questions relating to handling IFs in other settings, such as commercial and public health screening.


Introduction
UK Biobank (www.ukbiobank.ac.uk) is a major resource for research into the determinants of a wide range of serious and life-threatening diseases, to improve their prevention, diagnosis and treatment 1. It is a prospective study which recruited 500,000 men and women aged 40-69 across the UK between 2006 and 2010 1. It includes extensive questionnaire and physical measurement data from the baseline visit, biological samples (with genotyping and biomarker assay data), longitudinal follow-up data from national health-related datasets and additional information from remote monitoring and web-based questionnaires.
The UK Biobank imaging study aims to perform brain, cardiac and body magnetic resonance imaging (MRI), dual-energy x-ray absorptiometry (DXA) and carotid Doppler ultrasound in 100,000 UK Biobank participants in dedicated imaging centres over seven years (http://imaging.ukbiobank.ac.uk/). By November 2017, over 20,000 participants had attended an imaging assessment visit (http://imaging.ukbiobank.ac.uk/), making it already the world's largest ever multi-modal imaging study 2 .
Incidental findings (IFs), defined as 'findings discovered in the course of research that are beyond the aims of the study,' 3 are a predictable consequence of much research, and studies need appropriate protocols for handling them (https://wellcome.ac.uk/funding/managing-grant/wellcome-trust-policy-position-healthrelated-findings-research/) 4. IFs are particularly pertinent to the UK Biobank imaging study given its large scale and the potential seriousness of IFs that may be detected. While clinical care and screening programmes aim to provide clinical benefit to patients, research studies have the primary aim of producing generalisable knowledge. Nevertheless, while research studies do not aim to benefit participants directly, they are obliged to minimise potential harms to participants and the wider public. Hence, although the UK Biobank imaging study aims to collect research data, rather than to detect or diagnose serious disease, it does require a protocol to handle IFs should they arise.
The UK Biobank imaging IFs protocol was developed as a pragmatic, scalable process, aiming to produce the best possible resource for biomedical research while minimising any potential harms for 100,000 largely asymptomatic UK Biobank participants. UK Biobank reviewed current practice, the extensive literature 3,5,6 and relevant published guidance (https://www.rcr.ac.uk/publication/management-incidental-findings-detectedduring-research-imaging), sought independent legal advice, and consulted with its independent Ethics and Governance Council 2. Key contextual factors considered were the non-clinical setting of the imaging visit, in which the scanning sequences are optimised for research use rather than clinical diagnosis, and the nature of the participants' existing consent (in particular the approach to the feedback of IFs). However, cost effectiveness was not considered relevant 2.
The UK Biobank imaging IFs protocol involves feedback to participants and their general practitioners (GPs) when a radiographer observes a potentially serious IF during image acquisition that is subsequently confirmed by a specialist radiologist. UK Biobank defines a potentially serious IF for these purposes as one indicating the possibility of a condition which, if confirmed, would carry a real prospect of seriously threatening life span, or of having a substantial impact on major body functions or quality of life.
The need for evidence to inform IFs policy
Limited data exist on the impact of feedback of IFs on participants and health services [8][9][10][11], and on how these vary by different policies for handling IFs. Most published data on opinions of receiving such feedback are based on hypothetical scenarios, rather than studies of research participants who have actually received feedback [12][13][14]. It is often assumed that early observation on imaging of presumed disease (prior to clinical presentation) is inevitably beneficial, but data on final clinical diagnosis and the impact of feedback of IFs are scarce 15. Such data would inform debates about these assumptions, and the design of appropriate, acceptable protocols to handle IFs detected in research, public health screening or commercial imaging settings.
In this evaluation of the first 1000 participants in the UK Biobank imaging study, we assessed the number and types of potentially serious IFs detected and their final clinical diagnoses, comparing the UK Biobank imaging IFs protocol with systematic radiologist review of all of the images. We also assessed the impact of providing feedback about potentially serious IFs on participants, their friends, families and health services, with respect to: clinical assessments undertaken; emotional wellbeing, finances, work and daily activities; and participants' and their general practitioners' (GPs') opinions about receiving feedback.

Two protocols for handling IFs
Images from the first 1000 participants were assessed using two protocols which ran simultaneously. Under the UK Biobank IFs protocol ('radiographer flagging'), if a radiographer noticed a potentially serious IF during image acquisition and quality assessment, the relevant set of images was flagged for subsequent review by a radiologist. Under 'systematic radiologist review', all images were systematically reviewed by a radiologist. Radiographers were trained in the relevant imaging protocols but did not receive specific training in image interpretation, as UK Biobank is a research resource and conducts research imaging. As such, UK Biobank does not aim to provide any form of health service, including image interpretation. The radiographers were not instructed to actively look for, or to avoid looking for, IFs; rather, they were instructed that should they happen to notice a concerning finding, they should flag it for review. Radiologists and radiographers were aware of the comparison study, but were blind to each other's opinions.
To aid interpretation of images assessed either during systematic radiologist review, or those flagged by radiographers, we provided reporting radiologists with data collected during the imaging visit on the participant's age, sex, body mass index, self-reported smoking status, alcohol consumption, medical history and medications.
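The difference between the two protocols can be sketched as a simple decision rule. This is an illustrative sketch only, not UK Biobank software: the `Scan` class, its field names and the `feedback_given` function are hypothetical, and `radiologist_reports_psif` stands for what a radiologist would report if they reviewed the images.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    # True if the radiographer happened to notice a concerning finding
    radiographer_flagged: bool
    # True if a radiologist reviewing the images would report a
    # potentially serious incidental finding (hypothetical field)
    radiologist_reports_psif: bool


def feedback_given(scan: Scan, protocol: str) -> bool:
    """Return True if the participant and GP would receive feedback."""
    if protocol == "radiographer_flagging":
        # Radiologist only sees images that the radiographer flagged
        return scan.radiographer_flagged and scan.radiologist_reports_psif
    if protocol == "systematic_review":
        # Radiologist reviews every set of images
        return scan.radiologist_reports_psif
    raise ValueError(f"unknown protocol: {protocol}")


# A finding that the radiographer did not notice leads to feedback
# only under systematic radiologist review.
unflagged = Scan(radiographer_flagged=False, radiologist_reports_psif=True)
print(feedback_given(unflagged, "radiographer_flagging"))  # False
print(feedback_given(unflagged, "systematic_review"))      # True
```

Under this rule, every finding fed back under radiographer flagging would also be fed back under systematic review, which is why the flagged group in the study is a subset of the systematically reviewed group.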

Participants
Within a few weeks of their imaging visit, we wrote to all participants who had a potentially serious IF reported by a radiologist, whether it had been both flagged by a radiographer and confirmed by a radiologist (radiographer flagging) or detected by a radiologist during systematic review of all images (systematic radiologist review). We explained that a potentially serious abnormality (or, sometimes, abnormalities) had been observed, and advised the participant to visit his/her GP for advice about any further action required (Supplementary File 4). We also wrote to these participants' GPs, providing a copy of the radiologist's report and, if requested, copies of the relevant scans (Supplementary File 5).

Qualitative study
To provide additional context, UK Biobank commissioned a social research company (TNS-BMRB; www.tns-bmrb.co.uk) to conduct a parallel qualitative study. This aimed: (1) to explore participants' understanding of and opinions about the process of consent relating to feedback of potentially serious IFs through deliberative group discussions with two groups of around 10 participants each (a more and a less affluent group) prior to their imaging assessment; and (2) to assess views on the process and impact of receiving feedback through one-to-one interviews with 15-20 participants (including more and less affluent male and female participants) with IFs on different imaging modalities, and with both serious and non-serious final clinical diagnoses. Further details of the methods of recruitment, interview content and qualitative analysis methods are available at http://www.ukbiobank.ac.uk/resources/.

Statistical analyses
We summarised data from questionnaires as counts and proportions. We compared groups using chi-squared or Fisher's exact tests for proportions and Student's independent t-test for continuous variables. We considered p values of <0.05 to be statistically significant and analysed data using Microsoft Excel 2013 and SPSS Statistics version 21.
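As an illustration of the kind of comparison of proportions described above, the following sketch implements a two-sided Fisher's exact test for a 2x2 table using only the Python standard library. It is not the authors' analysis (which used Excel and SPSS); the function name is hypothetical, and the example table is a classic textbook arrangement rather than study data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Probability of each possible value of the top-left cell
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / denom
        for x in range(lo, hi + 1)
    }
    p_obs = probs[a]
    # Small relative tolerance guards against floating-point ties
    p_value = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


# Textbook example (not study data): 3 of 4 versus 1 of 4
p = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided p = {p:.4f}")  # 34/70, i.e. 0.4857
```

With the threshold stated above, a result would be called statistically significant when the returned value is below 0.05.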

Ethics approval
UK Biobank obtained approval specifically for the imaging study, participant information and consent materials and this evaluation, including surveying participants and their GPs (North West Research Ethics Committee, Reference Number: 11/NW/0382).

Follow-up questionnaires
Each of the three follow-up questionnaires was returned for ≥70% of 179 participants with a potentially serious IF; at least one questionnaire was returned for 93.3% and all three for 45.8% (Table 2). Denominators varied for different types of clinical assessment and impact due to different proportions of completed responses to the relevant questions (Table 3).

Figure 1. Participant flowchart. MRI = magnetic resonance imaging, DXA = dual energy x-ray absorptiometry.
1 68 participants had incomplete imaging: 18 underwent DXA but not MRI due to safety issues, 50 did not complete all MRI (28 due to claustrophobia, 13 due to scanner failure, nine for other reasons).
2 Final diagnosis assigned to participants with more than one potentially serious incidental finding was the most serious (serious>uncertain>not serious).
3 Three of these participants had uncertain final diagnoses, see Supplementary File 7.
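The rule stated in the Figure 1 footnotes for participants with more than one potentially serious IF (the most serious final diagnosis is assigned, with serious > uncertain > not serious) can be sketched as follows. The names `SEVERITY_RANK` and `overall_final_diagnosis` are hypothetical and chosen only for illustration.

```python
# Ordering taken from the Figure 1 footnote: serious > uncertain > not serious
SEVERITY_RANK = {"not serious": 0, "uncertain": 1, "serious": 2}


def overall_final_diagnosis(diagnoses):
    """Return the most serious category among a participant's final diagnoses."""
    if not diagnoses:
        raise ValueError("at least one final diagnosis is required")
    return max(diagnoses, key=lambda category: SEVERITY_RANK[category])


print(overall_final_diagnosis(["not serious", "uncertain"]))              # uncertain
print(overall_final_diagnosis(["not serious", "serious", "uncertain"]))  # serious
```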

Clinical assessment
All participants with follow-up questionnaire data had contacted their GP. Almost all had some form of clinical assessment (153/170 [90.0%]), most frequently blood tests (29.4%), further imaging (78.8%) or specialist referral (64.1%), with smaller proportions having other tests (8.8%), change of medication (10.5%) or an invasive procedure or operation (14.2%) (Table 3). The proportions having each type of clinical assessment were generally higher for those with a serious compared with non-serious final diagnosis, particularly medication changes (44.4% serious versus 6.3% non-serious) and invasive procedures (61.1% versus 8.3%). However, the absolute numbers having clinical assessment were far higher among the many more participants with non-serious final diagnoses. Of the 153 participants reporting some form of clinical assessment, 133 had a non-serious final diagnosis, suggesting that further clinical assessment might not have been necessary (Table 3).
Of particular note, similar absolute numbers of participants had invasive, potentially harmful, procedures irrespective of whether their final diagnosis was serious or non-serious (11 and 12 participants, respectively).

Table footnotes:
PSIF = potentially serious incidental finding, MRI = magnetic resonance imaging, DXA = dual energy X-ray absorptiometry.
1 Includes three participants whose final diagnoses remained uncertain as of April 2016: one participant with a lung nodule was still under assessment; another participant with a lung nodule had been diagnosed with lymphoma, but it remained unclear whether the nodule was related to the lymphoma or not; and we were unable to contact one participant to determine the final diagnosis of DXA appearances suggesting a crush fracture.
2 Fifteen participants had more than one non-serious final diagnosis arising from more than one modality.

Impact on participants
Feedback about a potentially serious IF also had an impact (presumed to be adverse) on participants' emotional wellbeing (21/124, 16.9%), insurance or finances (11/124, 8.9%), and work or activities of daily living (7/124, 5.6%). The proportion of participants reporting an impact on emotional wellbeing was higher among those with a serious final diagnosis, but the absolute numbers were higher among those with a non-serious final diagnosis, for whom these impacts could be considered to constitute net harm (Table 3). In addition to the 21 reporting an impact on emotional wellbeing in response to the relevant survey question, participants and/or their GPs spontaneously mentioned worry within questionnaire free-text responses for a further 62 participants (examples shown in Box 1).

Box 1. QUOTATIONS FROM PARTICIPANTS AND THEIR GENERAL PRACTITIONERS (GP)
Participant with a non-serious final diagnosis, six-week questionnaire: "Better to know, but I did feel anxious for a few weeks."

Table 3 footnotes:
Denominators vary due to differences in questionnaire return rates and whether or not the relevant questions had been answered on returned questionnaires.
5 Any impact on either the emotional wellbeing of the participant, their friends, or their family, or on family life (combined responses across several related questions).
6 Any impact on either the cost or ability of obtaining travel, or health or life insurance or on their overall financial situation (combined responses across several related questions).
7 Any impact on having to take time off work, change job or retire, or have help for activities of daily living (combined responses across several related questions).
8 This question was asked on both the six-week and the six-month participant questionnaires. 145 participants answered the question at least once and formed the denominator. 98 of 100 who answered the question both times did so consistently (they were glad to have been told on both occasions) and were included in the numerator. Two of these 100 participants (both with final non-serious diagnoses) gave different answers on each questionnaire (one was glad to have been told at six weeks, but by six months would rather not have been told, while the other would rather not have been told at six weeks but was glad to have been told at six months). One further participant (who returned a single six-month participant questionnaire) reported that they would rather not have been told about their potentially serious IF, which was finally diagnosed as a non-serious condition.
9 This question was asked on both the six-week and the six-month participant questionnaires. 148 participants answered the question at least once and formed the denominator. Answers from the 101 participants who returned both questionnaires and answered the question both times were all consistent.
10 This question was asked on both the six-week and the six-month participant questionnaires. 149 participants answered the question at least once and formed the denominator. 69 of the 105 participants who answered the question both times did so consistently and were included in the numerator. 36 gave different answers on each questionnaire: 22 changed their view from 'should always be told' to 21 'should be able to choose' and one 'no opinion'; 14 changed from 'should be able to choose,' to 11 'should always be told' and three 'other option'.

(Table 3). Since participants were asked both at six weeks and at six months about this, we were able to assess whether the answers of 105 participants who responded on both occasions changed over time. While 69 had consistent responses, 36 changed their views (n=21, Table 3: footnote 10).

Results of the qualitative study
Deliberative group discussions about consent involved a group of 10 'more affluent' participants (Townsend score <-2, four female, mean age 61, SD 9.1 years), and a group of 11 'less affluent' participants (Townsend score >0, six female, mean age 66 years).
One-to-one interviews involved an additional 21 participants who received feedback about a potentially serious IF (13 'more affluent', 13 female, mean age 66 years). Analysis of the interview data revealed that participants were motivated to attend the imaging study by altruism, to experience MRI scanning firsthand (in case they needed to attend for investigations for a medical concern later in life), and to receive feedback about potentially serious IFs. Participants could not always recall precise details of the consent process with respect to feedback of IFs, but they were generally unconcerned about this as they trusted UK Biobank to act appropriately. One-to-one interviews further demonstrated that the implications of receiving feedback were not fully understood until after the event, that feedback resulted in short-term anxiety, and that participants tended to assume the worst on receiving feedback; indeed, some were surprised that the final diagnosis might be non-serious, having anticipated a diagnosis of cancer, an aneurysm or a serious heart condition. Further details of the qualitative study results are available at http://www.ukbiobank.ac.uk/resources/.

Discussion
Compared with systematic review of images by radiologists, the UK Biobank IFs protocol (radiographer flagging) resulted in approximately 10-fold fewer participants with non-serious diagnoses (i.e., false positives), but missed 16/21 potentially serious IFs that were ultimately diagnosed as serious disease (i.e., false negatives).
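The headline comparison can be reproduced from counts reported in this paper (179 and 18 participants with potentially serious IFs, of whom 21 and 5 had serious final diagnoses). The sketch below is illustrative only: the function and key names are hypothetical, and treating systematic radiologist review as the reference standard for serious final diagnoses is an assumption made for the calculation.

```python
def summarise_protocol(psif, serious, imaged, reference_serious):
    """Summarise one protocol from its counts of potentially serious IFs (PSIFs)."""
    return {
        # Share of imaged participants with a PSIF fed back
        "psif_prevalence": psif / imaged,
        # Share of PSIFs with a serious final diagnosis
        "serious_fraction": serious / psif,
        # Participants with a PSIF but no serious final diagnosis
        "not_serious": psif - serious,
        # Share of the reference serious diagnoses that this protocol found
        "share_of_reference_detected": serious / reference_serious,
    }


IMAGED = 1000
REFERENCE_SERIOUS = 21  # serious final diagnoses found by systematic review

systematic = summarise_protocol(179, 21, IMAGED, REFERENCE_SERIOUS)
flagging = summarise_protocol(18, 5, IMAGED, REFERENCE_SERIOUS)

missed = REFERENCE_SERIOUS - 5
fold_fewer = systematic["not_serious"] / flagging["not_serious"]

print(f"Serious fraction: {systematic['serious_fraction']:.1%} vs {flagging['serious_fraction']:.1%}")
print(f"Not serious: {systematic['not_serious']} vs {flagging['not_serious']}")
print(f"Serious diagnoses missed by flagging: {missed} of {REFERENCE_SERIOUS}")
print(f"Fold difference in non-serious findings: {fold_fewer:.1f}")
```

The fold difference computed this way (158 versus 13 participants) is consistent with the "approximately 10-fold" description in the text.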
Extrapolation of our results to the 100,000 participants who will be imaged by UK Biobank over the next few years suggests that systematic radiologist review would generate 15,800 false positives, compared with 1,300 under the UK Biobank IFs protocol (radiographer flagging), and would detect serious diagnoses in 2,100 participants compared with 500 under radiographer flagging (Figure 2).
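The extrapolation is a linear scaling of the counts observed in the first 1000 participants. A minimal sketch, assuming the rates stay constant across the full cohort (the function name and dictionary layout are hypothetical):

```python
SAMPLE_SIZE = 1000     # first imaged participants evaluated in this study
TARGET_SIZE = 100_000  # planned size of the imaging study


def extrapolate(count, sample_size=SAMPLE_SIZE, target_size=TARGET_SIZE):
    """Scale a count observed in the sample linearly to the target cohort."""
    return count * target_size // sample_size


projected = {
    "systematic_review": {
        "false_positives": extrapolate(158),
        "serious_diagnoses": extrapolate(21),
    },
    "radiographer_flagging": {
        "false_positives": extrapolate(13),
        "serious_diagnoses": extrapolate(5),
    },
}

for protocol, counts in projected.items():
    print(protocol, counts)
```

This reproduces the figures quoted above: 15,800 versus 1,300 false positives and 2,100 versus 500 serious diagnoses.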
Systematic radiologist review in our study generated a prevalence of potentially serious IFs of 17.9%. The prevalence in other whole-body MRI studies of healthy populations ranged from 12.8% to 57.6% 17-20 . Since those studies used similar MRI sequences applied to similar tissue volumes, variations in prevalence are most likely to have arisen from differences in the definition of IFs, or in the age and other characteristics of the imaged populations.
Almost all participants with potentially serious IFs had subsequent clinical assessment, resulting in large numbers of investigations, referrals and procedures. Many of these were, with hindsight, unnecessary, with risk of direct harm as well as cost implications. Impact on emotional wellbeing, insurance or finances, and on work or daily activities were reported by a higher proportion of participants with serious final diagnoses, but affected a higher absolute number of participants without serious final diagnoses. In keeping with these results, over half of participants in the Study of Health in Pomerania who received feedback of an IF detected on whole-body MRI reported psychological distress 8 .
Only around one-third of our participants believed that participants should always be told about potentially serious IFs. Similar proportions of participants with serious and participants with non-serious final diagnoses expressed this opinion. However, almost a quarter of participants changed their opinion over the few months between the six-week and six-month questionnaires on whether participants should or should not be able to choose to receive feedback of an IF (Table 3: footnote 10), illustrating the complexities in interpreting opinions on this issue.
The findings of this study are of practical legal and ethical importance, and can be considered with regard to the duties of care, and the ethical principles of respect for autonomy, and beneficence and non-maleficence toward participants and towards the public. The legal and ethical background to UK Biobank's approach was developed with input from its Imaging Working Group, its independent Ethics and Governance Council, representatives of its major funders (Wellcome Trust and the Medical Research Council), UK Biobank's legal counsel and external legal counsel and ethics advice. In brief, it was considered likely that the duty of care owed to participants by radiographers would not be of a clinical standard, but rather what a reasonably competent radiographer conducting research imaging without clinical information could reasonably observe and report. This legal duty of care informs the ethical duties of radiographers, i.e., that they must be capable of meeting the standards of care which are detailed in the consent process. Therefore, in order to respect potential participants' autonomy, it is paramount that UK Biobank have an IFs protocol in place, and that this protocol and its limitations are explained to and understood by participants.
Our results reinforce the need for clarity in the information provided to participants about the feedback policy before they consent to imaging research studies. While participants' understanding of what they had consented to was generally good, a substantial minority (around a quarter) incorrectly thought that they could choose whether or not to receive feedback. The information materials for the UK Biobank imaging study now further emphasize the difference between research and clinical diagnostic imaging, that the imaging is not a 'health check,' that not all serious disease will be detected, and that some potentially serious IFs will prove to be non-serious with further investigations (http://imaging.ukbiobank.ac.uk/).
Considering the ethical principles of beneficence and non-maleficence toward both participants and the public, our data suggest that feeding back potentially serious IFs which turn out not to be serious (false positives) can make some participants worse off, through exposure to the inconvenience, worry and potential harms of clinical assessments, including invasive procedures. Feedback of false positives also results in wider harm through the unnecessary use of publicly-funded health services. Missing a serious disease (false negative) does not make participants worse off compared to their status before receiving feedback of a potentially serious IF; rather, it fails to make participants better off. While the literature about IFs sometimes argues that feedback is inevitably beneficial 21 , the balance of potential benefits and harms of earlier diagnosis (of IFs which are actually serious) is uncertain. It is important to reiterate that UK Biobank is a research resource which aims to facilitate research which will benefit public health, rather than provide any form of health services to individual participants. We therefore conclude that the responsibilities of researchers to avoid unnecessary harm to significant numbers of participants and disruption to publicly-funded health services mean that radiographer flagging (resulting in far fewer false positives while missing a small number of true positives with unclear benefit of earlier diagnosis) constitutes an ethically more justified approach in the UK Biobank imaging study than systematic radiologist review.
Some might argue that concerns about generating false positives suggest the case for a policy of no feedback of any IFs. However, in the light of legal advice regarding the duty of care it owed to participants as described above, UK Biobank decided not to withhold all feedback on potentially serious IFs, but to minimize the generation of false positives by only feeding back potentially serious IFs which are also confirmed by a radiologist. This approach to potentially serious IFs should be seen within the context of large-scale, population based imaging of healthy volunteers; a different approach may well be appropriate for other types of imaging studies, which may be smaller, based in clinical centres, have a different duty of care between research participants and researchers, or include participants with different characteristics (e.g., age) to those in the UK Biobank study.
While our underlying objective was to test the IFs protocol for the UK Biobank imaging study, our findings are of potential relevance in other contexts in which individuals are imaged prior to clinical presentation of disease, including public health and commercial screening. In both situations, it is important to consider the potential benefits of making a true positive diagnosis versus the potential harms to the individual and to publicly-funded health services, of a false positive diagnosis. The significant number of false positives generated by systematic radiologist reporting in our study implies that imaging of asymptomatic people should not be undertaken without appropriate concern for ensuring that the individuals being imaged do not end up worse off than they started.

Strengths
Our study is the first to systematically follow up all participants receiving feedback about IFs and their GPs, giving the most comprehensive data on the impact of feedback of potentially serious IFs in any research imaging study to date and providing the first quantitative comparison of two different protocols for handling IFs. We have demonstrated for the first time the much lower rates of potentially serious IFs and, most importantly, false positives detected with a protocol in which radiologists report only those images which radiographers flag as having potentially serious IFs. Although the public support the principle of providing feedback of IFs 14 , regardless of clinical severity 12 , most previous studies did not survey people who had actually received feedback. Our findings are crucial to informing future policy surrounding feedback of IFs in research studies.
Our study was strengthened by good questionnaire response rates and near complete data on final diagnoses due to extensive efforts to gather these directly from participants and their GPs, and data collection at both early and later time periods following feedback. Results related to understanding of consent and impact of feedback on participants were confirmed and contextualised in a parallel, qualitative study.

Limitations
Radiographer flagging rates could, in principle, have been influenced by a relative lack of experience with the first 1000 imaged participants, or by knowledge that radiologists were also reviewing all images. However, ongoing collection of data on potentially serious IFs in the 7000 participants imaged subsequently showed the prevalence of IFs detected by radiographers to be broadly consistent over time with a stable prevalence of potentially serious IFs confirmed by radiologists (mean proportion of 1.7%) (Supplementary File 10).
Although questionnaire response rates by participants were generally high, only around two thirds of participants' GPs responded about participants' emotional well-being and overall net benefit/harm. The design of the questionnaires did not allow for quantification of the use of particular health services or evaluation of the associated costs. However, UK Biobank continues to collect data from participants with potentially serious IFs and their GPs through questionnaires, supplemented by linkages to national health datasets. This will enable further clinical, health economic and policy issues to be addressed using data from larger numbers of imaged participants.
Classification of final diagnoses as serious or not was based on clinical judgement of data available up to around six months following feedback of a potentially serious IF. Final diagnoses classified as serious may not actually shorten life span, or substantially impact on major body functions or quality of life in the 21 participants concerned, who were apparently healthy at the time of their imaging visits. Some potentially serious IFs may take longer than six months to diagnose, or for their full impact to become clear, potentially leading to an incomplete picture of the adverse impacts of feedback.

Conclusions
The handling of potentially serious IFs merits serious consideration by researchers undertaking imaging research studies. Our data provide evidence to inform policy for large-scale research imaging in healthy populations, and are relevant to asymptomatic populations undergoing public health screening and commercial imaging. They demonstrate that systematic radiologist review of all images leads to the diagnosis of previously unknown serious disease in some participants. However, the great majority of these findings turn out not to be serious, resulting in unnecessary anxiety for the participant and unnecessary clinical assessment, which may include invasive procedures, provided by publicly-funded health services. Further, for those participants whose IFs do turn out to be serious, it is often difficult to ascertain whether this knowledge results in clear clinical benefit.
There is no 'one-size-fits-all' approach to handling IFs, as much depends on the purpose of the imaging, be that research, screening, or clinical care. In research studies of healthy volunteers, for whom there is no direct benefit for taking part, it is particularly critical to minimise harm. Based on these results, we suggest that this is achieved in an imaging study of UK Biobank's scale and complexity with a protocol in which radiographers flag suspicious images for reporting by radiologists, rather than systematic review of all images by radiologists.

Data availability
Due to the confidential nature of questionnaire responses and clinical information on participants with potentially serious incidental findings, it is not possible to publicly share all of the data on which our analyses were based, but extensive summaries of all relevant data are included in the supplementary material and within the linked online material.
Importantly, any bona fide researcher can apply to use the UK Biobank resource, with no preferential or exclusive access, for health related research that is in the public interest. Application for access to UK Biobank data involves registration and application via the UK Biobank website, with applications considered by the UK Biobank Access Sub-Committee. Following approval, researchers and their institutions sign a Material Transfer Agreement and pay modest access charges. Further information on applying to access UK Biobank data is available at: http://www.ukbiobank.ac.uk/register-apply/.
This much-awaited paper reports the experiences of UK Biobank, one of the largest research imaging efforts, with incidental findings. The results are of relevance to other research imaging groups around the world, and make a valuable empirical contribution to evolving ethics and policy discussions on the management of incidental findings in research imaging contexts. A strength of this paper is that it monitors both the clinical impact and the psychosocial impact of the feedback of incidental findings on research participants. Also, the results of this study have been used to improve the informed consent process of UK Biobank (p. 12). The paper is nicely and clearly structured and comprehensive. I have three, not too major, concerns with this paper.
- The authors claim at several points in the text that e.g. "limited data exist" on the clinical and other implications of learning about incidental findings on research participants and that "data (...) are scarce" but would be much welcomed to inform the debate on appropriate protocols for handling incidental findings (page 3). Thus, the authors seem to suggest that their study is "the first" (page 12) to have looked into these implications empirically. This is not the case.
The authors may have overlooked some of the available evidence and should either discuss this evidence or rephrase sections of the paper in which they suggest that there is little evidence.

- At times, the ethical argumentation falters a little. For instance, in the introduction the authors state that in research studies, potential harms should be minimised. This is correct, but a reference might clarify the scope and nature of this assumed obligation, as there are many different conceptions and interpretations of this obligation. Also on page 11, references are missing when the authors are discussing the principles of beneficence and non-maleficence and respect for autonomy. On page 12, it is argued that (the many) false positives (associated with systematic radiologist review) will make research participants (and society) worse off through unnecessary follow-up testing, while false negatives do not make participants worse off. I do not agree. False negatives can lead to false reassurance, which may pose health risks. The authors say that the participant information materials now explain more clearly how participation in UK Biobank does not constitute a health check (page 12). However, I am concerned that a subgroup of participants will still believe or expect their images to be reviewed for abnormalities, and will thus run the risk of false reassurance. Also, there is a difficulty that the harms associated with false positives are felt on a societal level (the costs and the efforts involved in (often unnecessary) follow-up), but not on an individual level: 97.7% of participants "reported being glad to be told about their potentially serious" incidental findings (p. 10). Thus, the authors slightly downplay the harms associated with false negatives and highlight the harms associated with false positives. Their conclusion that radiographer flagging is better than systematic radiologist review (with a lower rate of false positives) does not come as a surprise, but may be based on a, in my view, slightly skewed weighting of benefits and harms.
However, I do agree with the authors that researchers' obligations are mostly to meet the requirements detailed in the informed consent process, and also that there are good pragmatic reasons for UK Biobank to opt for a radiographer flagging policy, and that this is acceptable as long as the consent process is careful and effective in conveying that images are not being checked for abnormalities.
- And a final question: on page 4, it is explained that "radiographers were trained in the relevant imaging protocols but did not receive specific training in image interpretation". In a paper that Prof. Dr. Meike Vernooij and I wrote some time ago, we argued that whether an incidental finding is detected (in this context: whether and what kinds of findings will be flagged) will depend upon various technical, social and organisational factors, including the training, message, or instructions given to the radiographers. For this reason, I am curious to know what was said/what is being said to the radiographers by the project leads (e.g. "If you see something, you should notify X. Do try not to see things. Remember, this is a research study, not a clinical setting. Check the images for quality only, try not to look at any potential abnormalities." or something very different). Maybe the authors can add one sentence to the section on the two protocols to explain e.g. whether or not radiographers were discouraged from noticing findings or any other relevant variables in the instructions given to radiographers. Providing these details to research participants as part of the consent process could also be a way of conveying to participants that the research imaging does not constitute a health check.
Overall, I support the indexing of this paper.

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility?
We thank Dr Bunnik for taking the time to read our manuscript and for providing helpful comments on several aspects of our work.
In particular, thank you for suggesting that we add a reference to the paper De Boer et al. [1] We became aware of this work after our manuscript was sent out for initial peer review; we appreciate the need to update our text, and we have added the reference accordingly. Similarly, we were aware of the work of Bos et al. [2] as we are conducting a systematic review of the prevalence of incidental findings on brain and body imaging. [3] We state in our introduction that limited data 'exist on the impact of feedback of IFs on participants [4] and health services [5]' [6] with references to studies of the psychological [4] and economic impacts, [5] and we agree with Dr Bunnik that a reference to Bos et al. would be suitable here, and have added this to the text. However, despite this additional reference, we do think there remain very limited robust empirical data on the impact of feedback of IFs; while we do not provide a comprehensive review of the published evidence here, we hope to describe this in forthcoming manuscripts.
We appreciate that a large body of literature exists on the obligations of researchers to research participants, and on the ethical principles of beneficence, non-maleficence and autonomy. Following an initial peer review from Professor Bjorn Hofmann, we expanded on the particular ethical and legal background from which the UK Biobank IFs protocol was developed. We agree with Dr Bunnik that a lack of feedback of IFs may be misunderstood by some participants as false reassurance of health, and UK Biobank continue to evaluate participants' understanding of consent. UK Biobank does not use questionnaires of participants and their general practitioners to follow up participants without potentially serious IFs to determine whether or not these represent 'false negatives.' As such, we feel that we do not downplay the harms of false negatives, but simply do not have the data to comment on these at present. We also agree with Dr Bunnik that the economic impact of false positive IFs constitutes an important harm, and while we do not present data here, it is the subject of a forthcoming manuscript. We hope that the data we do present here will contribute meaningfully to the ongoing discussion of the ethics of feedback of IFs, but feel that a more extensive discussion is beyond the scope of this current work.
We agree with Dr Bunnik that the training and instructions given to the radiographers would potentially impact on the prevalence of IFs. [7] UK Biobank trains radiographers to acquire research imaging data and perform quality checks of the images at the time of the scan. If the radiographers happen to notice something on the scan that they think could be potentially serious (either a finding listed in Supplementary File 3, or a finding that meets the UK Biobank definition of potentially serious), then they are instructed to flag the images for review by a radiologist. The radiographers are not instructed to actively look for, or to avoid looking for, IFs; rather, they are instructed that if they happen to notice a concerning finding, they should flag it for review.
We thank Dr Bunnik again for her review, and for stimulating an interesting discussion of aspects of our work.

The paper is unique in that it quantifies the trade-offs between radiologists screening for incidental findings versus radiographers. The findings are not surprising: radiologists detect more true positives but also more false positives. The scale of the difference is surprising. The analysis is granular and the discussion is robust. The authors have anticipated many criticisms, and preemptively addressed them.
The paper would be strengthened by three additions: 1) a comparison of the operating characteristics of radiologists and radiographers graphically; 2) a tabulation of the serious incidental findings picked up by both groups and, in particular, a clearer explanation of what the radiographers missed; and 3) a brief explanation of how they concluded that letting radiographers screen leads to less net harm. I get it, intuitively, but many might be tempted to argue, and since this is a key point, how the authors arrived at this conclusion should be better explained. An economic model isn't needed, but expansion of some examples would help.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
We would like to thank Dr Jha for his comments.
Dr Jha suggested that we provide a comparison of the operating characteristics of radiologists and radiographers as a figure, and we wondered whether Dr Jha would like us to provide a figure showing a receiver operating characteristic curve. If so, we have deliberately chosen not to display such a figure, as it may give the misleading impression that systematic radiologist review of research images is a 'gold standard' protocol to which radiographers are being compared. Our article does not attempt to define one protocol as the 'gold standard' or 'best' protocol. Instead, we feel that there is no single 'best' protocol for handling PSIFs; rather, there will be more, and less, appropriate protocols depending on the imaging context. Our article therefore focuses on describing and weighing up the impacts, benefits and harms of each protocol in order to determine which is most appropriate to apply within the specific research context of the UK Biobank imaging study of 100,000 largely asymptomatic participants. We apologise if we have misunderstood Dr Jha's comment, and we would be more than happy to readdress this point if so.
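As an illustrative sketch only, and not part of the published analysis, the counts reported in the abstract can be summarised for each protocol on its own terms, without designating either protocol as a reference standard. The dictionary keys and the function name `summarise` are hypothetical.

```python
# Counts are those reported in the abstract for the first 1000 imaged
# participants; all names and the structure are illustrative assumptions.
IMAGED = 1000

protocols = {
    "systematic radiologist review": {"psif": 179, "serious": 21},
    "radiographer flagging": {"psif": 18, "serious": 5},
}


def summarise(psif: int, serious: int, imaged: int = IMAGED) -> dict:
    """Describe one protocol without comparing it to a reference standard."""
    return {
        # Participants with a potentially serious IF, per 100 imaged.
        "psif_pct_of_imaged": round(100 * psif / imaged, 1),
        # Share of those participants whose final diagnosis was serious.
        "serious_pct_of_psif": round(100 * serious / psif, 1),
        # Participants given feedback whose final diagnosis was not serious.
        "non_serious_final_diagnoses": psif - serious,
    }


summaries = {name: summarise(**counts) for name, counts in protocols.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Presenting the two protocols side by side in this way keeps the comparison descriptive: it shows how many participants receive feedback and how often that feedback proves serious, which is the trade-off the article weighs.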
The serious final diagnoses detected under each protocol are tabulated in Supplementary File 7. In brief, systematic radiologist review resulted in 21 serious final diagnoses. Radiographer flagging detected five of these 21 serious final diagnoses (one arachnoid cyst with hydrocephalus, one meningioma compressing brainstem, and three thoracic aortic aneurysms), and missed 16/21 (two pituitary tumours, two thoracic aortic aneurysms, three lung tumours, two cardiomyopathies, and one each of: atrial fibrillation, coronary heart disease, heart block with left ventricular impairment, abdominal aortic aneurysm, gastrointestinal stromal tumour, pancreatic neuroendocrine tumour, and an osteoporotic crush fracture). We have added this text to our results section. Dr Jha asked for further explanation about how we concluded that radiographer flagging resulted in less net harm compared to systematic radiologist review of all images. We elaborate on this in our response to a related comment made by Professor Hofmann, and we hope that our approach addresses Dr Jha's comments.
This study investigates radiographer 'flagging' with radiologist confirmation of potentially serious incidental findings (IFs) compared with systematic radiologist review of images of brain, cardiac and body magnetic resonance, and dual-energy x-ray absorptiometry scans from the first 1000 imaged UK Biobank participants. The study assessed the number and types of potentially serious IFs detected and their final clinical diagnoses. The study also includes a qualitative assessment of participants' experience and understanding of participation and findings.
The study finds that radiographer flagging missed some serious diagnoses, but avoided adverse impacts for many participants with non-serious diagnoses, compared to systematic radiologist review. This leads the authors to conclude that UK Biobank's responsibility to avoid both unnecessary harm to larger numbers of participants and burdening of publicly-funded health services suggests that radiographer flagging is a justifiable approach in the UK Biobank imaging study.
The study appears well conducted and is well reported. Figures and tables are informative and the manuscript is well structured. The findings are interesting and make new contributions to the field. This is a valuable study, also beyond the UK Biobank imaging study. In particular, data on final clinical diagnosis and the impact of feedback of IFs are scarce. The study is distinctive in assessing the number and types of potentially serious IFs detected and their final clinical diagnoses. It is also quite unique in investigating the impact of providing feedback about potentially serious IFs on participants, their friends, families and health services, with respect to factors such as: clinical assessments undertaken; emotional wellbeing, finances, work and daily activities; and participants' and their general practitioners' opinions about receiving feedback.
I have some detailed remarks, which hopefully can be helpful to the authors in improving the manuscript even further.
The study used a list of potentially serious IFs (presented in a supplementary file); however, the authors do not discuss the inclusion criteria for this list. For instance, which criteria exist for severity, accuracy, and actionability for the various conditions? How does this relate to feedback of IFs in other fields, e.g., ACMG's recommendations from 2013?
The reader may also want a discussion on why radiographers "did not receive specific training in image interpretation," and whether such training would alter the outcomes. Some indications are given (from the group's experience beyond the first 1000), but competency gained from formal directed training may be different from practical experience (based on volume). From the text one may infer that radiologists in both groups had access to data collected during the imaging visit (on the participant's age, sex, body mass index, self-reported smoking status, alcohol consumption, medical history and medications), but this is not completely clear. This can easily be made explicit.
The authors classified the final clinical diagnoses as serious if the findings were likely to significantly threaten lifespan or have a major impact on quality of life or major body functions of the research participants. It is unclear how "significantly threaten" is interpreted. Is it a risk score? How does it balance the severity of the event and its probability?
The authors' claim that it is "often assumed that early observation on imaging of presumed disease (prior to clinical presentation) is inevitably beneficial" has recently been confirmed in a systematic review of the literature.
It is not quite clear what is meant by: "We reconciled multiple responses on similar items from the three questionnaires by prioritising 'yes' responses and included data from coding of free text responses." Careful reading explains this, but the authors may want to help the reader here.
With regards to the UK Biobank lists of incidental findings (IFs) provided in Supplementary File 3, Professor Hofmann asked us to clarify criteria used to select IFs for this list, such as severity, accuracy and actionability, and how the list relates to feedback of IFs in other fields, such as the American College of Medical Genetics (ACMG) recommendations from 2013.
The lists of IFs deemed potentially serious (i.e. for feedback), and those deemed non-serious were developed after discussion with radiologists, other relevant imaging reporting specialists, radiographers, members of UK Biobank's Imaging Working Group, and with reference to work conducted by the German National Cohort (GNC) study. The GNC lists were developed specifically for the GNC imaging study, after review of the literature and discussion of best practice by radiologists familiar with the GNC research imaging sequences, and the GNC ethical framework, which aimed to feed back relevant findings, and not feed back irrelevant findings.
At the time of the development of the lists of IFs, there were limited empirical data available on the prevalence and types of IFs that could be expected on the types of imaging to be conducted by UK Biobank. Furthermore, the available studies differed in their definitions of IFs, some, but not all, of which included concepts such as severity and actionability within their definitions. Therefore, to further inform on the prevalence and types of IFs which may be expected on imaging conducted by UK Biobank, we conducted a systematic review of potentially serious incidental findings (PSIFs, as per the UK Biobank definition) on brain and body magnetic resonance imaging. We will report this work within a separate manuscript.
An ACMG working group generated a list of genetic mutations and recommended that these are sought out and reported when a laboratory performs any clinical exome or genome sequencing. In contrast, the UK Biobank lists of IFs are certainly not used as checklists to purposefully seek out, or exclude, specific types of IFs by either the radiographers or radiologists. Rather, when a radiographer happens to see something abnormal on a scan, during image acquisition or quality assurance checks, or when a radiologist is reviewing a flagged image, they can refer to the lists in conjunction with UK Biobank's definition of a potentially serious IF when judging whether any observed IF was potentially serious (i.e. for feedback to participants and their general practitioners [GPs]) or not.
To address Professor Hofmann's comment that, 'from the text one may infer that radiologists in both groups had access to data collected during the imaging visit (on the participant's age, sex, body mass index, self-reported smoking status, alcohol consumption, medical history and medications), but this is not completely clear,' we have amended the relevant text to improve clarity.
We classified final clinical diagnoses as serious if the findings were likely to significantly threaten lifespan or have a major impact on quality of life or major body functions of the research participants. Professor Hofmann asked how we interpret the term "significantly threaten." There is a paucity of empirical evidence on the natural history and final diagnoses of IFs, and to our knowledge there are no validated risk scores for quantitatively determining the risk to lifespan of particular IFs which are detected on research imaging. Our classification of final diagnoses as 'serious' is, as we mention in the limitations subsection of the discussion, a matter of clinical judgement. We also write that, as such, "'serious' final diagnoses may not actually shorten life span, or substantially impact on major body functions or quality of life in the 21 participants concerned, who were apparently healthy at the time of their imaging visits." Given this inherent subjectivity in the classification of serious final diagnoses, we measured the repeatability of the clinical judgements of final diagnoses severity, and demonstrated a very good level of agreement. Independently, a consultant physician and an experienced specialty clinical radiology trainee classified final diagnoses, and we report in our results section that these two doctors agreed in 172/179 (96.1%) cases, with the remaining seven cases easily resolved by discussion.
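For readers who wish to check the arithmetic, the raw percentage agreement quoted above can be reproduced with the short Python sketch below. The function name `percent_agreement` is hypothetical, and the calculation is simple observed agreement only: a chance-corrected statistic would need the full cross-tabulation of the two raters' classifications, which is not given in this response.

```python
def percent_agreement(agreed: int, total: int) -> float:
    """Observed agreement between two raters, as a percentage of all cases."""
    if total <= 0:
        raise ValueError("total must be a positive number of cases")
    if not 0 <= agreed <= total:
        raise ValueError("agreed must lie between 0 and total")
    return 100 * agreed / total


# 172 of the 179 final diagnoses were classified identically by both doctors.
agreement = percent_agreement(172, 179)
disagreements = 179 - 172

print(f"{agreement:.1f}% agreement, {disagreements} cases resolved by discussion")
```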
We state that "it is often assumed that early observation on imaging of presumed disease (prior to clinical presentation) is inevitably beneficial, but data on final clinical diagnosis and the impact of feedback of IFs are scarce." Professor Hofmann kindly directed us toward articles describing a surge in publications on early detection of disease, and a systematic review which demonstrates that some common screening tests are not associated with a reduction in either disease-specific or all-cause mortality. However, we have chosen not to add these references to the article for three reasons. Firstly, we wish to separate PSIFs (and IFs more generally) from the concept of early detection of disease, as our data demonstrate that the vast majority of PSIFs will not be finally diagnosed as serious conditions, i.e. the majority do not represent early detection of disease. Secondly, we wish to keep separate the concepts of screening programs from protocols for handling IFs detected during research imaging; whilst data on the benefits and harms of screening programs may be generalizable to the context of PSIFs, screening purposefully for a particular disease using a validated test is a different context to the non-optimized demonstration of an abnormality (which may or may not represent disease) on research imaging, although we accept that the populations undergoing screening and population-based imaging research (i.e. asymptomatic people) are similar. Finally, whilst a discussion of our results in the context of screening, early detection of disease, and overdiagnosis is of great interest to us, as researchers and clinicians, we wish to keep this article focused in its scope.
With regards to our methods section, Professor Hofmann commented that, 'it is not quite clear what is meant by: "we reconciled multiple responses on similar items from the three questionnaires by prioritising 'yes' responses and included data from coding of free text responses."' We would like to clarify this with examples. The two questionnaires sent to participants, and the questionnaire sent to their general practitioners (available online at: http://www.ukbiobank.ac.uk/resources/) all asked whether or not the participants had been referred to a specialist. The participant questionnaires have tick-box response options of 'yes,' 'no' and 'don't know.' The GP questionnaire is different, and asks the GP to tick a box if that action has been taken (i.e. no tick may represent 'no' or 'don't know'). In addition, there are multiple free text fields available on all three questionnaires. Therefore, multiple responses may be available about specialist referrals, depending on the return of the questionnaires, and the completion of tick boxes and free text spaces. We therefore had to reconcile these multiple responses, and decided to prioritise 'yes' responses, in order to generate a maximum count. For example, if a participant responded that they did not know if they had been referred, but their GP ticked that they had been referred, we prioritized the 'yes' response of the GP, and coded the participant as being referred to a specialist. Similarly, if a participant indicated on their six-month questionnaire that 'no,' they had not been referred to a specialist, but had previously indicated 'yes,' they had been referred on their six-week questionnaire (either by ticking the box, or mentioning a specialist appointment in free text, or both), we coded the participant as having been referred to a specialist. This methodology maximizes the counts of types of follow-up and impacts and makes use of the maximum amount of data available.
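The reconciliation rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `reconcile` and the string codes are hypothetical, and the handling of cases with no 'yes' at all (a 'no' taking precedence over 'don't know' or a missing answer) is an assumption, since the text specifies only that 'yes' responses are prioritised.

```python
from typing import Iterable, Optional


def reconcile(responses: Iterable[Optional[str]]) -> str:
    """Collapse several answers about one item into a single code.

    Each element is one source's answer for the same item, for example the
    six-week tick box, the six-month tick box, the GP tick box, or a coded
    free text field. None stands for a missing answer or an unticked GP box.
    """
    cleaned = [r.strip().lower() for r in responses if r]
    if "yes" in cleaned:
        # Any 'yes' from any source wins, which maximises the count.
        return "yes"
    if "no" in cleaned:
        # Assumption: without a 'yes', an explicit 'no' outranks 'don't know'.
        return "no"
    return "unknown"


# Participant did not know, but the GP ticked the referral box.
print(reconcile(["don't know", "yes"]))
# 'Yes' at six weeks, 'no' at six months.
print(reconcile(["yes", "no"]))
```

Both examples print `yes`, matching the two worked cases in the paragraph above.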
We have added a description of this methodology to the end of the document containing the questionnaires, hosted at http://www.ukbiobank.ac.uk/resources/, and we have added this link to the appropriate text in the methods section.
Professor Hofmann asked why radiographers did not receive specific training in image interpretation. The UK Biobank is a research resource, and as such, is not aiming to provide any form of individual health service, including image interpretation. Accurate image interpretation, even by radiologists, is difficult in any case within the context of UK Biobank, given the lack of clinical information on current symptoms or signs, and the non-diagnostic nature of the research imaging. This is clearly evident from our results: the vast majority of PSIFs detected by radiologists are finally diagnosed as non-serious disease. Within their typical roles in health services, radiographers are not trained to provide interpretation of cross-sectional imaging. Given the limitations of the research imaging, the difficulties in interpreting it (even by radiologists) and the typical role of the radiographers, rather than training radiographers to interpret multiple modalities of non-diagnostic cross-sectional imaging without any clinical information, UK Biobank opted instead to manage participants' expectations of what could reasonably be expected. To this end, our consent materials state that the imaging is not a 'health check,' and lack of feedback does not constitute an 'all clear,' and we continue to evaluate participants' understanding of consent with regards to feedback of PSIFs.
Professor Hofmann stated that the ethical issues described in our article require some elaboration, including 1) some reflection on the relationship between legal and ethical considerations, 2) further explanation of how we concluded that the return of IFs is warranted after considering "legal advice" and "duty of care," and 3) that principles other than non-maleficence, such as professional ethics, are relevant to our conclusions that radiographer flagging is the more appropriate IFs protocol in the context of the UK Biobank. Similarly, Dr Jha asked for further explanation about how we concluded that radiographer flagging resulted in less net harm compared to systematic radiologist review of all images.
We thank Professor Hofmann and Dr Jha for these comments, and agree with Professor Hofmann's further statement that while this article is not focused on the ethics of IFs, these issues do need to be addressed and elaborated upon. UK Biobank have carefully considered the legal and ethical background with regards to feedback of PSIFs, with input from its Imaging Working Group, its independent Ethics and Governance Council, representatives of its major funders (Wellcome Trust and the Medical Research Council), legal input from UK Biobank's legal counsel and from external legal counsel, and ethics advice from Professor Michael Parker of Ethox (who is also a co-author on this manuscript). Following the evaluation study, UK Biobank summarised the data on PSIFs and provided a detailed and lengthy interpretation of the results in the context of both the legal and ethical backgrounds in reports to their funders. Therefore, for readers' convenience, we have summarized the key points of these reports by adding concise sentences to the discussion text to further describe the legal advice, duty of care, the relationship between the legal and ethical considerations, and justification for our conclusions. We hope that this approach addresses both Professor Hofmann's and Dr Jha's comments.