Use of whole-genome sequencing to identify clusters of Shigella flexneri associated with sexual transmission in men who have sex with men in England: a validation study using linked behavioural data

Since the 1970s, shigellosis has been reported as a sexually transmissible infection, and in recent years, genomic data have revealed the breadth of Shigella spp. transmission among global networks of men who have sex with men (MSM). In 2015, Public Health England (PHE) introduced routine whole-genome sequencing (WGS) of Shigella spp. to identify transmission clusters. However, limited behavioural information for the cases hampers interpretation. We investigated whether WGS can distinguish between clusters representing sexual transmission in MSM and clusters representing community (non-sexual) transmission to inform infection control. WGS data for Shigella flexneri from August 2015 to July 2017 were aggregated into single linkage clusters based on SNP typing using a range of SNP distances (the standard for Shigella surveillance at PHE is 10 SNPs). Clusters were classified as ‘adult male’, ‘household’, ‘travel-associated’ or ‘community’ using routine demographic data submitted alongside laboratory cultures. From August 2015 to March 2017, PHE contacted those with shigellosis as part of routine public-health follow-up and collected exposure data on a structured questionnaire, which for the first time included questions about sexual identity and behaviour. The questionnaire data were used to determine whether clusters classified as ‘adult male’ represented likely sexual transmission between men, thereby validating the use of the SNP clustering tool for informing appropriate public-health responses. Overall, 1006 S. flexneri cases were reported, of which 563 clustered with at least one other case (10-SNP threshold). Linked questionnaire data were available for 106 clustered cases, of which 84.0 % belonged to an ‘adult male’ cluster. At the 10-SNP threshold, 95.1 % [95 % confidence interval (CI) 88.0–98.1%] of MSM belonged to an ‘adult male’ cluster, while 73.2 % (95 % CI 49.1–87.5%) of non-MSM belonged to a ‘community’ or ‘travel-associated’ cluster. At the 25-SNP threshold, all MSM (95 % CI 96.0–100%) belonged to an ‘adult male’ cluster and 77.8 % (95 % CI 59.2–89.4%) of non-MSM belonged to a ‘community’ or ‘travel-associated’ cluster. Within one phylogenetic clade of S. flexneri, 9 clusters were identified (7 ‘adult male’; 2 ‘community’) using a 10-SNP threshold, while a single ‘adult male’ cluster was identified using a 25-SNP threshold. Genotypic markers of azithromycin resistance were detected in 84.5 % (294/348) of ‘adult male’ cases and 20.9 % (9/43) of cases in other clusters (10-SNP threshold), the latter of which contained gay-identifying men who reported recent same-sex sexual contact. Our study suggests that SNP clustering can be used to identify Shigella clusters representing likely sexual transmission in MSM to inform infection control. Defining clusters requires a flexible approach in terms of genetic relatedness to ensure a clear understanding of underlying transmission networks.


DATA SUMMARY
FASTQ reads from all sequences can be found under Public Health England Pathogens BioProject PRJNA315192 at the National Center for Biotechnology Information Sequence Read Archive, available at https://www.ncbi.nlm.nih.gov/bioproject/315192.

INTRODUCTION
Shigellosis or severe bacillary dysentery is caused by four Shigella species (Shigella flexneri, Shigella dysenteriae, Shigella sonnei and Shigella boydii) transmitted via the faecal-oral route of infection [1,2]. In England, shigellosis is often associated with travel to countries considered to be at 'high-risk' of enteric disease, primarily in South Asia or sub-Saharan Africa [1,3]. Over the last 10 years, however, there has been evidence of increasing transmission through sex between men [4,5]. This occurs through oral-anal sexual contact, and has been linked with specific sexual activities and drug-use behaviours among predominantly human immunodeficiency virus (HIV)-positive men who have sex with men (MSM) [6]. Our previous genomic studies have described MSM-associated Shigella species, and sub-lineages of those species, that show phenotypic and genotypic markers of high-level resistance to azithromycin (mphA and ermB), which is postulated to be linked to off-target effects from antibiotics prescribed for sexually transmitted infections [4,7,8].
The Gastrointestinal Bacteria Reference Unit (GBRU) at Public Health England (PHE) provides national microbiological reference services in England and Wales, including species identification and molecular typing for Shigella spp. Since August 2015, whole-genome sequencing (WGS) has been performed on all cultured isolates of Shigella spp. referred to the GBRU [9,10]. SNP typing is used to aggregate isolates into single linkage clusters considered to represent cases that are linked through recent transmission events. This provides a tool used for public-health decision-making in near real-time to distinguish between linked cases that might be part of an evolving UK outbreak, thus requiring a robust infection-control response, and cases who likely acquired their infection whilst travelling abroad [11]. Limited routine demographic data are submitted to the GBRU alongside laboratory isolates (see Methods) and are used to classify clusters, but these data lack information on sexual identity and behaviour.
In the UK, Shigella spp. are notifiable infections [12]. In 2015, as part of routine case follow-up and investigation, PHE introduced a new questionnaire to standardize and expand the collection of exposure information on suspected cases of shigellosis. In response to our previous work, and for the first time, this included questions relating to sexual identity and recent sexual behaviour. We combined questionnaire data with WGS SNP typing data for S. flexneri cases to understand the sensitivity and specificity of WGS cluster categorization at a range of SNP thresholds for distinguishing clusters representing sexual transmission within MSM networks from other clusters representing community (non-sexual) transmission.

METHODS

S. flexneri isolates
All S. flexneri isolates referred to the GBRU by hospital laboratories in England between August 2015 and July 2017 were included in this study. Demographic data submitted to the GBRU alongside laboratory isolates (name, date of birth, sex, postcode of residence and foreign travel history) were extracted from the GastroData Warehouse, an isolate-level database that houses GBRU laboratory results for confirmatory testing and typing of gastrointestinal bacteria. Duplicate isolates belonging to the same individual within a 2 week period were excluded.
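
The de-duplication rule above can be sketched in a few lines of Python. This is only an illustration: the record layout (a person identifier plus a receipt date), the function name `drop_duplicate_isolates` and the choice to measure the 2 week window from the last retained isolate are assumptions, not details taken from the GastroData Warehouse.

```python
from datetime import date, timedelta

def drop_duplicate_isolates(isolates, window_days=14):
    """Drop repeat isolates from the same person within `window_days`.

    `isolates` is a list of (person_id, received_date) tuples (hypothetical layout).
    The window is measured from the last isolate that was kept (an assumption).
    """
    kept = []
    last_kept = {}  # person_id -> date of the last retained isolate
    for person_id, received in sorted(isolates):
        previous = last_kept.get(person_id)
        if previous is None or received - previous > timedelta(days=window_days):
            kept.append((person_id, received))
            last_kept[person_id] = received
    return kept

isolates = [
    ("A", date(2016, 1, 1)),
    ("A", date(2016, 1, 10)),  # 9 days after the first isolate: treated as a duplicate
    ("A", date(2016, 2, 1)),   # 31 days later: retained as a new episode
    ("B", date(2016, 1, 5)),
]
deduplicated = drop_duplicate_isolates(isolates)
```

In this toy input, three of the four records are retained: the second isolate from person A is dropped.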

Impact Statement
In recent years, genomic data have been used to describe the global spread of antimicrobial-resistant Shigella among large networks of men who have sex with men (MSM). To facilitate the detection of transmission clusters, Public Health England implemented whole-genome sequencing (WGS) for national surveillance of Shigella. SNP typing is used to facilitate cluster detection and ensure timely public-health action. However, interpretation is hindered by the lack of behavioural data. In this study, we use robust data on sexual identity and behaviour linked to individuals presenting with Shigella flexneri infections to validate a real-time SNP clustering algorithm. This is the first time that these data have been available for a large sub-set of cases in England. We demonstrate that SNP typing can be used to detect and distinguish clusters representing likely sexual transmission in MSM from other transmission clusters in the community. Our approach can be used to deliver a more timely and appropriate public-health response. Our results are relevant to any country developing and implementing WGS for public-health surveillance and outbreak detection.

WGS and sequence analysis
Genomic DNA from bacterial isolates was extracted using the QiaSymphony DNA extraction platform (Qiagen). DNA was fragmented and tagged for multiplexing with Nextera XT DNA sample preparation kits (Illumina) and sequenced using the Illumina HiSeq 2500 platform at PHE. FASTQ reads from all sequences can be found in the PHE Pathogens BioProject PRJNA315192 at the National Center for Biotechnology Information Sequence Read Archive: https://www.ncbi.nlm.nih.gov/bioproject/315192. Illumina reads were mapped to the reference strain [S. flexneri serotype 2a strain 2457T (GenBank accession no. AE014073.1)] using bwa-mem [13]. The resulting Sequence Alignment Maps (sam files) were sorted and indexed to produce Binary Alignment Maps (bam files) using SAMtools [14]. High-quality variant positions (mapping quality >30, depth >10, variant ratio >0.9) identified using gatk v2.6 in unified genotyper mode [15] were extracted and stored in SnapperDB [11]. Hierarchical single linkage clustering was performed on the pairwise SNP distance matrix at descending SNP thresholds (250, 100, 50, 25, 10, 5 and 0), as previously described [11]. The clustering is summarized as a 'SNP address' (a seven-digit code) that describes the cluster membership at each of the thresholds. For phylogenetic analyses, recombinant regions of the genome were identified using Gubbins v2.0 and masked [16]. Pseudosequences of polymorphic positions were used to create a maximum-likelihood tree using RAxML v8.2.8 under the General Time Reversible model using up to 1000 bootstrap replicates [17]. Tree annotation was performed using Interactive tree of life (iTOL) v4.3 [18,19]. Antimicrobial-resistance determinants were detected using 'GeneFinder' [20] (https://github.com/phe-bioinformatics/gene_finder), a customized algorithm that utilizes Bowtie 2 [21] to map newly sequenced reads to a database of reference sequences, followed by SAMtools to create bam files [14]. Genes were defined as present if they represented 100 % of the reference sequence, with greater than 90 % nucleotide identity.
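
To make the 'SNP address' idea concrete, the following Python sketch applies single linkage clustering to a small pairwise SNP distance table at the seven thresholds listed above. It is not the SnapperDB implementation: the function names, the dictionary-based distance table and the way cluster numbers are assigned (in order of first appearance) are all assumptions made for illustration.

```python
from itertools import combinations

THRESHOLDS = (250, 100, 50, 25, 10, 5, 0)

def single_linkage(names, dist, threshold):
    """Union-find single linkage: join any pair whose SNP distance <= threshold."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)
    return {n: find(n) for n in names}

def snp_addresses(names, dist, thresholds=THRESHOLDS):
    """Return a dotted seven-part address per isolate (numbering is arbitrary)."""
    parts = {n: [] for n in names}
    for t in thresholds:
        roots = single_linkage(names, dist, t)
        labels = {}
        for n in names:  # number clusters in order of first appearance
            label = labels.setdefault(roots[n], len(labels) + 1)
            parts[n].append(str(label))
    return {n: ".".join(p) for n, p in parts.items()}

# Toy pairwise SNP distances (hypothetical values).
names = ["a", "b", "c", "d"]
dist = {
    frozenset(("a", "b")): 3,
    frozenset(("b", "c")): 8,
    frozenset(("a", "c")): 11,
    frozenset(("a", "d")): 300,
    frozenset(("b", "d")): 300,
    frozenset(("c", "d")): 300,
}
addresses = snp_addresses(names, dist)
```

In this toy example, isolates a and c differ by 11 SNPs but still share a 10-SNP cluster, because each is within 10 SNPs of b. This chaining is the defining property of single linkage and is why cluster membership can change markedly between thresholds.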

Cluster classification
A cluster was defined as two or more cases where the isolates differed in their pairwise SNP comparisons by a defined distance. A threshold of 10 SNPs difference across the core-genome sequences of any two isolates is the current standard for defining likely transmission clusters in routine public-health surveillance of Shigella spp. at PHE. For this study, we performed the cluster analysis using a range of SNP thresholds (25, 10, 5 and 0) to test the sensitivity and specificity of these thresholds for identifying likely clusters that, along with the submitted demographic data, could be classified as either sexual or non-sexual transmission clusters.
Clusters were designated as 'adult male' (a pragmatic proxy for MSM transmission) if they comprised: (i) between two and five cases in total of which all were men aged 16 years or older, or (ii) more than five cases where at least 90 % were men aged 16 years or older. Clusters were designated as 'household' if two or more cases shared a living space (i.e. same postcode of residence, or had the same surname and were identified in the same Health Protection Team (HPT) area if the residential postcode was unavailable). HPTs are local PHE teams (21 in total) providing public-health advice and operational support for infectious disease outbreaks.
Clusters were designated 'travel-associated' if they contained two or more cases and at least 50 % of these cases had reported travel to the same country (or world region) outside the UK. 'Community' clusters consisted of cases that did not meet any of the other classifications, including clusters of between two and five cases where at least one case was a woman, and clusters of more than five cases where the proportion of men aged 16 years or older was less than 90 %. Differences in cluster size and duration between 'adult male' clusters and other clusters were assessed using the Chi-square test for comparing two proportions.
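
The classification rules above can be expressed as a short Python function. This is a sketch under stated assumptions rather than the PHE tool: the `Case` record and `classify_cluster` name are hypothetical, the household rule is reduced to a shared postcode (the surname and HPT fallback is omitted), and the order in which the rules are tested is a guess, because the text does not state which rule takes precedence when more than one applies.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Case:
    sex: str                              # "M" or "F"
    age: int                              # age in years
    postcode: Optional[str] = None        # postcode of residence, if known
    travel_country: Optional[str] = None  # country visited outside the UK, if any

def classify_cluster(cases):
    """Classify a cluster; the rule order used here is an assumption."""
    n = len(cases)
    if n < 2:
        raise ValueError("a cluster needs at least two cases")

    # 'Adult male': all men aged >= 16 (2-5 cases) or at least 90 % (> 5 cases).
    adult_men = sum(1 for c in cases if c.sex == "M" and c.age >= 16)
    if (n <= 5 and adult_men == n) or (n > 5 and 10 * adult_men >= 9 * n):
        return "adult male"

    # 'Household': two or more cases share a postcode (simplified rule).
    postcodes = Counter(c.postcode for c in cases if c.postcode)
    if any(count >= 2 for count in postcodes.values()):
        return "household"

    # 'Travel-associated': at least 50 % of cases travelled to the same country.
    countries = Counter(c.travel_country for c in cases if c.travel_country)
    if countries and 2 * max(countries.values()) >= n:
        return "travel-associated"

    return "community"

men_only = [Case("M", 30), Case("M", 41), Case("M", 25)]
mixed = [Case("M", 30), Case("M", 41), Case("F", 29)]
```

Here `classify_cluster(men_only)` returns 'adult male', whereas `classify_cluster(mixed)` returns 'community' because a single female case in a cluster of five or fewer rules out the 'adult male' label.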

Standardized shigellosis exposure questionnaire
A standardized questionnaire was piloted from August 2015 to March 2017 by seven HPTs in England (three in London, four outside London) to follow up cases of shigellosis (Fig. S1, available with the online version of this article). The questionnaire collected demographic information including sexual identity for individuals aged 18 years or older (heterosexual/straight, gay/lesbian, bisexual, other, don't know/refuse to answer), sexual contact for adult men aged 18 years or older in the past 4 days [recent sexual contact (Yes/No) and if yes, was this with a man and/or a woman, or prefer not to answer], recent foreign travel (past 4 days), food and water consumption (past 4 days), clinical condition (date of onset, symptoms and hospitalization) and risk status (i.e. at increased risk of spreading the infection to others, such as an individual being a food handler, healthcare worker or in contact with children aged 5 years or under) [22].

Validation of SNP clustering
Questionnaire data were linked to WGS records extracted from the GastroData Warehouse using name, date of birth, sex and full postcode of residence. Data were analysed for all clusters that contained at least one individual with a completed questionnaire. The questionnaire data were used as the gold standard to validate whether the designation of a cluster as 'adult male' (i.e. likely sexual transmission between MSM) was accurate. Overall sensitivity and specificity of the cluster tool were assessed for each threshold using the questionnaire data as the gold standard to classify cases as either MSM (gay or bisexual-identifying men, or men who reported recent same-sex sexual contact) or non-MSM (heterosexual men, women and children under the age of 18 years); men who did not report any information on sexual identity and recent sexual contact were excluded. We calculated the proportion of 'adult male' clustered cases that (i) self-identified as gay or bisexual men, (ii) reported recent same-sex sexual contact and (iii) reported recent foreign travel at a range of SNP thresholds.
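
As an illustration of the validation calculation, the Python sketch below computes sensitivity (the proportion of MSM cases falling in an 'adult male' cluster) and specificity (the proportion of non-MSM cases falling outside one), each with a 95 % confidence interval. The Wilson score interval is used here as an assumption, since the text does not state which interval method was applied, and the counts are invented for the example rather than taken from the study.

```python
from math import sqrt

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def sensitivity_specificity(records):
    """`records` holds (is_msm, in_adult_male_cluster) pairs, one per case."""
    records = list(records)
    tp = sum(1 for msm, adult_male in records if msm and adult_male)
    fn = sum(1 for msm, adult_male in records if msm and not adult_male)
    tn = sum(1 for msm, adult_male in records if not msm and not adult_male)
    fp = sum(1 for msm, adult_male in records if not msm and adult_male)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": (sens, wilson_interval(tp, tp + fn)),
        "specificity": (spec, wilson_interval(tn, tn + fp)),
    }

# Hypothetical counts: 18 true positives, 2 false negatives,
# 7 true negatives and 3 false positives.
records = (
    [(True, True)] * 18 + [(True, False)] * 2
    + [(False, False)] * 7 + [(False, True)] * 3
)
summary = sensitivity_specificity(records)
```

With these invented counts the point estimates are 0.9 for sensitivity and 0.7 for specificity. Repeating the calculation with cluster labels taken at each SNP threshold gives one pair of estimates per threshold.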

Antimicrobial resistance
The proportion of cases with Shigella spp. WGS data that showed genotypic markers of azithromycin resistance (mphA and ermB) was explored for all 10-SNP clusters that contained at least one individual with a completed questionnaire. The questionnaire data were used to calculate what proportion of these cases were self-identifying gay men.

RESULTS

Description of WGS clusters
Between August 2015 and July 2017, there were 1006 referred S. flexneri isolates with WGS data in England, of which 563 aggregated into 92 single linkage clusters defined at the 10-SNP threshold cut-off. Most clusters were classified as 'community' (n=36) or 'adult male' (n=36), followed by 'travel-associated' (n=14) and 'household' (n=6). However, when considering the overall number of cases across all clusters, 'adult male' clusters accounted for most of the case burden (68.9%, n=388), followed by 'community' (22.4%, n=126), 'travel-associated' (6.4%, n=36) and 'household' (2.3%, n=13). The median cluster size was 2 cases (range: 2 to 240 cases) and the median cluster duration (i.e. the time between the first and last reported cases) was 2 months (range: 1 day to 24 months).
'Adult male' clusters were generally larger; one-third (

Description of questionnaire data
S. flexneri questionnaires were available for 190 cases, representing 37.8 % (190/503) of all cases reported to GBRU from the participating HPTs (42.4 % in London, 28.3 % outside London) and 21.9 % (190/868) of all cases reported nationally during the pilot period. A total of 75.3 % (n=143/190) of questionnaires were submitted by London HPTs. Self-reported sexual identity and recent sexual contact data were available for 88.9 % (n=152/171) and 92.5 % (n=123/133) of individuals with questionnaire data, respectively.

Clusters with linked WGS and questionnaire data
Using a 10-SNP threshold, 34 single linkage clusters contained at least one individual with a completed questionnaire, representing 37.0 % (34/92) of all clusters that were detected nationally during the study period (Table 1). Of these clusters, 97.1 % (33) belonged to clonal complex (CC) 245, while 2.9 % (1) belonged to CC145. Combined they represented 22 'adult male', 10 'community' and 2 'travel-associated' clusters. In total, these clusters contained 401 cases, representing 71.2 % (401/563) of all clustered cases nationally during the study period. Clusters classified as 'adult male' accounted for most of the cases (86.8 %, 348/401), followed by 'community' (10.7 %, 43/401) and 'travel-associated' (2.5 %, 10/401) clusters. At the 10-SNP threshold, 26.4 % (106/401) of clustered cases had their isolates linked to questionnaire data. The size and distribution of clusters varied depending on the SNP threshold. Table 1 presents a summary of clustered cases reported for each SNP threshold cut-off, and the proportion linked to a questionnaire. Table 2 presents the proportion of clustered cases for each SNP threshold cut-off by (i) sexual identity, (ii) recent sexual contact and (iii) recent foreign travel. For clusters at all SNP thresholds, there were adult men who did not provide information on sexual identity (for example, n=9 at the 10-SNP threshold) and/or recent sexual contact (n=7 at the 10-SNP threshold) that clustered with men identifying as gay. Overall sensitivity and specificity of the cluster classifications at a range of SNP thresholds are presented in Fig. 1. For some clusters, the classification changed when performing the validation analysis at different SNP thresholds, providing a deeper understanding of likely transmission routes. Three 'community' clusters (two 10-SNP, one 5-SNP) were part of a larger 25-SNP 'adult male' cluster.
These clusters did not meet the criteria to be defined as 'adult male'; for clusters comprising between two and five cases at least one case was female, and for clusters comprising more than 5 cases, less than 90 % were men aged 16 years or older (Table 3). All isolates in this 25-SNP 'adult male' cluster (highlighted in grey in Fig. 2) belonged to the same phylogenetic clade, for which 10-SNP clustering identified multiple discrete clusters (7 'adult male'; 2 'community') (Fig. S2). Overall, 74.2 % (69/93) of isolates (clustered and non-clustered) from gay-identifying men belonged to this clade (Fig. 2).

Genotypic markers of azithromycin resistance
Genotypic markers of azithromycin resistance were detected in the genomes linked to 84.5 % (294/348; 29 mphA, 11 ermB, and 254 mphA and ermB) of cases belonging to 'adult male' clusters and 20.9 % (9/43; all mphA and ermB) of cases belonging to 'community' clusters at the 10-SNP threshold (Fig. 2). Of those cases where a questionnaire was completed, 83.3 % (65/78) and 100 % (4/4) within 'adult male' and 'community' clusters, respectively, were gay-identifying men. The other 16.7 % (n=13) comprised five heterosexual-identifying men, two of whom reported recent sexual contact with a woman, and eight adult men who did not provide information on sexual identity, although two of these reported recent sexual contact with a man. The four gay-identifying men within the 'community' clusters belonged to the two 'community' clusters that were part of a larger 25-SNP 'adult male' cluster (see Table 3 for details of these clusters).

Table 2. Sexual identity, recent sexual contact and foreign travel for clustered cases with a completed questionnaire
Column groups (table body not recovered): Adult male n=106, Community n=18, Travel n=3; Adult male n=89, Community n=15, Travel n=2; Adult male n=74, Community n=11, Travel n=2, Household n=2; Adult male n=25.
*Sexual identity not reported. †Denominator includes adult men (≥18 years) only. ‡Includes 2 men who did not disclose their sexual identity (i.e. their sexual identity is reported here as 'adult men') but who reported recent same-sex sexual contact. §Includes 1 man who did not disclose his sexual identity (i.e. his sexual identity is reported here as 'adult men') but who reported recent same-sex sexual contact. ||Foreign travel as recorded on the questionnaire; data missing for 1 case.

DISCUSSION
This is the first time that robust data on sexual identity and behaviour have been available for a large sub-set of Shigella cases in England, and we have been able to use these data to validate a real-time SNP clustering algorithm. We provide evidence that SNP clustering using limited routine demographic data can distinguish clusters representing likely sexual transmission in MSM from other clusters representing community (non-sexual) transmission, thereby informing rapid and appropriate public-health responses.
The main limitation of this study is that behavioural data were only available for HPTs participating in a national pilot study. In our study, a higher proportion of clusters were 'adult male' compared to the national cluster data, and a higher proportion of all clustered cases belonged to these 'adult male' clusters. This bias towards 'adult male' cases may reflect the larger MSM populations within the participating HPTs, including London, Manchester and Brighton and Hove [23]. It is likely that a higher proportion of Shigella cases in these regions were part of 'adult male' clusters. As a result, our analysis may have overestimated the proportion of cases transmitted through sexual contact in MSM compared to other transmission routes. This could have biased our cluster validation analysis if the HPTs were more likely to follow up MSM cases, or if the HPTs not included in the pilot were more likely to see 'adult male' clusters that were linked through community (non-sexual) transmission. It is also important to note that sexual identity and recent sexual contact were only collected for individuals aged 18 years or older, whereas the 'adult male' cluster classification used a cut-off of 16 years or older. The impact of this on our validation analysis is negligible due to the small number of individuals falling within this 2-year age gap.
Our study is restricted to cases who presented to healthcare, with a stool sample collected for investigation by local hospital laboratories and referral to the GBRU for molecular typing; approximately two-thirds of Shigella spp. isolated at local hospital laboratories are referred to the GBRU [24]. This under-ascertainment might have led to bias in our study, because a range of factors are likely to influence whether an isolate was included, such as patterns of health-seeking behaviour, testing practices and gender disparities in travel, food consumption or childcare [25]. A large population-based cohort study of infectious intestinal disease in England found that presenting to healthcare was associated with severity of illness, recent foreign travel, educational level and lower socioeconomic status [26]. Furthermore, an analysis of Shigella spp. surveillance data from the USA found that the odds of severe shigellosis were higher for men compared to women among adults aged 18-49 years old, and for men compared to women among individuals diagnosed with S. flexneri. However, the role of sexual behaviour was not assessed [27]. Collectively, such factors are more likely to lead to an overestimation in the proportion of cases transmitted through sexual contact.

Table footnotes: *Ratio of males to females. †Denominator includes individuals with a questionnaire only. ‡Foreign travel as reported through questionnaires or laboratory records (for those without a questionnaire). §Adult man who did not report sexual identity and preferred not to disclose gender of partner.
In our study, we confirmed that most cases belonged to 'adult male' clusters that predominantly consisted of MSM with no or unknown recent travel history. One dominant 'adult male' cluster persisted for 24 months, consistent with ongoing sexual exposure and potential for reinfection within large and dense sexual networks [6,28–30]. Interestingly, up to 28 % of gay men within the 'adult male' clusters did not report recent sexual contact, which could suggest transmission through community (non-sexual) contact, a delay in symptom onset (due to an incubation period of longer than 4 days), relapse of a previously acquired chronic infection, or misreporting of recent behaviour. Of concern, and in contrast to other clusters, genotypic markers of azithromycin resistance were detected in most cases belonging to 'adult male' clusters. High-level azithromycin resistance in MSM transmission networks has been shown to play a role in driving the spread of Shigella infection [8], and this raises concerns about future treatment options for Shigella and the wider global problem of AMR.
Our findings may inform best practice for Shigella cluster investigations in any country developing and implementing WGS for surveillance and/or outbreak detection. However, this methodology is dependent on building a comprehensive library of contextual WGS data against which to classify new cases. Our analysis showed that real-time WGS of Shigella can be used to detect likely clusters of sexual transmission in MSM, but also showed that a flexible approach to SNP thresholds, or one that encompasses phylogenetic context, is important to fully understand the underlying transmission network and likely transmission route, particularly in clusters comprising fewer than 10 cases where at least one is female. In addition, persistent transmission networks sustained through sexual contact over longer periods of time will naturally exhibit greater genetic diversity, which should be considered when investigating clusters. In practice, restricting the SNP threshold to 10-SNPs may lead to subsequent investigation of clusters that appear to be part of different sexual networks, or the delivery of non-targeted public-health messages. In addition, community (non-sexual) transmission between MSM and non-MSM population groups, although currently limited to small numbers, should not be ignored.
GBRU typing data are used for routine surveillance and are currently assessed on a weekly basis for active clusters. The decision to further investigate a cluster depends on several factors, including cluster classification, duration, size and geographical spread. Cluster-specific interventions are not usually required for 'adult male' clusters that are geographically and temporally dispersed. In collaboration with partners including the British Association for Sexual Health and HIV (BASHH), the Terrence Higgins Trust (THT) and the LGBT Foundation, PHE provides public-health messages targeted towards MSM and clinicians that aim to raise awareness of Shigella and offer advice on what men can do to lower their risk (e.g. washing hands and genitals before and after sex, using condoms for risk practices and avoiding the use of shared sex toys) [31]. Despite this, it is not clear whether these public-health messages are effective. In 2016, a London-wide sexual-health promotion campaign focussed on raising awareness through social media, and through posters and leaflets displayed in sexual-health clinics. The campaign evaluation found that overall awareness among MSM remained low (29%) [32]. This study provides an additional route to provide appropriate and targeted healthcare advice to those who may not identify as gay or bisexual, or reveal same-sex sexual contact, but who are likely to be part of an MSM network and, therefore, at higher risk of acquiring shigellosis, as well as other sexually transmitted infections and HIV infection. Identification of sexual transmission is important to ensure prompt and onward referral to sexual-health clinics.
The utility of SNP clustering for improving case ascertainment during outbreak investigations, and revealing linked cases not identified prior to the implementation of WGS, has been described previously [33–35]. However, detection and classification of WGS clusters should complement traditional epidemiological methods, including questionnaire completion, which can help to unravel the characteristics of the cluster and describe how transmission might be occurring. For example, epidemiological data may identify individuals who share a common source, such as attendance at a specific sex-on-premises venue, and these clusters may be amenable to public-health action including improvements in hygiene and the provision of condoms. The addition of epidemiological data to SNP clustering data may also reveal novel links between MSM and other populations in the community, particularly where MSM are within a known risk group (e.g. food handlers or healthcare workers). Sustained transmission in sexual networks could seed community-acquired outbreaks, which is a concern given the high rates of antimicrobial resistance in the MSM population. Therefore, these clusters may be prioritized for public-health action.