Visual diagnosis of female genital schistosomiasis in Zambian women from hand-held colposcopy: agreement of expert image review and association with clinical symptoms

Background: Female genital schistosomiasis (FGS) can occur in S. haematobium infection and is caused by egg deposition in the genital tract. Confirming a diagnosis of FGS is challenging due to the lack of a diagnostic reference standard. A 2010 expert-led consensus meeting proposed visual inspection of the cervicovaginal mucosa as an adequate reference standard for FGS diagnosis. The agreement of expert human reviewers for visual-FGS has not been previously described.
Methods: In two Zambian communities, non-menstruating, non-pregnant, sexually-active women aged 18-31 years participating in the HPTN 071 (PopART) Population-Cohort were enrolled in a cross-sectional study. Self-collected genital swabs and a urine specimen were collected at a home visit; trained midwives performed cervicovaginal lavage (CVL) and hand-held colposcopy at a clinic visit. S. haematobium eggs and circulating anodic antigen (CAA) were detected from urine. Two senior physicians served as expert reviewers and independently diagnosed visual-FGS as the presence of sandy patches, rubbery papules or abnormal blood vessels in cervicovaginal images obtained by hand-held colposcopy. PCR-FGS was defined as Schistosoma DNA detected by real-time PCR in any genital specimen (CVL or genital swab).
Results: Of 527 women with cervicovaginal colposcopic images, 468/527 (88.8%) were deemed interpretable by Reviewer 1 and 417/527 (79.1%) by Reviewer 2. On expert review of colposcopic images, Reviewer 1 detected visual-FGS in 35.3% (165/468) of participants and Reviewer 2 in 63.6% (265/417). Cohen’s kappa statistic for agreement between the two reviewers was 0.16, corresponding to "slight" agreement. The reviewers made concordant diagnoses in 38.7% (204/527) of participants (100 negative, 104 positive) and discordant diagnoses in 31.8% (168/527) of participants.
Conclusions: The unexpectedly low level of agreement between expert reviewers highlights the imperfect nature of visual diagnosis for FGS based on cervicovaginal images. This finding is a call to action for improved point-of-care diagnostics for female genital schistosomiasis.


Introduction
Female genital schistosomiasis (FGS), primarily caused by S. haematobium infection, is a neglected tropical disease associated with poverty, inadequate sanitation, and limited access to safe drinking water 1,2 . FGS occurs when schistosome eggs destined for excretion via the urinary bladder are deposited in the female genital tract. These tissue-deposited eggs can be associated with characteristic genital mucosal lesions and can present with genitopelvic findings including contact bleeding 3 , abnormal vaginal discharge 4 , and in some cases, infertility 5 . Visual-FGS refers to the identification of these characteristic mucosal changes, such as sandy patches (grainy and homogeneous), rubbery papules, and abnormal blood vessels by visual inspection of the cervicovaginal mucosa 3 . The visual detection of FGS-associated lesions requires the insertion of a vaginal speculum, a good light source, and a lens providing adequate magnification 6 . A standard colposcope has traditionally been used in research settings for visual-FGS diagnosis, but the bilharzia and HIV (BILHIV) study demonstrated recently that hand-held colposcopy could also be used to decentralize colposcopy services 6-9 .
Confirming a diagnosis of FGS is challenging as there is not a widely accepted diagnostic reference standard for research, diagnosis, and screening 2 . A 2010 expert-led consensus meeting proposed visual inspection of the cervicovaginal mucosa as an adequate reference standard for FGS diagnosis 10 . However, the mucosal changes in visual-FGS are non-specific and have also been associated with herpes simplex virus, human papillomavirus (HPV) infection, and cervical precancer 3 . Diagnostic methods that are not adequately specific for FGS diagnosis may lead to over-treatment with praziquantel and may overlook the diagnosis and treatment of sexually transmitted infection (STI) and cervical cancer. Although there is little evidence of praziquantel resistance in humans 11 , indiscriminate treatment may theoretically increase the risk of the development of praziquantel resistance 12 . Since cervicovaginal visualization is widely promoted 13 for FGS screening and diagnosis, we aimed to use BILHIV study data to further evaluate the agreement of human expert reviewers for the diagnosis of visual-FGS. Secondary aims were to evaluate the association between visual-FGS and abdominal, genitourinary, and reproductive manifestations as well as evaluating Schistosoma diagnostic methods for their association with the presence of visual-FGS.

Study setting and participants
The cross-sectional bilharzia and HIV (BILHIV) study 9 was nested in the HPTN 071 (PopART) cluster randomized trial in Zambia 14 . S. haematobium is endemic in Zambia, and while more data are needed, prevalences ranging from 14% to 76% were reported in a recent systematic review 15 . The HPTN 071 (PopART) trial was a cluster randomized trial assessing the impact of an HIV-1 combination prevention package including "universal test and treat" 14 . As previously described, after the 36-month HPTN 071 (PopART) visit, community workers made home visits to women expressing interest in the BILHIV study 9 . Between January and August 2018, eligible women who were 18-31 years old, not pregnant, sexually active, and resident in one of two urban communities that participated in HPTN 071 (PopART) in Livingstone, Zambia were enrolled in the BILHIV study. The primary aim of the BILHIV study was to compare the performance of genital self-sampling (cervical and vaginal swabs) to clinic-based cervicovaginal lavage (CVL) for the detection of Schistosoma DNA by quantitative PCR (qPCR) as previously described 9 .
A pre-specified BILHIV study objective (the subject of the current manuscript) was to evaluate the agreement between expert reviewers of images obtained through hand-held colposcopy for the diagnosis of visual-FGS.
Home and clinic-based sample collection
As previously described, the home visit included written informed consent, a questionnaire, genital self-sampling (cervical and vaginal), and collection of a urine specimen 9 . There were no restrictions on the timing of urine self-sample collection, and 69.5% (419/603) of the total BILHIV study samples were collected between 9:00 and 14:00 9 . Enrolled women who were not currently menstruating were then invited to attend Livingstone Central Hospital cervical cancer clinic, where midwives collected CVL. After speculum insertion, normal saline (10 mL) was flushed across the cervix and vaginal walls for one minute with a bulb syringe and CVL fluid was collected from the posterior fornices.
Hand-held colposcopy and image review
At the clinic, cervicovaginal images were captured with a portable colposcope (EVA System, MobileODT, Tel Aviv, Israel) according to a standardized protocol. Per the protocol, trained midwives evaluated the cervix, anterior fornix, posterior fornix, left and right lateral cervix and vaginal walls and captured images of each location using the zoom and lighting functions in the MobileODT colposcope. Two senior physicians who have training and expertise in colposcopy and FGS served as expert reviewers. Digital images were independently evaluated by the expert reviewers for any of the four recognized FGS cervicovaginal manifestations: grainy sandy patches, homogeneous yellow sandy patches, rubbery papules, and abnormal blood vessels 16 . At their discretion, expert reviewers could exclude images that they felt could not be evaluated due to technical issues, image quality, or limited cervical visualization. If any of the four recognized FGS cervicovaginal manifestations was present, the participant was categorized as "visual-FGS". If none of the four cervicovaginal manifestations was present, the participant was categorized as "visual-FGS not detected" 16 . The expert reviewers were both senior practicing physicians at the Professor level, who have training and expertise in standard colposcopy. Reviewer 1 (EFK) is a full-time FGS researcher and an infectious diseases physician and Reviewer 2 (BV) is an obstetrician and gynecologist who regularly analyses images for cervical cancer. Both reviewers have extensive practical and research-based expertise in evaluating and diagnosing FGS in endemic settings. Additionally, both reviewers contributed as authors of the 2015 WHO FGS Pocket Atlas 16 . Each reviewer was informed of the study setting and methods, but both were blinded to the study participants' FGS and Schistosoma status.

Amendments from Version 1
The revised version of this manuscript states the study objectives more clearly in the introduction. In response to reviewers, the authors also provide additional information regarding how the subset of participants tested for sexually transmitted infection were chosen and the standard protocol the midwives used to capture cervicovaginal images using hand-held colposcopy.
Any further responses from the reviewers can be found at the end of the article.
Women with at least one of the visual manifestations of FGS 3,16 or with any positive urine or genital Schistosoma diagnostic result were treated free-of-charge with 40 mg/kg praziquantel. Testing for STIs was not performed at the point-of-care and participants with suspected STIs were offered syndromic management, as per local guidelines 17 . In line with national and local clinic protocols adapted to real-world resource limitations, human papillomavirus (HPV) testing was not performed.
In parallel with BILHIV study procedures, participants could choose to engage in free cervical cancer screening using the visual inspection with acetic acid (VIA) technique. In the subset of women who engaged in cervical cancer screening, midwives applied 3-5% acetic acid to the cervix after CVL collection, as previously described 18 . An opaque white reaction was classified as positive and no change as negative 19 . Images for FGS analyses were taken before application of acetic acid. Images for cervical cancer screening were taken after application of acetic acid.
Urine microscopy and circulating anodic antigen
Up to 60 mL of fresh urine was centrifuged and examined by microscopy for S. haematobium eggs. The participant was considered to have urinary schistosomiasis if a pellet contained at least one S. haematobium egg 9 . All study specimens were stored at -80°C. A lateral flow assay utilizing up-converting reporter particles for the quantification of circulating anodic antigen (CAA) was performed on urine samples, as previously described 9,20 . Analyzing the equivalent of 417 μL urine (wet reagent, UCAAhT417), a test result indicating a CAA value >0.6 pg/mL was considered positive 21 .
qPCR for detection of Schistosoma DNA
DNA extraction, amplification and detection of the Schistosoma-specific internal-transcribed-spacer-2 (ITS-2) target by real-time PCR (qPCR) were performed at Leiden University Medical Center, as previously described, using 200 μL of CVL, cervical or vaginal swab fluid 9,22 .

Statistical methods
The planned sample size of the BILHIV study was based on calculations related to the primary BILHIV study objective, as previously described 9 . Participant characteristics were summarized by median and interquartile range (IQR) for continuous variables, and by frequency and percentage for categorical variables. Participants missing data for a specific variable were excluded from analyses involving that variable. The primary analysis evaluated the agreement between the two expert reviewers using Cohen's kappa statistic. A secondary analysis evaluated the association between visual-FGS (exposure) and abdominal, genitourinary, and reproductive manifestations (outcomes). Crude associations were evaluated using chi-squared tests, and logistic regression was used to calculate crude and adjusted odds ratios (OR) for the association of visual-FGS with clinical manifestations; this was done separately for each expert reviewer's diagnosis of visual-FGS. In this study we employed various diagnostic tests to evaluate urinary Schistosoma infection (CAA and urine microscopy), and FGS (portable colposcopy, and Schistosoma DNA on CVL and genital swabs) as previously described 23-25 . Another secondary analysis evaluated each diagnostic method for its association with the presence of visual-FGS, separately for each expert reviewer.
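As an illustration of the primary agreement measure, the following is a minimal Python sketch of Cohen's kappa computed from a square agreement table. It is not the authors' code (the study analyses were run in Stata), the function names are hypothetical, and the counts in the toy table are invented for illustration only; they are not study data. The verbal labels assume the Landis and Koch bands, which are consistent with the "slight" (0.16) and "moderate" (0.55) descriptions used in this article.

```python
from typing import Sequence


def cohens_kappa(table: Sequence[Sequence[int]]) -> float:
    """Cohen's kappa for a square agreement table.

    table[i][j] is the number of images that Reviewer 1 assigned to
    category i and Reviewer 2 assigned to category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("agreement table is empty")
    # Observed agreement: proportion of images on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Chance-expected agreement from the marginal totals.
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa: float) -> str:
    """Verbal label for kappa, assuming the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy 2x2 table (hypothetical counts, NOT study data).
# Rows: Reviewer 1 (positive, negative); columns: Reviewer 2.
toy_table = [[20, 5],
             [10, 15]]
print(round(cohens_kappa(toy_table), 2))
print(landis_koch_label(0.16), landis_koch_label(0.55))
```

Only participants whose images were rated by both reviewers can enter such a table, which is why differences in the number of interpretable images between reviewers reduce the denominator for the agreement analysis.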
Due to small numbers, for evaluating the association of visual-FGS with PCR-FGS, we used a composite definition of PCR-FGS or "any positive genital PCR", defined as any positive cervical or vaginal swab or CVL specimen. Chi-squared tests were used to assess crude associations, and logistic regression was used to calculate crude and age-adjusted odds ratio (OR) of the various Schistosoma and FGS diagnostics with the presence or absence of visual-FGS. We were unable to adjust for other potential confounders due to small numbers, particularly for STI and cervical pre-cancer status which were collected on a sub-set of participants. For both secondary analyses, exact logistic regression was used for analyses where 5 or fewer participants in a particular exposure category had the outcome. Due to the exploratory nature of these analyses, we did not adjust for multiple comparisons. Data were analyzed using STATA 15.1 (Stata Corporation, College Station, TX).
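For readers less familiar with the crude odds ratios described above, the sketch below shows how a crude OR and a Wald-type 95% confidence interval can be obtained directly from a 2x2 table; for a single binary exposure this matches the crude OR from logistic regression. It is an illustrative Python sketch, not the authors' Stata code: the function name and the counts are hypothetical, and adjusted ORs or exact logistic regression for sparse cells require a fitted model that is not shown here.

```python
import math
from typing import Tuple


def crude_odds_ratio(a: int, b: int, c: int, d: int,
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Crude odds ratio with a Wald (Woolf) confidence interval.

    a: exposed with the outcome      b: exposed without the outcome
    c: unexposed with the outcome    d: unexposed without the outcome
    z: normal quantile for the interval (1.96 gives about 95%).
    """
    if min(a, b, c, d) <= 0:
        # With a zero cell the Wald interval is undefined; exact methods
        # are more appropriate for sparse tables.
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (NOT study data).
or_, lo, hi = crude_odds_ratio(a=10, b=20, c=5, d=40)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The guard against zero or very small cells mirrors the rationale given above for switching to exact logistic regression when 5 or fewer participants in an exposure category had the outcome.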

Baseline characteristics and demographics
The BILHIV study enrolled 603 eligible women, 527 (87.4%) of whom had cervicovaginal images captured by portable colposcopy. Of the 527 women with images, 468 (88.8%) were deemed interpretable by Reviewer 1 and 418 (79.3%) by Reviewer 2 (Figure 1). Each reviewer designated a proportion of images uninterpretable, leading to differences in denominators. The median age of the participants was 24 years (IQR 22-28) and 323 (61.3%) had attended some secondary school (Table 1). The majority of participants were married, had previously been pregnant, and had been sexually active within the last six months. There was no association between visual-FGS, as identified by either expert reviewer, and current or childhood water contact.

Visual FGS and Schistosoma laboratory tests
Of the 527 participants, 6.1% (32/527) had urinary S. haematobium infection, as diagnosed by urine microscopy, and 14.9% (78/525) had a detectable urine CAA. There was no association between S. haematobium egg-positive urine microscopy or urine CAA and visual-FGS, as defined by Reviewer 1 or Reviewer 2's assessment (Table 3).

Symptoms
The association between abdominal, genitourinary, and reproductive manifestations and visual-FGS is shown in Table 4.

Discussion
Diagnostics for neglected tropical diseases should be accurate, accessible, and affordable, with specimen collection that is easy 26 . Making a diagnosis of FGS is challenging as there is currently not a widely accessible, sensitive and non-invasive reference standard for either diagnosis or screening which confirms Schistosoma genital involvement at the point-of-care. In a 2010 expert-led consensus meeting, visual imaging of the vagina and cervix with photocolposcopic methods was proposed as an adequate reference standard for FGS visual diagnosis 27 . Imaging is currently the only widely available point-of-care diagnostic tool for FGS diagnosis outside of the research setting and the BILHIV study sought to use handheld colposcopy to enable community-based FGS diagnosis 9 . Visual imaging can be useful in the assessment of Schistosoma-related morbidity, praziquantel treatment response, and defining the natural history of visual-FGS. Additionally, hand-held and traditional colposcopy have the logistical advantage that they can be integrated with existing cervical cancer screening programmes 28 . However, visual imaging has important limitations. Firstly, interpretation of visual imaging is subjective. Secondly, visual imaging lacks specificity as the characteristic sandy patches can also be associated with STI and the abnormal blood vessels can also be associated with cervical precancer 3 . This study shows "slight" agreement between senior, highly experienced expert reviewers, highlighting the imperfect nature of human expert review of images for FGS.
Visual-FGS diagnosis is a widely accepted diagnostic tool for evaluating Schistosoma-associated genital morbidity. However, visual-FGS screening is often centralized in settings with access to traditional colposcopy and is invasive, requiring vaginal speculum insertion and trained medical professionals (physicians, nurses, or midwives) to visualize the cervix and vagina at high resolution 9 . Additionally, visual-FGS diagnosis requires a full inspection of the mucosal surfaces of the vagina and cervix. If metal specula are used, post-examination autoclaving and appropriate disinfection further constrain the settings in which this diagnostic strategy can be seamlessly implemented. Disposable specula have risks and benefits. While hygienic and convenient, disposable plastic specula may not be sturdy enough when rotated to inspect the anterior and posterior vaginal walls and may contribute to missed visual-FGS diagnoses 6 . A good light source is needed for optimal cervicovaginal visualization 6 , as well as a device which can provide adequate magnification, ideally a colposcope, hand-held colposcope, or digital camera 8 . Thus, colposcopy, whether hand-held or traditional, for visual-FGS diagnosis is not readily scalable for use as a population-based screening technique.
In this current work, without complete STI and HPV testing or cervicovaginal biopsy on each participant, it is challenging to assess the significance of the sandy patches and abnormal blood vessels identified by the clinical expert reviewers. Notably, researchers in Tanzania performed macroscopic cervicovaginal examinations comparing S. haematobium endemic and non-endemic areas, finding 75% of participants in endemic areas had cervical lesions (including sandy patches, edema, erosions and petechiae) compared with 36% of women in non-endemic areas (although their travel and medical history were not described) 29 . The Tanzanian study illustrates the limited specificity of visual techniques, since one-third of the women had cervical lesions in communities where S. haematobium is not endemic.
Other diagnostic approaches, such as PCR-based methods, have been implemented in research settings but are not yet field-deployable 9 . Antigen, antibody, and pathogen-based diagnostics (such as microscopy) are useful diagnostic adjuncts for Schistosoma infection, but do not confirm the involvement of genital tissue. Future diagnostic algorithms may be optimized by first performing a microbiologic S. haematobium diagnosis prior to performing screening for genital involvement 9 . Promising pathogen detection strategies that can be implemented at the point-of-care include isothermal DNA amplification methods 30,31 . These field-deployable molecular assays should be further developed for use at the point-of-care to identify Schistosoma DNA in self-collected genital swabs 31 .
Our study did not show a consistent association between expert diagnosis of visual-FGS and abdominal, genitourinary and reproductive symptoms. Reviewer 1's evaluation suggested an association between self-reported delay in conception and visual-FGS and Reviewer 2's evaluation suggested a weak association with hematuria and dyspareunia in participants with visual-FGS. A retrospective study from Tanzania evaluating histopathology reported tubal schistosomiasis in 4 patients presenting with infertility 32 and a cross-sectional study from Zimbabwe found strong evidence that the presence of S. haematobium in pap smear was associated with infertility in women aged 20-49 years, after adjusting for age and HIV status 5 . While it is tempting to consider the association of delayed conception identified by one reviewer with visual-FGS in isolation, the association would have been strengthened by consistency of the findings across reviewers. Additionally, in interpreting this result, it is important to consider the possibility of a type I error when large numbers of statistical tests are performed.
Previous work on visual-FGS has compared visual imaging to other diagnostic standards 33 , used computerized algorithms 34,35 , or combined human reviewers with a digital gridding technique to evaluate visual-FGS 7 . A recent Madagascan study utilized human reviewers together with a digital image gridding technique to review images of women with known FGS-associated clinical lesions and found a Fleiss kappa of 0.55 ("moderate" agreement) for detecting rubbery papules. Reviewers in that study achieved a higher agreement than that described in our study, potentially by undergoing an initial consensus rating exercise to reach agreement on uniform rating of images. Our approach in the BILHIV study illustrates a real-world scenario where expert reviewers may not necessarily have the opportunity for consensus agreement prior to consultation. This is the first study to assess the agreement of human expert reviewers for diagnosing visual-FGS with hand-held colposcopy, where both reviewers were blinded to the participants' FGS and Schistosoma diagnostic status. In this study, both expert reviewers are experienced clinical Professors who have expertise in diagnosing FGS in endemic settings and contributed as authors to the 2015 WHO FGS Pocket Atlas 16 .
While our approach is unique, this work has some limitations. The prevalence of urinary schistosomiasis and PCR-FGS were low, thus limiting precision in effect sizes and power to detect association when comparing PCR-FGS and urinary schistosomiasis with visual-FGS. Additionally, the urban setting, relatively narrow age range of the participants and low urinary S. haematobium prevalence may limit generalizability. Future additional work in a setting with higher schistosomiasis prevalence would be needed to definitively exclude an association between symptoms, standard Schistosoma, and FGS diagnostics and visual-FGS. To replicate real-world conditions, standardized equipment on which to perform image review was not provided to reviewers. Thus, we cannot exclude that differences in color, brightness, contrast, or saturation of images on the reviewers' computers contributed to differences between reviewers. Additionally, future work could incorporate artificial intelligence, such as computer algorithms to detect the characteristic color change caused by involvement of the genital mucosa with FGS 35 or the use of digital gridding techniques 7 . Additionally, an initial consensus rating exercise could be incorporated into future work with human expert review for FGS-associated lesions. The presence or absence of the specific FGS lesion (sandy patch, rubbery papule, abnormal blood vessels) was not consistently documented along with the presence or absence of visual-FGS, limiting analysis by lesion type. Study participants self-reported their time-to-conception status, thus results may be subject to recall bias. STI testing was only performed on a subset of the study population and visual inspection with acetic acid data were not obtained within the BILHIV study, thus data on these variables are incomplete 18 .
Without complete STI and HPV testing or cervicovaginal biopsy on each participant, it is challenging to assess the significance of the sandy patches and abnormal blood vessels identified by the clinical expert reviewers. Thus, we cannot exclude residual or unmeasured confounding.
In conclusion, with only "slight" agreement between experienced expert reviewers who identified visual-FGS from digital images obtained during point-of-care colposcopy, we suggest caution when visual imaging is used as a stand-alone FGS diagnostic.

Comments on format:
○ Please define CVL in the abstract before using the acronym.
○ I would quickly define in the abstract what is meant by "expert reviewers", and would also define it sooner in the methods in the main text.
Comments on scientific content:
○ I am missing more in-depth description and discussion of the discordance in visual diagnosis between the two expert reviewers.
○ a) I noticed that the discordance was already quite large for the number of images that could not be interpreted (36 vs 106). What does "images inaccessible" in Figure 1 mean? Could you explain this discordance?
○ b) Regarding the discordance in visual diagnosis of FGS: I understand that the presence or absence of a specific FGS lesion was not systematically recorded but did you notice a systematicity in the discordance? Would this discordance have been solved in the majority of cases after discussions between the two expert reviewers?
○ c) Does the conclusion of "diagnostic uncertain" (n=23) of reviewer 1 overlap with the images discarded by reviewer 2 for "poor cervical visibility"?
○ d) I would suggest to add pictures, if allowed by the journal, of images leading to concordant positive, concordant negative and discordant diagnoses.
○ It seems that in a number of participants, visual-FGS was "positive" whereas PCR/CAA/Microscopy were negative. In addition to the fact that the lesions thought during visual inspection can be nonspecific, could it also be that visual-FGS in some cases is rather the sign of past infection? And in that case can the authors discuss the added-value of visual-FGS diagnostic?

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results?

"Comments on the scientific section: I am missing a more in-depth description and discussion of the discordance in visual diagnosis between the two expert reviewers. -a) I noticed that the discordance was already quite large for the number of images that could not be interpreted (36 vs 106). What does images inaccessible in Figure 1 mean? Could you explain this discordance?"
Thank you for the opportunity to clarify. In lines 156-157 we explain that, at their discretion, expert reviewers could exclude images that they felt could not be evaluated due to technical issues, image quality, or limited cervical visualization. This number was 36 for Reviewer 1 and 106 for Reviewer 2. Reviewer 1 had difficulty opening 10 of the images. In Figure 1 'images inaccessible' has been changed to "technical difficulty opening images".
"-b) Regarding the discordance in visual diagnosis of FGS: I understand that the presence or absence of a specific FGS lesion was not systematically recorded but did you notice a systematicity in the discordance? Would this discordance have been solved in the majority of cases after discussions between the two expert reviewers?"
We did not observe a systematicity in the discordance. It is difficult to say whether a discussion between experts could have resolved the discordance. We did not include an initial consensus rating exercise in our study as we wanted to simulate real-world conditions, where consultant physicians may not have time to perform a consensus review on a large number of images. We did note in the discussion at lines 407-408 that an initial consensus rating exercise could be incorporated into future work with human expert review for FGS-associated lesions.
"-c) Does the conclusion of "diagnostic uncertain" (n=23) of reviewer 1 overlap with the images discarded by reviewer 2 for "poor cervical visibility"?"
Thank you for this question. The images discarded by Reviewer 2 seem to be rejected at random as there were diagnoses of FGS negative, sandy patches and abnormal blood vessels among the discarded images. Within these images there were a minority (n=6) that were also rejected by Reviewer 1.
"-d) I would suggest to add pictures, if allowed by the journal, of images leading to concordant positive, concordant negative and discordant diagnoses."
Wellcome Open Research specifies that "Any photographs must be accompanied by written consent to publish from the individuals involved". We did not obtain consent from the participants to publish cervicovaginal images so unfortunately cannot include these.
"It seems that in a number of participants, visual-FGS was "positive" whereas PCR/CAA/microscopy were negative. In addition to the fact that lesions though during the visual inspection can be non-specific, could it also be that visual-FGS in some cases is rather the sign of past-infection? And in that case can the authors discuss the added value of visual-FGS diagnostic?"

There are significant challenges to accurate FGS diagnosis relating to lack of a consensus method, equipment availability, technical expertise, and stigma surrounding women's reproductive health. While the proposed reference-standard for FGS diagnosis is visual inspection by colposcopy (visual-FGS), this method is subject to expert interpretation and can be further confounded by some sexually transmitted infections. The authors report secondary findings of the cross-sectional bilharzia and HIV (BILHIV) study, in which 527 women from Zambia were enrolled; provided urine, cervical and vaginal self-sampled tissue swabs; and underwent clinical cervicovaginal lavage sampling and portable colposcope examination and imaging. The primary aim was to evaluate the agreement of visual-FGS by two independent experts, and the secondary aims were to compare visual-FGS with other FGS and S. haematobium diagnostic techniques as well as self-reported FGS sequelae. While the proportion of FGS cases, by any diagnostic method, was relatively low in this cohort, the authors report only slight agreement between the two visual-FGS experts. These results further emphasize the challenges associated with FGS diagnosis and the need for more reliable and accessible diagnostic methods. This study provides an important contribution to the FGS literature and calls attention to barriers to providing reproductive health care for African women.

Introduction:
1. The authors should provide some background epidemiological info on burden of FGS among women and girls in Africa, associated morbidities (relating to both physical and mental health as some of these are a feature of the secondary analysis), and emphasize that cases are underdiagnosed due to a myriad of reasons (equipment, expertise, access to women's health services, etc). If available, include any info on the burden of urogenital schistosomiasis in Zambia or offer context for why Zambia was chosen as the study site.
2. Please describe techniques for identifying FGS in the introduction (biopsy of genital tissue, visual inspection by colposcopy, PCR performed on cervicovaginal lavage material) and include the advantages and disadvantages of each.

Amy Sturt
"Introduction -The authors should provide some background epidemiological info on burden of FGS among women and girls in Africa, associated morbidities (relating to both physical and mental health as some of these are a feature of the secondary analysis), and emphasize that cases are underdiagnosed due to a myriad of reasons (equipment, expertise, access to womens health services, etc). If available, include any info on burden of urogenital schistosomiasis in Zambia or offer context for why Zambia was chosen as a study site."
Thank you for your input.
"3) It would be helpful to clarify in the introduction that this investigation reports findings of the BILHIV cohort study."
Thank you for this input, this is now clearly stated in the introduction at line 113.
"4) Suggestion to state the primary and two secondary objectives of the present study in the introduction section."
Thank you for this suggestion. The primary and secondary objectives of the analysis are now stated in the introduction from lines 114-117.

Vanessa Christinet
ASCRES, Lausanne, Switzerland
Thank you for inviting me to review this article on a comparative analysis of visual diagnosis of FGS between two experts in visual diagnosis of cervical images. The authors chose to explore the relationship between the diagnosis based on image analysis by the two reviewers and several objective measurements of S. haematobium infection, and also with symptoms described by the patients.
This study and its results are very relevant and interesting as there is little produced in this field of research. It is indeed very important to evaluate diagnosis by visual inspection of the cervix as this is the recommended method of reference but there is a lack of data on the reliability of the diagnosis.
As this study is nested within the HPTN 071 study, I would recommend briefly describing the HPTN 071 study in a single sentence, even if it has already been described in other publications, in order to fully understand the profile of the participants and the general context of the study.
Indeed, the question of HIV diagnosis and the level of immunosuppression (if known) should be addressed as it is certainly an important confounding factor for the interpretation of the visualized lesions and could induce an important heterogeneity of cervical lesions making them more difficult to interpret (association between HIV and FGS, association between HIV and other STIs, association between HIV-HPV-dysplasia).
In the description of the method, it is stated that STIs were not screened for POC but later in the text it is written that a whole panel of STIs was tested. Considering that the authors state that microbiological STI diagnosis can be a confounding factor in the diagnosis and symptoms associated with FGS, it seems important to me that they are taken into account in the analysis. As this is not the case here, it would be relevant to justify why.
Cervical cancer screening with visualization of dysplasia is described in the methodology but is not included in the comparative analysis between the two reviewers despite the fact that it is described as a confounder in terms of vascularization.
In the statistical analysis section it is described that the OR was adjusted for age. I would like the authors to describe the rationale for this adjustment and why other confounding factors were not included in this adjustment (STI, HIV, cervical dysplasia). Age seems to me to be relevant for the adjustment of the association with delay of conception, but I do not understand the interest of using it for the other variables (it should be noted that the unadjusted and adjusted ORs are not very different, which speaks to a moderate interest for the inclusion of this adjustment factor).
In the method, additional detail on the number of photos per patient sent to the reviewers would be an enhancement as this could give more information on what they based their evaluation on. It would also be interesting to have more detailed information on the quality of the images to help understand the discrepancy between the two reviewers' analysis. Was a precise description of the lesions made by each of the reviewers which would allow a better understanding of the level of disagreement?
It is noteworthy that although not reaching statistical significance, both reviewers consistently identified a greater proportion of FGS lesions in individuals with objective markers of schistosome infection. If the sample size had been larger, statistical significance would probably have been reached for some variables. If the data are available, I think it would be very interesting to exploit the STI diagnoses here to see how they probably contributed to the heterogeneity of the analysis.
I find the presentation of the results not very easy to understand. The authors refer several times to the non-association without describing precisely the significance of the analysis which is actually the lack of a statistically significant difference between patients diagnosed by positive or negative visual inspection by the reviewers in relation to objective parameters of schistosome infection. Again, it should be noted that even if the differences are not statistically significant, the ORs are still positive in favour of the objective measurements of schistosomiasis. I suggest rewording the text below Table 3.
It is interesting to note that the symptom with the highest OR for both reviewers is for the association with haematuria, which is the most specific symptom of Schistosoma haematobium infection.

If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Sexual health, HIV, communicable disease, tropical medicine, public health, epidemiology.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 05 Apr 2023

Amy Sturt
"As this study is nested within the HPTN study, I would recommend to briefly describe the HPTN study in a single sentence, even if it has already been described in other publications, in order to fully understand the profile of the participants and the general context of the study." Thank you for this feedback. The HPTN 071 (PopART) trial was a cluster randomized trial assessing the impact of an HIV-1 combination prevention package including "universal testing and treatment". We have added this information in lines 124-125.
"Indeed, the question of HIV diagnosis and the level of immunosuppression (if known) should be addressed as it is certainly a confounding factor for the interpretation of the visualized lesions and could induce an important heterogeneity of cervical lesions making them more difficult to interpret (association between HIV and FGS, association between HIV and other STI, association between HIV-HPV-dysplasia)." Data were not collected from participants regarding CD4 count or HIV viral load for this study.

We agree that HIV-1 status should be assessed in future manuscripts which would evaluate associations between HIV and STI or HIV, HPV and cervical dysplasia. For the primary outcome of this study (agreement between expert reviewers), we acknowledge that HIV could induce heterogeneity in the lesions. However, since the participants' HIV-1 diagnosis in this study was concordant between reviewers, and our primary purpose was to describe agreement between expert reviewers (irrespective of other participant characteristics), we believe that it is appropriate to describe agreement in visual-FGS diagnosis without adjustment for HIV or other factors that could influence the nature of the lesions.
"In the description of the method, it is stated that STIs were not screened for POC but later in the test it is written that a whole panel of STI was tested. Considering that the authors state that microbiological STI diagnosis can be a confounding factor in the diagnosis and symptoms associated with FGS, it seems important to me that they are taken into account in the analysis. As this is not the case here it would be relevant to justify why." Thank you for this enquiry. "In the statistical analysis section it is described that the OR was adjusted for age. I would like the authors to describe the rationale for this adjustment and why other confounding factors were not included in the adjustment (STI, HIV, cervical dysplasia). Age seems to be to be relevant for the association with delay of conception but I do not understand the interest of using it for the other variables (it should be noted that the unadjusted and adjusted ORs are not different which speaks to a moderate interest for the inclusion of this adjustment factor)." Thank you for the opportunity to clarify the statistical methods. Age was prioritized as a confounder to include since it has been associated in the literature with both FGS and presence of symptoms. We also investigated the effect of adjusting for HIV status and found it made no difference to our findings. As described above, STI and cervical cancer testing were performed on a subset of well under half of the participants. Including these variables as potential confounders in the regression would have led to further data sparsity. Despite this, we conducted an additional exploratory analysis where we controlled for STI (and separately, VIA) when assessing the association between FGS diagnosis and symptoms, restricted to the sub-group of participants who had VIA (or STI) information available (i.e. in order to assess the potential confounding affect of VIA among the same subset of participants, comparing like with like). 
In this exploratory analysis we found no suggestion that VIA acts as a confounder. As already noted, power to detect associations was reduced in this analysis.
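The comparison of unadjusted and age-adjusted odds ratios discussed in this response can be illustrated with a stratified estimate. The sketch below uses the Mantel-Haenszel pooled odds ratio across age strata, which is a simpler, swapped-in technique and not the regression adjustment described by the authors. The function names, the two age bands, and all counts are hypothetical and are not study data.

```python
# Hypothetical illustration only: Mantel-Haenszel pooling across age strata,
# used here as a simple stand-in for the age adjustment done by regression.
# Each stratum is (a, b, c, d):
#   a = visual-FGS positive with symptom, b = visual-FGS positive without symptom,
#   c = visual-FGS negative with symptom, d = visual-FGS negative without symptom.

def crude_or(strata):
    """Unadjusted odds ratio from the 2x2 table collapsed over all strata."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    if b * c == 0:
        raise ValueError("odds ratio undefined: zero cell in denominator")
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Stratum-adjusted odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no discordant information")
    return numerator / denominator

# Made-up counts for two assumed age bands (not BILHIV data).
age_strata = [
    (10, 20, 5, 40),   # younger band (assumed)
    (15, 10, 10, 20),  # older band (assumed)
]

print("crude OR:", round(crude_or(age_strata), 2))
print("age-adjusted (MH) OR:", round(mantel_haenszel_or(age_strata), 2))
```

When the crude and stratum-adjusted estimates are close, as the reviewer observed for the reported ORs, the stratifying variable is contributing little confounding in that sample.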
"In the method, additional data on the number of photos per patient sent to reviewers would be an enhancement as this could give more information on what they based their evaluation on. It would also be interesting to have more detailed information on the quality of the images to help understand the the discrepancy between the two reviewers' analysis. Was a precise description of the lesions made by each of the reviewers which would allow a better understanding of the level of disagreement. It is noteworthy that although not reaching statistical significance, both reviewers consistently identified a greater proportion of FGS lesions in individuals with objective markers of schistoomiasis. If the sample size has been larger, statistical significance would probably be reached for some variables. If the data are available, I think it would be very interesting to exploit the STI diagnoses here to see how they probably contributed to the heterogeneity of the analysis." "I find the presentation of the results not very easy to understand. The authors refer several times to the non-association without describing precisely the significance of the analysis which is actually the lack of a statistically significant difference between patients diagnosed by positive or negative visual inspection by the reviewers in relation to objective parameters of schistosome infection. Again it should be noted that even if the difference is not statistically significant, the ORs are still positive in favor of thee objective measurements of schistosomiasis. I suggest rewording the text below Table 3. It is interesting to note that the symptom with the highest OR for both reviewers is for the association with hematuria, which is the most specific symptoms of Shaematobium infection." Thank you for this input and we apologize for any lack of clarity in the presentation of the results.

Daniela Fusco
Infectious Disease Epidemiology, Bernhard Nocht Institute for Tropical Medicine, Hamburg, Germany
The manuscript deals with a very relevant topic in need of urgent solutions. The methods and design are adequate for the study and the results are properly described.
Some minor recommendations to the manuscript:
○ Title: In the title exclusively the aspect of the agreement of experts emerges, while in the results there is a long description of a regression analysis to associate factors (i.e. symptoms) with visual FGS. This aspect should emerge in the title. Additionally, in the abstract it is stated that "The agreement of expert human reviewers for visual-FGS has not been previously described". In this view, the hand-held colposcopy doesn't play a central role in the study, hence removing it from the title should be considered.
○ Methods: In the paragraph "Home and clinic-based sample collection" the type of health care professionals performing the sampling should be specified. In the paragraphs "qPCR for detection of Schistosoma DNA" and "Other infections" sample storage conditions should be specified.
○ Discussion: As in the title, in this statement "This study shows "slight" agreement between senior, highly experienced expert reviewers, highlighting the imperfect nature of human expert review of images obtained with hand-held colposcopy for FGS." the reader can have the feeling that the agreement might be different if another type of colposcope were used. It would be advisable to reconsider the statement. From reference 26 the statement "The Tanzanian study illustrates the limited specificity of visual techniques, since one-third of the women had cervical lesions in communities where S. haematobium is not endemic." cannot really be deduced since, e.g., travel or medical history of women is not described. In this view this statement doesn't seem to reflect the message of the reference and it should be reconsidered. The conclusion statement "...we suggest caution when visual imaging is used as a stand-alone FGS diagnostic." should bring some recommendations, e.g. on how to interpret or obtain better results through colposcopy since, so far, colposcopy is still the diagnostic standard for the disease and alternatives, even if urgent and needed, are not really available.
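The "slight" agreement quoted in the Discussion comment above refers to Cohen's kappa, which compares observed agreement between two reviewers with the agreement expected by chance. As a minimal sketch of that calculation for a binary positive/negative call (the function name and the counts are illustrative assumptions, not the study data or the authors' code):

```python
def cohen_kappa(both_pos, r1_only, r2_only, both_neg):
    """Cohen's kappa for two reviewers making a binary call on the same images.

    both_pos: images both reviewers called positive
    r1_only:  images only Reviewer 1 called positive
    r2_only:  images only Reviewer 2 called positive
    both_neg: images both reviewers called negative
    """
    n = both_pos + r1_only + r2_only + both_neg
    if n == 0:
        raise ValueError("no jointly reviewed images")
    observed = (both_pos + both_neg) / n
    r1_pos = both_pos + r1_only
    r2_pos = both_pos + r2_only
    # Chance agreement: both say positive by chance plus both say negative by chance.
    expected = (r1_pos * r2_pos + (n - r1_pos) * (n - r2_pos)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa undefined: both reviewers gave a single identical label")
    return (observed - expected) / (1.0 - expected)

# Hypothetical counts chosen for easy arithmetic (not BILHIV data):
# observed agreement 35/50 = 0.70, chance agreement 0.50.
print(round(cohen_kappa(both_pos=20, r1_only=5, r2_only=10, both_neg=15), 2))
```

Because kappa discounts chance agreement, two reviewers with very different positivity rates can agree on a fair share of images and still obtain a low kappa. On the commonly used Landis and Koch scale, values from 0.00 to 0.20 are labelled "slight".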

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound?
human expert review of images obtained with hand-held colposcopy for FGS". The reader can have the feeling that the agreement might be different if another type of colposcope would be used. It would be advisable to reconsider the statement. From the reference 26 the statement "the Tanzanian study illustrates the limited specificity of visual techniques since 1/3 of the women had cervical lesions in communities where S. haematobium is not endemic" cannot really be deduced since i.e. travel or medical history of women is not described. In this view this statement doesn't seem to reflect the message of the reference and it should be reconsidered. The conclusion statement, "we suggest caution when visual imaging is used as a stand-alone FGS diagnostic" should bring some recommendations on how to interpret or obtain better results through colposcopy, since, so far, colposcopy is still the diagnostic standard for the disease and alternatives, even if urgent and needed, are not really available." Thank you for your input on the discussion.
1. Up to 50 mL of urine was centrifuged and examined. The authors should provide the specific amount of urine centrifuged, and explain why 50 mL was used rather than the 10 mL recommended by WHO (2022; reference 1), for ease of comparison with other studies.
2. Figure 1. Flowsheet of cervicovaginal image review. The authors should explain what "Images inaccessible" means, particularly as it was among those who attended clinic.
3.
of FGS/PCR detection of FGS were evaluated.
The topic is highly relevant and only a few studies have evaluated the accuracy of visual diagnosis of FGS, even though visual inspection is the gold standard, as also highlighted by the authors.

Introduction:
No comments.

Methods:
I would recommend more detailed information on the hand-held colposcopy and image review.
There is no information on how the images were captured or whether there was a standard protocol specifying which areas to include in the image portfolio (was it only the cervix or also the vaginal walls?). Were there any criteria for image quality (for example, lighting or zoom level) when capturing the images? How many images were captured per woman? It seems that the reviewers excluded images on slightly different bases. Before the review, did you guide the reviewers on how to include or exclude images? For example, did the reviewers review the images on a computer, and were the computers similar (for example, screen resolution, color settings, brand, etc.)? Did they have the opportunity to zoom the image or change color/brightness/contrast/saturation? My experience is that an image can look completely different if you evaluate it on two different computers. Something as small as daylight versus artificial light can change the image. Could different settings or equipment account for some of the difference between reviewers?
In the methods section I can't find information on how women reported the abdominal, genitourinary, and reproductive manifestations (Table 4): by questionnaire, interview, etc.?

Results:
I am not surprised that you do not find an association between visual-FGS and PCR-FGS. I believe that cervicovaginal lesions are the result of chronic inflammation and therefore other tests should be used if an association is to be found.

Discussion:
You discuss the study from Madagascar, where the gridded image technique was used. And you are correct that all included women in that study were to have FGS lesions. I just need to address two things: 1) the presence of rubbery papules was not an inclusion criterion for the study. Therefore, the uncertainty about whether a lesion was present or not was indeed an issue. Some of the reviewers in that study found that more than 20% of the women had no rubbery papules. 2) the images used for the study were not from the inclusion visit, but from later visits where treatment with praziquantel had been initiated. For that reason, some of the images could be without FGS lesions.
I agree with you that consensus on how to evaluate the images is very important. I believe that if consensus had been reached before the reviewing procedure, your results would have been very much different.
before the review guide the reviewers on how to include or exclude the images? For example, did the reviewers review the images on a computer and were the computers similar (for example resolution on the screen, color setting, brand etc.)? Did they have the opportunity to zoom the image or change color/brightness/contrast/saturation? My experience is that an image can look completely different if you evaluate it on two different computers. Something as small as daylight versus artificial light can change the image. Could it be different settings/equipment that account for some of the difference between reviewers?" At their discretion, expert reviewers could exclude images that they felt could not be evaluated due to technical issues, image quality, or limited cervical visualization. The reviewers independently evaluated the images on desktop computers. The reviewers were able to zoom and make adjustments to the digital images as needed. We feel it is important to consider that in real-world conditions, if a number of technical factors are required to ensure successful image review, the successful roll-out and scale-up of this method of diagnosis will be constrained. To replicate real-world conditions, standardized equipment was not provided to the reviewers by the BILHIV study. However, you are correct that we cannot exclude that differences in settings and equipment may theoretically contribute to some differences between reviewers. We have added this as a limitation in the discussion of the manuscript at lines 392-395.
"In the methods section I can't find information on how women reported the abdominal, genitourinary, and reproductive manifestations (table 4) -questionnaire, interview etc.?" Women reported symptoms in a structured questionnaire with a community health worker; this is described in line 139.
"Results: I am not surprised that you do not find an association between visual-FGS and PCR-FGS. I believe that cervicovaginal lesions are results of chronic inflammation and therefore other tests should be used if an association should be found." We agree that visual-FGS represents a different phenotype than PCR-FGS. Thank you for your input on the results.

"Discussion:
You discuss the study from Madagascar, where gridded image technique was used. And you are correct that all included women in that study were to have FGS lesions. I just need to address two things 1) the presence of rubbery papules was not an inclusion criterion for the study. Therefore, the uncertainty about whether a lesion was present or not was indeed an issue. Some of the reviewers in that study found, that more than 20% of the women had no rubbery papules. 2) the images used for the study were not from the inclusion visit, but from later visits where treatment with Praziquantel had been initiated. For that reason, some of the images could be without FGS lesions." Thank you for the helpful clarifications regarding your outstanding work from Madagascar. In light of the further information you have provided, we have removed the sentences in question in the discussion. These formerly read "However, it is notable that in the Madagascan study, all images were thought to contain FGS lesions, removing the burden of uncertainty and highlighting that in settings where images are known to contain FGS lesions, agreement between reviewers was at best "moderate". I agree with you that consensus on how to evaluate the images is very important. I believe that if consensus had been reached before the reviewing procedure, your results would have been very much different. Thank you for this input and we included the absence of an initial consensus review as a limitation in the discussion at line 407.
Competing Interests: No competing interests were disclosed.