Self-supervised learning-based cervical cytology for the triage of HPV-positive women in resource-limited settings and low-data regime

Screening Papanicolaou test samples has proven to be highly effective in reducing cervical cancer-related mortality. However, the lack of trained cytopathologists hinders its widespread implementation in low-resource settings. Deep learning-based telecytology diagnosis emerges as an appealing alternative, but it requires the collection of large annotated training datasets, which is costly and time-consuming. In this paper, we demonstrate that the abundance of unlabeled images that can be extracted from Pap smear test whole slide images presents a fertile ground for self-supervised learning methods, yielding performance improvements relative to readily available pre-trained models for various downstream tasks. In particular, we propose Cervical Cell Copy-Pasting (C³P) as an effective augmentation method, which enables knowledge transfer from open-source and labeled single-cell datasets to unlabeled tiles. Not only does C³P outperform naive transfer from single-cell images, but we also demonstrate its advantageous integration into multiple instance learning methods. Importantly, all our experiments are conducted on our introduced in-house dataset comprising liquid-based cytology Pap smear images obtained using low-cost technologies. This aligns with our objective of leveraging deep learning-based telecytology for diagnosis in low-resource settings.


Introduction
Cervical cancer is considered nearly completely preventable but continues to be a leading cause of cancer mortality. In 2020, about 342 000 women died from this disease, most of them in developing countries where cytology-based screening programs to detect and treat precancerous lesions are not available or affordable Sung et al. (2021).
Given that human papillomavirus (HPV) is the etiological factor that drives cervical cancer development, secondary prevention with HPV testing has, in recent years, become the preferred screening method in many high-income settings. It is recommended by the WHO for women aged >30 years in low-and-middle-income countries (LMICs) Organization et al. (2021). Its high sensitivity and negative predictive value in detecting cervical intraepithelial neoplasia grade 2 or worse (≥CIN2) allow extended screening intervals. Recently, the development of fully automated diagnostic devices providing rapid HPV testing of self-obtained vaginal samples has offered a great opportunity to improve the effectiveness of cervical cancer prevention in low-resource contexts Saidu et al. (2021).
However, a single HPV test has limited specificity and can lead to unnecessary workup and overtreatment. Therefore, a triage strategy is required for HPV-positive women to mitigate this difficulty. Cytology is generally proposed as it is an effective method for triaging HPV-positive women von Karsa et al. (2015), but in low-resource settings, various logistic and operational reasons prevent successful cytology implementation. Amongst other barriers, cytological triage can be time-consuming, and in countries that use cytology as a triage method, results are typically unavailable on the same day as sample collection. In lower-income settings, loss to follow-up makes this a seriously limiting problem. In these settings, therefore, rapid tests that give same-day results and lead to decisions about treatment are preferred.
A solution for countries with limited resources could be affordable digital imaging technology for real-time remote cytologic diagnosis by specialists Vassilakos et al. (2023). Using this scheme, the preparation and digitization of cervical smears from HPV-positive women would be performed on-site during the same visit using a "test-triage-and-treat" approach (3T-approach) Levy et al. (2020). This process eliminates the need for in-house cytopathologists and might allow for reliable, cost-effective triage of HPV-positive women. Furthermore, to facilitate the visual analysis of Pap slides and reduce the screening time, deep learning-based algorithms could be used to obtain a rapid and accurate cytological diagnosis allowing a "same-day treatment".
The emergence of affordable and portable high-resolution scanners, such as the Grundium Ocus ® 40, along with low-cost slide preparation procedures like SurePath ™, creates a favorable environment for this endeavor. When it comes to the learning algorithm, the main expenses are associated with the annotation process and the level of expertise it demands. Nevertheless, acquiring a large and well-curated annotated dataset proves challenging and time-consuming. Therefore, we investigate the application of self-supervised learning (SSL) methods to effectively utilize the abundant unlabeled images freely available from whole slide images (WSIs) of Pap smear tests. Similarly, we analyze the potential and difficulties of deep learning-based cervical cytology diagnosis and systematically report our findings. Specifically, we unveil the following aspects of deep learning-based Pap smear cytology:
• We thoroughly evaluate the ability of models pre-trained with self-supervised learning to learn meaningful visual representations of cytology images for various downstream tasks. In particular, the resulting representations show superior discriminability and generalizability;
• Our experiments reveal that representations learned with publicly available single cervical cell datasets, e.g., Herlev Jantzen et al. (2005), or Sipakmed Plissiti et al. (2018), do not generalize well to different modalities such as images representing multiple cells. To mitigate this issue, we propose a data augmentation strategy tailored for cytology images dubbed Cervical Cell Copy-Pasting (C³P). Furthermore, we demonstrate the effectiveness of C³P for learning generalizable representations from single-cell datasets;
• We experimentally observe that multiple instance learning (MIL), the commonly used strategy for obtaining WSI-level representations and predictions, does not fully exploit the inherent properties of Pap smear cytology slides. Consequently, we introduce a set of simple yet effective modifications, e.g., processing only the top-k most suspicious instances, to better align MIL methods with Pap smear test images;
• We present a medium-size liquid-based cytology Pap smear test images dataset from HPV-positive women. The slides of this dataset are prepared with the SurePath ™ procedure, which results in a small cell-deposit area. This shortens the time of digitization and yields smaller WSI files, which is ideal for our telecytology-based same-day "test-triage-and-treat" objective. The presented dataset is particularly challenging as all samples are from HPV-positive women, and negative slides typically portray signs of infections, which complicates the diagnosis.

Related Work
The advent of digital microscopy and the recent evidence that digitally based diagnostic performance is on par with light microscopy for Papanicolaou test slides Kholová et al. (2022) provide an interesting alternative. The added possibility of incorporating computer-assisted and/or automated diagnostics makes this an even more exciting prospect. Histopathology-based assessment is considered the "gold standard" for most cancer-related diagnostics, and histology attracts far more attention from the machine learning community Abbet et al. (2022); Stegmüller et al. (2023); Bozorgtabar et al. (2021) than cytology. Recently though, cytopathology has gained more traction and recognition, as it offers a non-invasive and inexpensive diagnostic tool suited to resource-constrained countries. Consequently, there have been significant advancements in machine-learning approaches applied to cytology.
Most of these advancements focus on cell-level tasks, e.g., classification Lin et al. (2019); Albuquerque et al. (2021), detection Liang et al. (2021); Li et al. (2019) or segmentation Hussain et al. (2020). These innovations aim to improve the efficiency and accuracy of cervical cancer screening and other forms of cancer diagnosis from cytological samples. The classification of whole slide Pap smear test images remains significantly less studied, despite its practical application being the most promising. Notable works on the topic include Cheng et al. (2021), who combined low- and high-resolution stages for the identification/localization of suspicious lesions and their classification. The high-resolution stage relies on a recurrent neural network (RNN)-based classification model to predict the WSI-level scores. Another study, Wei et al. (2021), leveraged a YOLO-based Redmon and Farhadi (2018) approach to generate cell/tile-level predictions in the first stage and a transformer model for the aggregation and WSI-level classification in the second stage. Similarly, Cao et al. (2021) proposed the integration of an attention module to detect abnormal cells in large patches and computed the abnormality probability of a given patch as the average of its constituent cells. The WSI-level score is obtained by averaging the abnormality of its patches. Recently, Li et al. (2023) proposed a three-stage pipeline for lung cancer cytopathological WSI classification. Their approach integrates a transformer-based model to extract fine-grained lesion features, which are then aggregated into intermediate patch-level features, and coarse-grained features for final WSI-level classification.

Datasets
In-house dataset. We present our in-house dataset composed of a cohort of 307 Pap smear slides from HPV-positive patients. The prevalence of cytology-positive slides is approximately 20%, translating to 69 positive and 238 negative slides. The preparation of the slides follows the SurePath ™ procedure. This choice is motivated by the overall objective of the telecytologic diagnosis of cervical smears for the triage of HPV-positive women in a resource-limited setting. Indeed, this preparation yields a small cell-deposit area, which shortens the scanning time and reduces the size of the digitized slides. Additionally, the SurePath ™ procedure exists in a manual and low-cost version to further ease its adoption in a low-income setting. Along the same line, the slides are digitized with the Grundium Ocus ® 40 scanner: a portable and affordable solution. The WSIs are acquired with a 12-megapixel image sensor, a 40x objective, and Z-stacking (3 focal planes spaced by 1).
After digitization, cell-level annotations are obtained using QuPath Bankhead et al. (2017), resulting in a total of 1228 annotated positive cells.The annotations are used to create a dataset of 1228 positive cell images and as many

Method & Experiments
In Section 1, we highlight the need to develop robust machine-learning methods for diagnosing cervical cancer based on Pap smear cytology in a low-resource setting. The emergence of affordable and transportable high-resolution scanners, e.g., Grundium Ocus ® 40, as well as the existence of low-cost slide preparation procedures (SurePath ™), make for a particularly fertile ground for this enterprise.
Nonetheless, collecting extensive and meticulously curated annotated data is time-consuming and expensive. Consequently, we investigate self-supervised learning methods and approaches that require few labeled samples. More precisely, in Section 4.1, we provide empirical evidence that self-supervised learning approaches can be successfully leveraged to learn meaningful representations of multiple-cell images, i.e., unlabeled tiles. Our proposed cell augmentation method C³P is discussed and extensively tested in Section 4.2. Finally, in Section 4.3, we provide and discuss simple, yet effective tools tailored to existing MIL methods for cytology Pap smear test WSIs.

How well do self-supervised models transfer to cytology images?
While self-supervised learning (SSL) methods have gained attention for diverse downstream tasks in histopathology images, these approaches have received little attention for cytology-related tasks due to little evidence of their feasibility on Pap smear cytology images. Most state-of-the-art SSL methods He et al. (2020); Zhou et al. (2021); Caron et al. (2020, 2021) for image-level representation learning rely on maximizing the similarity of an image's representation under information-preserving transformations. Crucially, one of these transformations is a spatial crop, which is at risk of losing its information-preserving property on cytology images as a consequence of the slide preparation, which breaks long-range spatial dependencies. On the other hand, each digitized cytology slide can yield thousands of unlabeled images, which indicates that cytology could potentially be a playground where SSL methods thrive.

Self-supervised pre-training. We pre-train our models using DINO Caron et al. (2021) as the self-supervised learning framework. This choice is motivated by its strong nearest neighbor classifier capability and excellent performance across different backbones. DINO relies on a pair of Siamese teacher-student networks and a knowledge distillation approach. The underpinning principle of the method is to train the student network to mimic the teacher's output distribution when both models are fed with distinct views of the same input image. DINO leverages both global views and local views. The former typically span a larger image region and capture image-level dependencies, while the latter occupy a fraction of the image and yield localized features. By leveraging views at different scales, local-to-global consistency can be distilled from the teacher to the student network. Compared to contrastive learning approaches (Chen et al., 2020; He et al., 2020), self-distillation methods (Caron et al., 2021; Grill et al., 2020) must explicitly avoid the collapse of the learned representation to trivial solutions. In particular, DINO only updates the teacher network's weights with an exponential moving average (EMA) of those of the student network. Towards the same objective, the entropy of the teacher's output distribution is constrained with sharpening and centering tricks.
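The three ingredients named above (EMA teacher update, centering, sharpening) can be illustrated with a minimal NumPy sketch. It is not the official DINO implementation: the function names, the toy logits, and the momentum and temperature values are illustrative assumptions, and the center is computed from a single batch, whereas in practice it is a running mean.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax along the last axis."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student's."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    # Centering subtracts the mean logits; the low temperature sharpens.
    teacher_probs = softmax(teacher_logits - center, teacher_temp)
    student_log_probs = np.log(softmax(student_logits, student_temp))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 8))  # toy batch: 4 views, 8 output dims
student_logits = rng.normal(size=(4, 8))
center = teacher_logits.mean(axis=0)      # assumption: single-batch center
loss = dino_loss(student_logits, teacher_logits, center)
new_teacher = ema_update(np.zeros(3), np.ones(3), momentum=0.9)
```

Only the student would receive gradients from this loss; the teacher changes exclusively through `ema_update`.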
We experiment with two types of architecture, ResNet-50 He et al. (2016) and vision transformer (ViT), ViT-S/16 Dosovitskiy et al. (2020), for which we use the recommended parameters available on the official repository. The arguments only differ from the recommendations for the batch size and the number of local crops. The batch size is set to fill the available GPU memory, i.e., batch_size = 256 for a ResNet-50 and batch_size = 192 for a ViT-S/16. We do not use local crops as they can result in ambiguous positive pairs for non-object-centric datasets, as is the case here. For each architecture, we train one model per stratified split (see Sec. 3) for 300 epochs; hence we obtain four pre-trained models for each architecture.
In all the following experiments, we compare the quality of the learned visual representations under the above-described setting to the ones obtained under a supervised pre-training on ImageNet-1k. For the ResNet-50 architecture, we use the weights provided by PyTorch Paszke et al. (2019), whereas, for the ViT-S/16, we rely on the weights of Touvron et al. (2021) (trained without distillation).

Cell-level classification. After model pre-training, we probe the quality of the learned features on a cell-level classification task. We opt for a k-NN classifier to limit manual intervention to a minimum, thereby obtaining results that reflect the quality of the learned representations. We use two publicly available cervical cell Pap smear test datasets: the Herlev dataset Jantzen et al. (2005) and the Sipakmed dataset Plissiti et al. (2018). These datasets are randomly split in train/validation with a 75/25 partition. We report the mean and standard deviation of the class-wise and weighted $F_1$ scores over 4 independent runs for each pre-trained model, i.e., a total of 16 for the models pre-trained under the self-supervised framework (see Sec. 4.1) and 4 runs for the supervised ones. The number of neighbors $k$ is selected to maximize the weighted $F_1$ score.
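The evaluation protocol described above (k-NN on frozen features, 75/25 split, $k$ chosen to maximize the weighted $F_1$ score) can be sketched as follows. This is a hedged illustration: the synthetic two-cluster features stand in for backbone embeddings, and the helper names and the candidate values of $k$ are assumptions, not the authors' settings.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k):
    """Majority vote among the k nearest training features (Euclidean)."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in train_y[nearest]])

def weighted_f1(y_true, y_pred):
    """Class-wise F1 scores averaged with weights equal to class support."""
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        total += f1 * np.mean(y_true == c)
    return float(total)

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters stand in for frozen backbone features.
features = np.concatenate([rng.normal(0, 1, (40, 16)), rng.normal(5, 1, (40, 16))])
labels = np.array([0] * 40 + [1] * 40)
order = rng.permutation(80)
train_idx, val_idx = order[:60], order[60:]  # 75/25 split

scores = {
    k: weighted_f1(labels[val_idx],
                   knn_predict(features[train_idx], labels[train_idx],
                               features[val_idx], k))
    for k in (1, 3, 5, 7)  # hypothetical candidate neighborhood sizes
}
best_k = max(scores, key=scores.get)
```

Because the backbone is never fine-tuned in this protocol, the score depends only on the geometry of the embedding space, which is why it serves as a probe of representation quality.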
In Tables 1 and 2, we observe that despite being pre-trained without any labels and not on isolated cells, the models resulting from DINO's pre-training are on par with or better than the ones pre-trained on ImageNet-1k, which are competitive baselines and the de facto choice for most practitioners. It further appears that the Herlev dataset is more challenging, especially with fine-grained class labels. However, the representations are good enough to differentiate negative cells from positive ones.

Tile-level classification. The representations learned via self-supervised learning are also evaluated on a tile-level classification task. As for the cell-level classification task, we rely on a k-NN approach. To that end, we prepare a labeled tiles dataset composed of our in-house 1228 positive tiles (see Sec. 3) and as many tiles randomly sampled from negative slides. The k-NN classifier is fitted on 75% of the resulting dataset and tested against the remaining 25%. We use the same evaluation setting as for the above-described cell-level classification task. Various conclusions can be drawn from the results depicted in Table 3.
First, we achieve superior performance for tile-level classification using a simple k-NN classifier and off-the-shelf pre-trained models. This is noteworthy because a given positive tile typically contains more negative cells than positive ones; hence it could be anticipated that the positive signal of a tile would get overshadowed. Secondly, we observe that the self-supervised pre-training yields a significant boost in performance when the pre-training and target datasets are well aligned. Overall, it is remarkable that the self-supervised pre-trained models of DINO transfer well to cytology images, although the DINO approach is originally tailored for object-centric datasets. Furthermore, the quality of the classification obtained with only a k-NN classifier seems to imply that the SSL models do not encode multiple cells as a single pattern, as it would not allow for the matching of positive tiles. We postulate that this is a consequence of the random cropping operation.

Cervical Cell Copy-Pasting: C³P
In Section 4.1, we discuss the applicability of self-supervised learning to cytology images and report evidence of its effectiveness. As much as self-supervised learning is an adequate approach that can yield semantically coherent clusters of image representations, it does not allow for the labeling of the aforementioned clusters. We investigate if this labeling operation can be performed using publicly available datasets. The major obstacle to achieving this objective is that most public datasets provide cell-level annotations, whereas many cytology tasks, e.g., whole slide image classification, require patch/tile-level representations and annotations. Consequently, we first show that naively using models trained on cell-level datasets does not transfer well to tile-level downstream tasks. We then propose a simple yet effective method to alleviate this issue. The results reported in Table 4 clearly show that a direct transfer learning from cells to tiles with a k-NN classifier performs poorly. More precisely, it can be observed that the models pre-trained in a supervised manner cannot detect the discriminant signal from the positive tiles. Nonetheless, we cannot conclude that the failure is a consequence of the shift in modality, i.e., single-cell images to multi-cell images, and not due to the small capacity of the classifier, the backbone, or another domain discrepancy between the source and target datasets. Furthermore, the self-supervised pre-trained models generalize better on this task.

Cells to tiles transfer learning with pasting. The above paragraph reveals the inability of a k-NN classifier to transfer learning from cell images to tile images. To shed some light on the underlying cause of this failure, we repeat the same experiment except that, as a pre-processing step, we use the proposed augmentation, i.e., we paste all the cells from Herlev or Sipakmed upon randomly sampled tiles from negative slides, referred to as canvases. The label of the pasted cell is attributed to the resulting pasted tile. In this first pasting scenario, we use the most straightforward pasting technique, which is referred to as the paste strategy.

Table 5. Transfer learning results from Herlev and Sipakmed to our in-house labeled tiles dataset using C³P-paste. A k-NN classifier is fitted on the pasted cells from the Herlev (H) and Sipakmed (S) datasets, and evaluated on the in-house set of positive cells/tiles. The features are extracted by a ViT-S/16 or a ResNet-50 pre-trained under a supervised pre-training on ImageNet-1k or a self-supervised pre-training on our in-house unlabeled tiles dataset using DINO. The highest mean score for a given source dataset and backbone is highlighted in bold.

paste: The strategy relies on a two-step procedure to paste a cell on a tile: i) the pasting location of the cell is uniformly sampled among all the positions that would allow the cell to fit entirely in the tile, and ii) the pixels of the tile in the pasting site are replaced by those of the cell.
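The two-step paste procedure can be written in a few lines of NumPy. The sketch below is illustrative only: the function name, the 224×224 tile size, and the constant-valued stand-in images are assumptions rather than details taken from the paper.

```python
import numpy as np

def paste_cell(canvas, cell, rng):
    """Paste `cell` on a copy of `canvas` at a uniformly sampled valid location."""
    H, W = canvas.shape[:2]
    h, w = cell.shape[:2]
    # i) sample the top-left corner among positions where the cell fits entirely
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    # ii) replace the canvas pixels at the pasting site by those of the cell
    out = canvas.copy()
    out[top:top + h, left:left + w] = cell
    return out, (top, left)

rng = np.random.default_rng(0)
canvas = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a negative tile
cell = np.full((64, 48, 3), 255, dtype=np.uint8)  # stand-in for a labeled cell
cell_label = 1
tile, (top, left) = paste_cell(canvas, cell, rng)
tile_label = cell_label  # the pasted tile inherits the label of the cell
```

The canvas is copied before pasting, so the same negative tile can be reused as a canvas for several cells.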
As can be seen in Table 5, the proposed augmentation significantly improves the ability of the classifier to detect positive cells in tiles in spite of the large distribution shift between the cells of Herlev/Sipakmed and the ones represented in our in-house tiles, the small capacity of the classifier, and the fact that tiles resulting from paste do not look natural. The t-SNE Van der Maaten and Hinton (2008) mapping depicted in Figure 4 shows that the representations of cells and labeled tiles are mapped to different regions of the space. Conversely, positive cells augmented with C³P-poisson appear to be close to groups of positive tiles, reflecting the improved alignment obtained with our augmentation strategy.

Pasting technique:
In Table 5, we showed that the proposed pasting method could significantly improve the transferability from open-source single-cell datasets to tiles representing multiple cells. Although the pasting method (paste) used to generate the results depicted in Table 5 works, it is coarse and does not produce natural-looking images. As such, it can result in the model focusing exclusively on the pasted regions throughout training, hence performing poorly at test time. Therefore, we investigate if this scenario occurs and if better alternatives exist. In addition to paste, we test two other alternatives referred to as blend and poisson. Examples of samples obtained with the different pasting methods are depicted in Figure 3.

blend: The only difference w.r.t. paste is that, instead of replacing the pixels of the canvas with those of the cell, the pixels of the pasting site result from a convex combination of those of the cell and canvas:

$$x_{\text{site}} = \alpha_{\text{paste}} \, x_{\text{canvas}} + (1 - \alpha_{\text{paste}}) \, x_{\text{cell}} \qquad (1)$$

where $\alpha_{\text{paste}}$ is sampled uniformly at random from the interval [0, 1]. Due to the transparency of the pasting operation, the resulting images look more natural as it mimics the effect of overlapping cells and the border of the cell image is less visible.
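A minimal sketch of the blend strategy follows. It assumes that the mixing weight, called `alpha` here, multiplies the canvas pixels, which is consistent with the later remark that a high value makes the pasted content barely visible. The function name and the toy images are illustrative.

```python
import numpy as np

def blend_cell(canvas, cell, top, left, alpha):
    """Convex combination at the pasting site; `alpha` weights the canvas."""
    h, w = cell.shape[:2]
    out = canvas.astype(np.float64)  # astype returns a copy
    site = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * site + (1.0 - alpha) * cell
    return out

rng = np.random.default_rng(0)
canvas = np.full((32, 32), 100.0)  # toy grayscale canvas
cell = np.full((8, 8), 200.0)      # toy grayscale cell
alpha = float(rng.uniform(0.0, 1.0))  # sampled uniformly in [0, 1]
blended = blend_cell(canvas, cell, top=4, left=4, alpha=alpha)

same_as_paste = blend_cell(canvas, cell, 4, 4, alpha=0.0)     # cell fully visible
cell_invisible = blend_cell(canvas, cell, 4, 4, alpha=1.0)    # canvas unchanged
quarter = blend_cell(canvas, cell, 4, 4, alpha=0.25)          # 0.25*100 + 0.75*200
```

The two extreme values of `alpha` recover plain pasting and an untouched canvas, respectively, which is the trade-off discussed for this strategy.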
poisson: The main pitfall of the blend strategy is that it can only conceal the boundaries of the pasting site by concealing the cell, which is undesirable. Poisson blending Pérez et al. (2003) was precisely proposed to mitigate that issue. Indeed, the blending operation is formulated as an optimization problem, which aims at computing the values of the pixels in the pasting site to preserve the gradients of the source/cell image while matching the pixel intensities of the target/canvas image at the boundaries.
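To make the optimization view concrete, the sketch below solves the discrete Poisson equation on a grayscale pasting site with plain Jacobi iterations. It is a simplified illustration under stated assumptions (single channel, rectangular site, fixed iteration count, outermost ring of the site used as boundary) and not the implementation used for the experiments. Libraries such as OpenCV expose Poisson blending through `cv2.seamlessClone`; the solver here only avoids that dependency.

```python
import numpy as np

def poisson_paste(canvas, cell, top, left, n_iter=2000):
    """Grayscale Poisson blending of `cell` into `canvas` via Jacobi iterations.

    Interior pixels of the pasting site are solved so that their discrete
    Laplacian matches that of the cell, while the outermost ring of the site
    keeps the canvas values (Dirichlet boundary condition).
    """
    h, w = cell.shape
    out = canvas.astype(np.float64)
    src = cell.astype(np.float64)
    # Negative discrete Laplacian of the source at interior pixels.
    lap = (4 * src[1:-1, 1:-1] - src[:-2, 1:-1] - src[2:, 1:-1]
           - src[1:-1, :-2] - src[1:-1, 2:])
    site = out[top:top + h, left:left + w]  # view: updates write into `out`
    for _ in range(n_iter):
        neighbors = (site[:-2, 1:-1] + site[2:, 1:-1]
                     + site[1:-1, :-2] + site[1:-1, 2:])
        site[1:-1, 1:-1] = (neighbors + lap) / 4.0
    return out

canvas = np.full((32, 32), 100.0)  # toy uniform canvas
cell = np.zeros((10, 10))
cell[4:6, 4:6] = 80.0              # toy cell: a bright blob on a dark background
blended = poisson_paste(canvas, cell, top=5, left=7)
```

In this toy case the cell background is shifted to the canvas intensity while the blob keeps its contrast relative to that background, so no visible seam appears at the border of the pasting site.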
We train a linear classifier on top of the pre-trained models with different pasting operations. In this experiment, 1000 unlabeled tiles are used as canvases for each class (negative/positive), and labeled tiles are obtained online by pasting a randomly selected labeled cell upon one of the canvases. Note that positive cells are pasted upon unlabeled tiles from positive slides and, conversely, negative cells upon unlabeled tiles from negative slides. After training, the classifier is evaluated on the in-house labeled tiles. We report the class-wise $F_1$ score averaged over 4 independent runs per pre-trained weights. The scores are further averaged over the pre-training splits (see Sec. 4.1). For this experiment (and the ones that follow), we only use models pre-trained in a self-supervised manner as they have been shown to be on par with or better than their supervised counterparts.
Table 6 shows that the blend approach yields worse results compared to paste. We postulate that this is a consequence of $\alpha_{\text{paste}}$ either being too low and the resulting images not looking more natural than the ones produced with paste, or it being too high and the pasted content being barely visible. On the contrary, we observe that the poisson technique performs similarly to paste for all backbone/dataset combinations, except for the ViT-S/16 + Sipakmed scenario, in which case it is the only pasting technique that yields decent results for the classification of negative tiles.

Pasting probability: So far, we have applied the pasting operation in a perfectly symmetric manner, i.e., it is systematically applied independently of the cell's label and that of the slide from which the canvas is extracted. Nonetheless, our setting is inherently asymmetric: on one side, we know with certainty that tiles extracted from negative slides are all negatives; on the other side, little can be said with regard to the label of tiles extracted from positive slides. Furthermore, by systematically using C³P, we are encouraging the model to only attend to the pasting site, which is undesirable. We propose to exploit the asymmetry of the setting and not systematically use C³P on negative tiles. This further allows the model to learn from real negative examples without the risk of feeding mislabeled samples to the model. Therefore, we replicate the experiment of Table 6, but this time, C³P-poisson is applied on the unlabeled tiles from negative slides with a given probability (see Tab. 7). Although never applying C³P to negative tiles is the scenario in which the model processes the most realistic samples, we observe in Table 7 that it can be harmful. This observation is unsurprising, considering that in that setting, the positive label is perfectly correlated with the pasting operation. It is in fact surprising that it does not perform even worse. We argue that this is in part due to the ability of C³P-poisson to fool the model. On the contrary, the models trained using C³P with a probability of 0.5 seem to perform favorably compared to the ones using it systematically. Note that in that setting, the positive label is still correlated with the action of pasting.

Table 7. Ablation experiments for pasting probability. A classifier is trained on cells from Herlev or Sipakmed with various probabilities of applying C³P-poisson on negative (-) and positive (+) tiles. The classifier is then evaluated on the in-house labeled tiles. We report the class-wise and weighted $F_1$ scores. The highest mean score for a given source dataset, class, and backbone is in bold. The selected pasting method is highlighted.

How many canvases are required?: To answer this question, we repeat the experiment of Table 6, with a 0.5 probability of applying C³P-poisson and a varying number of canvases per class.
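The asymmetric use of the pasting operation discussed under "Pasting probability" can be sketched as a small sampling routine. Everything below is a hypothetical illustration: strings stand in for images, `paste_fn` stands in for any pasting routine, and the function and argument names are not taken from the paper.

```python
import random

def sample_training_tile(label, canvases, cells, p_paste_negative, rng, paste_fn):
    """Return (tile, label, pasted) for one training example.

    canvases[label]: unlabeled tiles from negative (0) or positive (1) slides.
    cells[label]:    labeled single-cell images of the matching class.
    """
    canvas = rng.choice(canvases[label])
    # Positives are always pasted; negatives only with probability p_paste_negative.
    if label == 1 or rng.random() < p_paste_negative:
        return paste_fn(canvas, rng.choice(cells[label])), label, True
    return canvas, label, False  # a real, untouched negative tile

# Strings stand in for images; paste_fn stands in for a pasting routine.
canvases = {0: ["neg_tile_a", "neg_tile_b"], 1: ["pos_tile_a", "pos_tile_b"]}
cells = {0: ["neg_cell"], 1: ["pos_cell"]}
paste_fn = lambda canvas, cell: f"{canvas}+{cell}"

rng = random.Random(0)
negatives = [sample_training_tile(0, canvases, cells, 0.5, rng, paste_fn)
             for _ in range(1000)]
positives = [sample_training_tile(1, canvases, cells, 0.5, rng, paste_fn)
             for _ in range(1000)]
pasted_negative_fraction = sum(p for _, _, p in negatives) / len(negatives)
```

Setting `p_paste_negative` to 0 reproduces the scenario in which the positive label is perfectly correlated with pasting, and setting it to 1 reproduces the symmetric scenario.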

Figure 5 depicts the class-wise $F_1$ scores for each available backbone/dataset combination. It appears clear that, up until ≈ 2000 canvases, increasing the number of canvases favorably impacts the classifier's performance. After that point, the model tends to overfit the pasted cells, which occur more often and independently of the canvases, which translates to a decreased downstream performance.

C³P results: Our extended experiments reveal that C³P offers a well-grounded augmentation strategy to bridge the gap between publicly available single-cell and unlabeled tiles datasets. We further show that the proposed augmentation yields significant improvement compared to the approach of naively transferring from a classifier trained on single-cell datasets. In Table 8, we also show that our approach outperforms the naive transferring methods by a large margin with a classifier trained with C³P-poisson, a pasting probability of 0.5, and the optimal number of canvases (see Fig. 5).

Table 8. Evaluation results of the cells-pasting augmentation method with transfer learning from Herlev or Sipakmed to our in-house tiles dataset. A classifier is trained on the cells dataset without and with C³P-poisson. We report the class-wise and weighted $F_1$ scores. The highest mean score for a given backbone, class, and source dataset is highlighted in bold.

Aligning MIL to cytology images
In Sections 4.1 and 4.2, we showcase the benefits of self-supervised learning for cytology diagnostics and propose an augmentation strategy, C³P, to make the most out of publicly available single-cell datasets. Combined, these offer the opportunity to design Pap smear WSI classification modules requiring few labels. More precisely, we harness the power of self-supervised learning and our augmentation strategy C³P to better align well-established MIL methods for Pap smear WSI classification.

Problem formulation. As a primer, we briefly revisit the underlying concepts and assumptions of the multiple instance learning framework. In a binary MIL setting, the objective is to correctly predict the label $Y \in \{0, 1\}$ of an input bag of instances $X = \{x_1, \dots, x_N\}$, where $N$ is allowed to vary from one bag to the other. The instance-level labels $\{y_n\}_{n=1}^{N} \in \{0, 1\}$ are assumed to exist but to be unknown throughout the training phase. As such, the MIL objective can be formulated as the detection of positive instances ($y_n = 1$) within the bags, i.e.:

$$Y = \max_{n \in \{1, \dots, N\}} y_n \qquad (2)$$

As pointed out in AbMIL Ilse et al. (2018), the above bag labeling function is permutation invariant w.r.t. the instance labels, hence so must be the predictions $\hat{Y} = S(X)$, where $S$ is the bag scoring function. In the context of cytology, WSIs are the bags and their constituent tiles are the instances. One can observe that the permutation invariance assumption is particularly well-grounded in that setting. Indeed, the overall diagnosis is based on the presence of abnormal cells within the entire slide rather than the specific arrangement or order of those cells. Furthermore, as a consequence of the slide preparation, the arrangement of the cells on the slides exhibits little to no ordering or positional dependency.
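The standard binary MIL assumption (a bag is positive as soon as one instance is positive) and its permutation invariance can be checked with a tiny example. The function name and the toy slide are illustrative.

```python
from itertools import permutations

def bag_label(instance_labels):
    """A bag is positive as soon as one of its instances is positive."""
    return max(instance_labels)

slide = [0, 0, 1, 0]  # toy slide: one positive tile among four
labels_under_reordering = {bag_label(list(p)) for p in permutations(slide)}
negative_slide_label = bag_label([0, 0, 0])
```

`labels_under_reordering` contains a single value, since reordering the tiles of a slide never changes its label.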
In most MIL methods, the slide-level representation $z$ is obtained as a weighted sum/convex combination of the $N$ instance-level representations $H = \{h_1, \dots, h_N\}$:

$$z = \sum_{n=1}^{N} a_n h_n \qquad (3)$$

where $a_n$ is a scalar that modulates the contribution of the $n$-th instance to the overall representation. The slide's score is obtained by feeding the slide-level representation to a classifier $f$:

$$\hat{Y} = f(z) \qquad (4)$$

As the instance-level representations $h_n$ and the slide-level representation $z$ span the same space, we argue that instance-level predictions can be obtained with the same classifier:

$$\hat{y}_n = f(h_n) \qquad (5)$$

For each setting, we report the average and standard deviation of the slide-level and instance-level AUC scores. The instance-level score is computed using our in-house positive tiles and randomly sampled tiles from negative slides (both extracted from the test tiles).
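The weighted-sum pooling and the reuse of one classifier at both the slide and the instance level can be sketched as follows. How the weights are produced differs between MIL methods; here they are assumed to come from a softmax over arbitrary per-instance scores, and the logistic classifier, random features, and all names are illustrative stand-ins.

```python
import numpy as np

def attention_pool(H, scores):
    """Softmax the per-instance scores into convex weights and pool the bag."""
    a = np.exp(scores - scores.max())
    a = a / a.sum()
    return a @ H, a

def classifier(x, w, b):
    """Logistic classifier usable on slide-level and instance-level vectors."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 16))   # 100 tile representations of dimension 16
scores = rng.normal(size=100)    # assumption: unnormalized attention scores
w, b = rng.normal(size=16), 0.0  # toy classifier parameters

z, a = attention_pool(H, scores)     # slide-level representation
slide_score = classifier(z, w, b)    # slide-level prediction
tile_scores = classifier(H, w, b)    # instance-level predictions, same classifier

# Permutation invariance: shuffling the tiles leaves z unchanged.
perm = rng.permutation(100)
z_shuffled, _ = attention_pool(H[perm], scores[perm])
```

Since `z` is a convex combination of the rows of `H`, it lies in the same space as the tile representations, which is what makes applying the same classifier to individual tiles meaningful.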
Results discussion for MIL-based methods. In the first scenario, we experiment with the MIL methods using their default implementations. It can be observed in Table 9 that this setting is suboptimal for all MIL method/backbone combinations. This is intriguing, considering that the backbone demonstrated strong performance (see Tab. 3) at the tile level and that the chosen MIL methods are well-established baselines. We argue that this is a consequence of the particularity of Pap smear test images. Indeed, features correlated with negativity are present almost everywhere, even in positive tiles, and features correlated with positivity are scarce. Together, this makes for a particularly challenging setting to capture the positive signal in the slide-level representation $z$.

Top-k selection:
To mitigate the aforementioned issue, we propose processing only the top-k most suspicious tiles in each slide, using the same backbone and classifier as for the slide-level predictions. Notably, since the backbone is frozen, the tiles' representations can be pre-computed, which makes identifying the top-k tiles computationally inexpensive. In all following experiments, we use $k = 8$ top-k tiles and a batch_size = 16.
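The selection step can be sketched as follows, assuming the tile positivity scores have already been computed from the frozen backbone and the shared classifier. The helper name `top_k_indices` and the example scores are hypothetical.

```python
import heapq


def top_k_indices(tile_scores, k=8):
    """Indices of the k highest-scoring tiles, most suspicious first."""
    k = min(k, len(tile_scores))  # a slide may contain fewer than k tiles
    return heapq.nlargest(k, range(len(tile_scores)), key=tile_scores.__getitem__)


# Hypothetical positivity scores for the 12 tiles of one slide.
scores = [0.05, 0.91, 0.12, 0.40, 0.88, 0.03, 0.67, 0.10, 0.55, 0.02, 0.30, 0.76]
selected = top_k_indices(scores, k=8)
print(selected)
```

Only the representations of the selected tiles would then be passed to the MIL module, so the cost of the selection is a single pass over pre-computed scores.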
Table 9 shows that adding a top-k module yields significant improvements for all MIL methods except TransMIL. When the top-k module benefits the slide-level predictions, it also benefits the tile-level predictions. This is unsurprising, as the slide-level representation $z$ is most likely closer to that of individual tiles when it results from a weighted sum over 8 tile representations than over the entire bag (see Eq. (3)). Along the same lines, the poor tile-level performance of TransMIL is a potential explanation for its ineffectiveness at the slide level. Indeed, if the model cannot detect the positive tiles, the overall representation does not reflect the nature of the slide well.

Tile-level objective:
To improve the ability of the model to identify suspicious tiles, we propose to integrate a tile-level loss into the overall training objective:

$$\mathcal{L} = \mathcal{L}_{\text{slide}} + \lambda_{\text{tile}} \, \mathcal{L}_{\text{tile}}, \tag{6}$$

where $\mathcal{L}_{\text{slide}}$ and $\mathcal{L}_{\text{tile}}$ are the slide-level and tile-level losses and $\lambda_{\text{tile}}$ denotes the tile loss coefficient. We use a batch_size of 8 for the tiles, and the optimal value of $\lambda_{\text{tile}}$ is determined independently for each backbone/method combination. Moreover, since we aim for a method using only slide-level labels, we explore the possibility of benefiting from C 3 P. As depicted in Figure 2, hard negative and confident positive tiles are collected throughout training and used as canvases onto which negative and positive cells are pasted, respectively. We refer to hard negatives/confident positives as the 10 tiles having the highest positivity score in each negative/positive slide, respectively. The method used for pasting is C 3 P-poisson, and we rely on cells from both Herlev and Sipakmed.
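The combined objective, a slide-level loss plus a weighted tile-level loss, can be sketched as below. The use of binary cross-entropy for both terms, the averaging over tiles, and the name `lambda_tile` are assumptions made for illustration; the text does not specify the exact loss functions.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for a single prediction (assumed loss)."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def total_loss(slide_prob, slide_label, tile_probs, tile_labels, lambda_tile):
    """Slide-level loss plus lambda_tile times the mean tile-level loss."""
    slide_term = bce(slide_prob, slide_label)
    tile_term = sum(bce(p, y) for p, y in zip(tile_probs, tile_labels)) / len(tile_probs)
    return slide_term + lambda_tile * tile_term


# One positive slide and four pasted tiles whose labels come from the pasted cells.
tile_probs = [0.9, 0.2, 0.7, 0.1]
tile_labels = [1, 0, 1, 0]
loss_without_tiles = total_loss(0.8, 1, tile_probs, tile_labels, lambda_tile=0.0)
loss_with_tiles = total_loss(0.8, 1, tile_probs, tile_labels, lambda_tile=0.5)
```

Setting the coefficient to zero recovers the slide-only objective, which makes the coefficient a convenient knob to tune per backbone/method combination.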
As shown in Table 9, incorporating a localized objective alongside C 3 P yields significant improvements in the tile-level predictions of TransMIL. Consequently, this facilitates the detection of suspicious tiles and ultimately improves the slide-level predictions. It is worth noting that C 3 P is not exclusively advantageous for TransMIL, as it proves beneficial at the slide level in all scenarios except ViT-S/16 + AbMIL. We posit that a tile-level loss is implicitly enforced when employing a top-k selection approach. In other words, the representation of the top-k selected tiles is encouraged to be aligned with the slide label. Hence, when the top-k selection is accurate, C 3 P becomes less relevant as "real" labeled tiles are available. Nonetheless, this scenario seldom occurs, especially when the backbone is a ResNet-50, whose features have 2048 dimensions (> 5× more than for a ViT-S/16), which increases the number of parameters of the MIL module. We argue that this explains why settings relying on a ResNet-50 backbone tend to benefit more from C 3 P.
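The effect of the feature dimension on the size of the MIL module can be made concrete with a small calculation. The two-layer attention scorer and its hidden size of 128 are assumptions for illustration only; the 384-dimensional ViT-S/16 embedding is the standard size for that architecture.

```python
def attention_head_params(feature_dim, hidden_dim=128):
    """Parameter count of a hypothetical attention scorer:
    Linear(feature_dim, hidden_dim) -> tanh -> Linear(hidden_dim, 1)."""
    first_layer = feature_dim * hidden_dim + hidden_dim  # weights + biases
    second_layer = hidden_dim * 1 + 1
    return first_layer + second_layer


resnet50_dim = 2048  # ResNet-50 feature dimension
vit_s16_dim = 384    # ViT-S/16 embedding dimension

ratio = resnet50_dim / vit_s16_dim
resnet_params = attention_head_params(resnet50_dim)
vit_params = attention_head_params(vit_s16_dim)
print(round(ratio, 2), resnet_params, vit_params)
```

Under these assumptions the input layer dominates the count, so the number of parameters grows almost proportionally to the feature dimension.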

Conclusion
This paper discussed the potential of deep learning-based telecytology to effectively reduce cervical cancer-related mortality in low-and-middle-income countries by mitigating the need for trained cytopathologists on the triage site. To support this objective, we have collected a medium-sized dataset of Pap smear test images using the SurePath™ preparation, which exists in a manual and low-cost version, and digitized the slides using a Grundium Ocus® 40 scanner, which fulfills our requirements of transportability and affordability.
Our experimental findings highlight the successful application of self-supervised learning to reduce the annotation burden, with the resulting representations outperforming off-the-shelf pre-trained models across various downstream tasks. Additionally, we have introduced C 3 P, an augmentation strategy that effectively transfers knowledge from open-source single-cell datasets to unlabeled tiles. C 3 P proves beneficial not only for tile-level classification but also for slide-level classification. Regarding WSI classification, our experiments reveal that MIL methods may overlook crucial characteristics of Pap smear images, which can be accounted for with simple modifications. Overall, classifying Pap smear WSIs relying solely on slide-level labels remains challenging, particularly in our scenario where all samples are from HPV-positive women, which adds a layer of complexity.

Limitations. Our experiments are conducted on only one self-supervised learning method, namely DINO Caron et al. (2021), chosen for its strong performance on the k-NN evaluation benchmark and its compatibility with various backbones. We argue that the main reason why SSL methods could be inadequate for cytology images is that the objective might enforce consistency between semantically unrelated views. Nonetheless, this potential pitfall results from the spatial cropping strategy, which is common to most self-distillation and contrastive methods. Our conclusions based on DINO are therefore likely to apply to other methods as well. Larger vision transformer backbones, e.g., ViT-B/16, would also be worth investigating, yet lighter architectures, such as ViT-S/16 and ResNet-50, remain better suited to the low-data regime.

Figure 1: Overview of the in-house dataset. a) The WSIs are labeled as positive or negative. b) 1228 cell-level annotations are used to extract isolated positive cells and 320 × 320 pixel tiles, both at 20× magnification. c) The tiles of each WSI are extracted on a regularly spaced grid, yielding ∼1.5M unlabeled tiles (320 × 320 pixels at 20× magnification) distributed over 307 slide bags.

Figure 2: Overview of the proposed MIL-based method for classifying Pap smear WSIs. a) A positivity score is obtained independently for each tile of the input WSI, and the embeddings of the tiles having the top-k highest scores are extracted. b) The top-k embeddings attend to one another to produce the slide-level representation, whose positivity score is obtained using the same classifier as for the independent tile predictions. c) The tiles corresponding to the top-k scores are stored in the confident positives or hard negatives queue, depending on the slide-level label. d) Positive and negative cells are pasted onto randomly sampled confident positives and hard negatives, respectively. e) A score for each pasted tile is obtained using the same backbone and classifier. The model is jointly trained to correctly classify WSIs and pasted tiles.

Figure 3: Visualization of the different pasting approaches on randomly sampled tiles from the in-house dataset, with randomly sampled cells pasted from both the Herlev and Sipakmed datasets.

Figure 5: Box plots depicting the class-wise $F_1$ scores against the number of unlabeled tiles used as canvases for the pasting augmentation. The performance achieved without the proposed augmentation can be observed at the zero of the x-axes.

Table 1
Cell-level classification results on Herlev. We report the class-wise and weighted $F_1$ scores of a k-NN classifier. The features are extracted by a ViT-S/16 or a ResNet-50 pre-trained either with supervision on ImageNet or in a self-supervised manner on our in-house unlabeled tiles dataset using DINO. The highest mean score for a given class and backbone is highlighted in bold.

Table 2
Cell-level classification results on Sipakmed. We report the class-wise and weighted $F_1$ scores of a k-NN classifier. The features are extracted by a ViT-S/16 or a ResNet-50 pre-trained either with supervision on ImageNet or in a self-supervised manner on our in-house unlabeled tiles dataset using DINO. The highest mean score for a given class and backbone is highlighted in bold.

Table 3
Tile-level evaluation results of the frozen models on the in-house set of tiles. A k-NN classifier is fitted on 75% of the samples and evaluated on the remaining 25%. We report the class-wise and weighted $F_1$ scores of 4 independent runs. The features are extracted by a ViT-S/16 or a ResNet-50 pre-trained either with supervision on ImageNet or in a self-supervised manner on our internal dataset using DINO. The highest mean score for a given class and backbone is highlighted in bold.
(Table values are not recoverable from the source; only the parenthesized differences +13.7, +16.0, +14.8, +18.8, +24.3, and +21.6 survive.)

Cells to tiles transfer learning. We first evaluate how well a classifier trained on open-source cell-level datasets performs on tile-level classification at test time. To that end, a k-NN classifier is fitted on the Herlev or Sipakmed datasets using only binary labels, i.e., negative or positive, and subsequently evaluated on our in-house set of labeled tiles. For each pre-trained model, we report the class-wise $F_1$ score averaged over 4 independent runs, each of which uses only 75% of the training set. When the model is pre-trained in a self-supervised manner, the scores are further averaged over the pre-training splits (see Sec. 4.1). The number of neighbors $k$ is selected to maximize the $F_1$ score of the positive class.
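The k-NN transfer protocol, fitting on source cell features and choosing the number of neighbors by the positive-class F1 score on target tiles, can be sketched in plain Python. The 2-D toy features, the candidate values of k, and the choice of the set on which k is selected are assumptions; real features would come from the frozen backbone.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def f1_positive(y_true, y_pred):
    """F1 score of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_k(train_x, train_y, eval_x, eval_y, candidates):
    """Return the candidate k with the highest positive-class F1, and that score."""
    scored = []
    for k in candidates:
        preds = [knn_predict(train_x, train_y, q, k) for q in eval_x]
        scored.append((f1_positive(eval_y, preds), k))
    best_f1, best_k = max(scored, key=lambda item: item[0])
    return best_k, best_f1


# Toy "cell" features (source) and "tile" features (target); odd k avoids vote ties.
cells_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
cells_y = [0, 0, 0, 1, 1, 1]
tiles_x = [[0.2, 0.1], [5.2, 5.1], [0.5, 0.5], [5.5, 5.5]]
tiles_y = [0, 1, 0, 1]
candidates = (1, 3, 5)
best_k, best_f1 = select_k(cells_x, cells_y, tiles_x, tiles_y, candidates)
```

Selecting k on the positive-class score rather than on accuracy reflects the emphasis on detecting abnormal tiles, which are the minority class.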

Table 4
Transfer learning results from Herlev and Sipakmed to our in-house labeled tiles dataset. A k-NN classifier is fitted on the binary version of the Herlev (H) and Sipakmed (S) datasets and is evaluated on the in-house set of labeled tiles. The features are extracted by a ViT-S/16 or a ResNet-50 pre-trained either with supervision on ImageNet-1k or in a self-supervised manner on our in-house unlabeled tiles dataset using DINO. The highest mean score for a given source dataset and backbone is highlighted in bold.
Ablation results on pasting method. A classifier is trained on cells from Herlev or Sipakmed with C 3 P and various pasting techniques and subsequently evaluated on the in-house labeled tiles. We report the class-wise and weighted $F_1$ scores. The highest mean score for a given source dataset, class, and backbone is in bold. The selected pasting technique is highlighted.

Table 9
Evaluation of the MIL-based methods before and after adding our augmentation C 3 P and top-k selection strategy for Pap smear test WSI classification on our in-house dataset.
Considering the availability of 4 pre-training weights per backbone (see Sec. 4.1), the first one is used to determine the best hyperparameters, and the three remaining ones are reserved for evaluation purposes.
Caron et al. (2021) implementations. We remove the positional encoding from TransMIL as it brings little information in our setting, as discussed above. Each MIL method is tested with both types of backbones (ViT-S/16 and ResNet-50), which are initialized with the weights obtained from DINO's pre-training Caron et al. (2021). In all experiments, the weights of the backbone are kept frozen.