Nuclei-level prior knowledge constrained multiple instance learning for breast histopathology whole slide image classification

Summary New breast cancer cases have surpassed lung cancer cases, making breast cancer the world's most prevalent cancer. Despite advances in medical image analysis, deep learning's lack of interpretability limits its adoption among pathologists. Hence, a nuclei-level prior knowledge constrained multiple instance learning (NPKC-MIL) method for breast whole slide image (WSI) classification is proposed. NPKC-MIL involves three steps. First, transfer learning is used to extract patch-level features, which are aggregated into slide-level features through attention pooling. Second, the extracted nuclei are abstracted as nodes, the nucleus topology is established with the K-nearest neighbors (K-NN) algorithm, and handcrafted features are created for the nodes. Finally, patch-level deep learning features are combined with nuclei-level handcrafted features to fine-tune the classification results generated from slide-level deep learning features. The experimental results demonstrate that NPKC-MIL outperforms comparable current deep learning models. NPKC-MIL expands the analytical dimension of WSI classification tasks and integrates prior knowledge into deep learning models to improve interpretability.


INTRODUCTION
In recent years, the incidence of new breast cancer cases has surged rapidly, surpassing lung cancer to become the world's most prevalent cancer type. 1 With the advancement of histological section digitization technology, whole slide images (WSIs) offer the advantage of presenting tissue morphology at exceptionally high resolution. 2 This extremely high resolution also makes it possible to classify breast cancer, 3 segment lesion areas, 4 analyze the microenvironment, 5 and assess prognosis. 6
The conventional classification process for pathological images adheres to the typical image classification steps: target segmentation, feature extraction, feature selection, and disease diagnosis. These four steps are carried out independently and then integrated for tuning. However, errors are inevitable in each step, and their accumulation can result in unreliable classification outcomes. Additionally, most traditional machine learning algorithms require loading all data into memory at once, which is not feasible for gigapixel WSIs. 7
The classification process based on deep learning is end to end. 8 Thanks to the weight-sharing strategy 9 and the rapid advancement in computing power, 10 deep learning methods have made significant strides in the field of computer vision, propelling advancements in medical image analysis. 11 Given the gigantic size of a WSI, such as 100,000 × 100,000 pixels, the analysis scale of WSIs progresses through two stages, from patch level to slide level.
4][15] Although processing in blocks reduces the algorithm's demands on computer hardware, it disrupts the integrity of the tissue and compromises spatial information. The analysis of the entire WSI should fully consider spatial context information rather than being based on arbitrary patches. 16][19] At present, improvements to MIL-based deep learning algorithms focus on how to model the spatial relationship of patches to better aggregate patch-level features into slide-level features, while ignoring the influence of nuclei-level features on the final classification results. The distribution and morphology of nuclei are an essential basis for cancer diagnosis. 20 Although deep neural networks can theoretically approximate any function, 21 they are still not widely accepted by pathologists due to their lack of interpretability. 22 To fully leverage the capacity of deep learning methods in model fitting, while also incorporating handcrafted features with clear interpretations to enhance the interpretability of the model, this paper investigates a way of classifying WSIs constrained by nuclei-level handcrafted features.
The rest of the paper is arranged as follows: A brief review of some related work is presented below. Detailed ablation experiments and comparison studies are presented in the results section, followed by the discussion and conclusion part in the discussion section. Finally, a detailed overview of the methodology is provided in the STAR methods.
The combination of extremely large image sizes and the high expertise threshold in pathology makes it prohibitively expensive to acquire pixel-level WSI labeling data. However, acquiring slide-level labels is comparatively more accessible. The scenario in which 'the class of the entire WSI is known, while the class of all patches that make up the WSI is unknown' aligns precisely with the application of multiple instance learning (MIL). In MIL, the training set comprises bags with known class labels, where each bag includes instances with unknown labels. A bag is labeled as positive if it contains at least one positive instance, and if all instances in a bag are negative, the bag is marked as negative. 23 Hence, an increasing number of researchers have approached digital pathological image classification as a weakly supervised classification problem based on the MIL algorithm. Within MIL-based deep learning algorithms, one of the most intuitive ideas is to aggregate patch-level features extracted by a convolutional neural network (CNN) into slide-level features through pooling operations. 24 Ilse et al. 25 take the fundamental theorem of symmetric functions and the permutation equivariance of neural network layers 26 as the principle for network design. They propose an attention-based deep MIL algorithm, utilizing the attention mechanism to assign more flexible weights to patches. Li et al. 27 incorporated a non-local attention mechanism to tackle the MIL problem. By evaluating the similarity between the patch with the highest score and the other patches, distinct attention weights were assigned to each patch, thereby optimizing the aggregation of patch-level features into comprehensive slide-level features. ][30] Nevertheless, the aforementioned studies are contingent upon the assumption that all instances in each bag are independent and identically distributed. In reality, instances often exhibit correlations, 31 necessitating the modeling of inter-instance relationships before applying MIL. Shao et al. 32 challenge the assumption of independent distribution, proposing a scoring function based on Hausdorff distance to assess correlations between instances. They leverage the Transformer framework 33 to incorporate the spatial topological relationships among similar patches. Philip et al. 34 select the c instances with the highest scores using the instance classifier. Subsequently, they calibrate the distribution of instances by subtracting the features of these top-scoring c samples from the features of all instances in the bag. However, operating under the assumption that 'cancerous WSIs have more information than normal WSIs,' their model tends to prioritize features of cancerous WSIs, thereby attenuating the recognition of features in normal WSIs.
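The standard MIL labeling rule described above (a bag is positive if at least one instance is positive) can be written in a few lines. This is only an illustrative sketch: the function name `bag_label` and the toy slides are hypothetical, and in practice the instance labels are unknown, which is exactly why the problem is weakly supervised.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Return 1 if any instance in the bag is positive, otherwise 0."""
    return int(any(label == 1 for label in instance_labels))


# Toy bags: each entry stands for one patch of a slide (labels are illustrative).
normal_slide = [0, 0, 0, 0]
cancerous_slide = [0, 0, 1, 0]

print(bag_label(normal_slide), bag_label(cancerous_slide))
```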
As a bag contains numerous instances (tens of thousands), refining the instances is necessary to reduce the number involved in calculations. Xie et al. 35 adopted a feature clustering method to divide the instances in the bag into e categories, and then fused the features of the instances at the e cluster centroids as the representation of the bag to make the final prediction. Lu et al. 36 sorted the instances by attention score, and then took the c instances with the highest and lowest attention scores as inputs to train a patch-level classifier that constrains the slide-level diagnosis results, achieving better results than diagnosis based on slide-level features alone.
In addition, to enrich the diversity of training samples, it is necessary to explore the impact of different data enhancement methods on classification accuracy. Yang et al. 37 proposed a "mix-the-bag" data enhancement method, which adds, replaces, and interpolates instances among bags of the same category to generate new virtual bags. Zhang et al. 38 randomly select instances from the parent bag to form sub-bags and assign the label of the parent bag to the sub-bags. However, sub-bags derived from a positive parent bag may not necessarily contain positive instances, leading to the introduction of sub-bags with incorrect labels. To address this issue, they designed a feature distillation branch to eliminate errors.
To fully harness the potential of deep learning methods in model fitting and incorporate handcrafted features for building interpretable deep learning models, this paper aims to explore a method for classifying WSI constrained by nuclear-level handcrafted features.

Construction of the NPKC-MIL
The construction process of nuclei-level prior knowledge constrained multiple instance learning (NPKC-MIL) can be summarized as follows: Initially, patch-level features are extracted from the WSI and sorted based on the importance of patches. Subsequently, nucleus features (treating extracted nuclei as graph nodes) are aggregated and updated using the Graph Convolutional Network (GCN). The losses are then calculated at the slide level, patch level, and nuclei level. Finally, patch-level and nuclei-level losses are considered as penalty items and added to the total loss function for model training, as depicted in Figure 1.
Figure 1 outlines the four steps involved in constructing NPKC-MIL: First, utilizing the transfer learning method to extract patch-level features (Figure 1A) and rank the importance of patches using an attention mechanism (Figure 1B). Second, aggregating patch-level features into slide-level features through attention pooling and calculating patch-level loss $L_{patch}$ (Figure 1C) and slide-level loss $L_{slide}$ (Figure 1D) separately. Third, abstracting the extracted nuclei as nodes, constructing the node topology using the K-NN algorithm, extracting nuclei-level handcrafted features, aggregating and updating node information with GCN, and calculating nuclei-level loss $L_{nuclei}$ (Figure 1E). Fourth, considering $L_{patch}$ and $L_{nuclei}$ as penalty terms, they are added to the total loss function for model training (Figure 1F).
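The attention ranking and attention pooling steps can be illustrated with a small sketch. It assumes that each patch already has a raw attention score (in the paper these come from the attention network; here they are plain numbers), normalizes the scores with a softmax, forms the slide-level vector as the weighted sum of patch vectors, and picks the top-c patches (the methods report c = 8). The function names are hypothetical and this is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    patch_features: Sequence[Sequence[float]], raw_scores: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Aggregate patch vectors into one slide vector: f_slide = sum_i s_i * p_i."""
    weights = softmax(raw_scores)
    dim = len(patch_features[0])
    slide = [
        sum(w * feat[d] for w, feat in zip(weights, patch_features))
        for d in range(dim)
    ]
    return slide, weights


def top_c_indices(weights: Sequence[float], c: int = 8) -> List[int]:
    """Indices of the c patches with the highest attention weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:c]
```

With equal raw scores the pooled vector is simply the mean of the patch vectors; as one score grows, the slide-level vector moves toward that patch.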

Nuclei-level feature design
There exists a perceptual understanding that 'cancerous nuclei are larger, rounder, and messier than normal nuclei' in the clinical diagnosis process. To quantitatively articulate these differences and aid in handcrafted feature design, we conducted five randomized trials on selected experimental datasets. Each experiment involved the collection of 100,000 nuclei from both cancerous WSIs and normal WSIs for statistical analysis.
Figure 2 illustrates the scatter density of normal and cancerous nuclei. A comparison of the high-density region in Figure 2 distinctly reveals that cancerous nuclei exhibit larger areas and a rounder shape. The elliptical characteristic in Figure 3 suggests that cancerous nuclei are rounder. Additionally, the quantitative analysis of characteristic parameters such as area and equivalent diameter ellipticity in Figure 4 indicates that cancerous nuclei are larger.

Geometric features
Hence, we extracted a set of 9-dimensional geometric features from the nucleus. These features encompass major axis length (Major), minor axis length (Minor), area (Area), direction θ (the angle between the major axis and the horizontal direction), eccentricity (Ecc), ellipticity (Ell), equivalent diameter (Dia, the diameter of a circle with an area equal to the nucleus area), perimeter (Per), and convex hull area (AreaHull). Figure 5 shows some of these geometric features.
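A few of these geometric descriptors can be computed directly from a nucleus contour. The sketch below assumes the contour is given as an ordered list of (x, y) vertices and uses textbook definitions: the shoelace formula for area, the equivalent diameter as defined above, and the eccentricity of the fitted ellipse, sqrt(1 - (Minor/Major)^2). The paper does not spell out its formulas, so these definitions and all function names are assumptions; ellipticity is omitted because its definition is not given.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(points: List[Point]) -> float:
    """Area enclosed by an ordered contour (shoelace formula)."""
    twice_area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(points: List[Point]) -> float:
    """Sum of edge lengths around the closed contour."""
    return sum(
        math.dist(points[i], points[(i + 1) % len(points)])
        for i in range(len(points))
    )


def equivalent_diameter(area: float) -> float:
    """Diameter of the circle whose area equals the nucleus area."""
    return math.sqrt(4.0 * area / math.pi)


def eccentricity(major: float, minor: float) -> float:
    """Eccentricity of an ellipse with the given axis lengths (0 for a circle)."""
    return math.sqrt(1.0 - (minor / major) ** 2)
```

A rounder nucleus has Minor close to Major, so its eccentricity approaches 0, which matches the observation that cancerous nuclei are rounder.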

Texture features
Based on our understanding, cancerous nuclei exhibit a complex texture, whereas normal nuclei appear smooth. This observation is further supported by Figure 6, which illustrates the complexity of cancerous nuclei texture in contrast to the smooth texture of normal nuclei.
For a quantitative portrayal of the difference, Figure 7 illustrates two features characterizing nucleus texture. Comparative analysis reveals the following insights: 1) Cancerous nuclei are imaged more sharply, as indicated by the contrast characteristic, which reflects the clarity of nuclei. This aligns with the clinical observation that cancerous nuclei appear hyperchromatic under H&E staining; 2) The texture of cancerous nuclei is more intricate, as evidenced by the entropy characteristic. Entropy serves as a measure of the information content within the image, and a higher value indicates a more complex texture. To this end, we extracted 7-dimensional texture features from the nuclei. These features encompass contrast (Con), dissimilarity (Diss), homogeneity (Hom), entropy (Ent), angular second moment (ASM), roughness (Rou), and dispersion (Dsip).
To illustrate the distinction between the corresponding characteristics of cancerous nuclei and normal nuclei, Table 1 presents the ratio of the 16-dimensional features in 5 randomized trials. The color characteristics are excluded due to the substantial H&E color difference, as evident in Figure 6. A more detailed modeling process is described in the STAR methods section.
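Contrast, dissimilarity, homogeneity, entropy, and ASM are classically computed from a gray-level co-occurrence matrix (GLCM). The paper does not state which implementation it uses, so the following is a sketch under that assumption, with the usual Haralick-style definitions and a single horizontal pixel offset; roughness and dispersion are left out because their definitions are not given. Function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def glcm(image: List[List[int]], dx: int = 1, dy: int = 0) -> Dict[Tuple[int, int], float]:
    """Normalized, symmetric co-occurrence probabilities for one pixel offset."""
    counts: Counter = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count both directions (symmetric GLCM)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_features(p: Dict[Tuple[int, int], float]) -> Dict[str, float]:
    """Five GLCM statistics keyed by the abbreviations used in the text."""
    return {
        "Con": sum(v * (i - j) ** 2 for (i, j), v in p.items()),
        "Diss": sum(v * abs(i - j) for (i, j), v in p.items()),
        "Hom": sum(v / (1 + (i - j) ** 2) for (i, j), v in p.items()),
        "ASM": sum(v * v for v in p.values()),
        "Ent": -sum(v * math.log2(v) for v in p.values()),
    }
```

A uniform gray patch yields zero contrast and zero entropy, whereas a checkerboard-like patch yields higher values for both, mirroring the smooth versus complex texture distinction drawn above.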

Dataset description
Currently, prominent public datasets relevant to auxiliary breast cancer diagnosis include BACH2018, 39 BreakHis, 40 and BRACS. 41 The BACH2018 dataset comprises 10 normal WSIs and 20 cancerous WSIs. BRACS stands out as the most extensive dataset available to the author, with 547 WSIs post-update; however, only 423 WSIs were effectively obtained during this paper's experimentation. Within these, 252 were normal, and 171 were cancerous, resulting in a positive/negative sample ratio of 40.43%:59.57%. BreakHis, offering only patches cropped from WSIs without enabling the construction of slide-level features, is not utilized in this study. The provided data includes labeling information, with '0' denoting normal WSIs and '1' indicating cancerous WSIs.
WSIs exhibit strong heterogeneity and diversity stemming from various sources such as different image acquisition devices, scanning protocols, image gatherers, variations in the H&E staining process before scanning, and inherent differences among patients. These factors contribute to distribution drift in WSI images, and the diverse set of samples significantly influences the model's generalization ability. A model trained on an imbalanced sample set may acquire biased information about sample proportions, leading to predictions that overly focus on certain categories and compromise the model's robustness. 42 To enhance sample diversity and maintain a balanced ratio of positive and negative samples, this study incorporates 20 cancerous WSIs from the BACH2018 dataset and 33 self-owned cancerous WSIs in addition to the BRACS dataset. The experimental dataset comprises 476 WSIs with a positive/negative sample ratio of 47.06%:52.94%, as depicted in Table 2.
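The two class ratios quoted above follow directly from the slide counts; the short check below reproduces them. The helper name `class_share` is hypothetical.

```python
def class_share(positives: int, negatives: int) -> tuple:
    """Percentage of positive and negative slides, rounded to two decimals."""
    total = positives + negatives
    return round(100 * positives / total, 2), round(100 * negatives / total, 2)


# BRACS subset used in the paper: 171 cancerous, 252 normal (423 WSIs).
bracs_only = class_share(171, 252)

# After adding 20 BACH2018 and 33 self-owned cancerous WSIs (476 WSIs).
combined = class_share(171 + 20 + 33, 252)

print(bracs_only, combined)
```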

Implementation details
Typically, human tissue does not cover the entire slide during slide preparation, resulting in WSIs containing significant blank spaces, as depicted in Figure 8. Efficiently extracting the human tissue region is crucial for conserving computing resources and enhancing the algorithm's efficiency. The process begins by leveraging the principle that the S-channel of the HSV color model represents a combination of specific spectral colors and white. 43 WSIs are then transformed from RGB to HSV at a high pyramid layer (e.g., layer 4). Subsequently, the human tissue region is extracted using the Otsu method 44 in the S-channel, and any small holes in the tissue are filled through mathematical morphology closing operations. 45 The extracted human tissue contour is filtered based on region thresholds, and the silhouettes are restored to the original image layer.
Furthermore, it is impracticable to load an entire WSI into the GPU directly, due to its large size (e.g., 100,000 × 100,000 pixels), so a patch-based processing approach should be employed. 46 This involves using a sliding window of 512 × 512 pixels on the WSI within the extracted human tissue area, with the top-left coordinates of each sliding window recorded. The final step involves restoring the WSI based on these coordinates. Figure 8 illustrates the step-by-step processing procedure.
The experimental images were divided into training, validation, and test datasets in a ratio of 7:1:2. All experiments were performed on a single desktop computer with the Windows 10 operating system, a six-core, six-thread 3.00 GHz i5-8500 CPU (32 GB memory), and a GTX2080 GPU with 8 GB memory.
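The core of the tissue extraction step, thresholding the saturation channel with Otsu's method, can be sketched with the standard library alone. The example works on a flat list of RGB pixels rather than a real WSI pyramid layer, and it omits the morphological closing and contour filtering; the function names are hypothetical and the code is an illustration, not the authors' pipeline.

```python
import colorsys
from typing import List, Tuple


def saturation_channel(rgb_pixels: List[Tuple[int, int, int]]) -> List[int]:
    """Convert 8-bit RGB pixels to 8-bit HSV saturation values."""
    return [
        round(colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] * 255)
        for r, g, b in rgb_pixels
    ]


def otsu_threshold(values: List[int], bins: int = 256) -> int:
    """Threshold t maximizing the between-class variance of {<= t} and {> t}."""
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(bins):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0 or w0 == total:
            continue  # one class is empty, variance undefined
        w1 = total - w0
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# White background has zero saturation; stained tissue is clearly saturated.
pixels = [(255, 255, 255)] * 5 + [(200, 100, 180)] * 5
sat = saturation_channel(pixels)
tissue_mask = [s > otsu_threshold(sat) for s in sat]
```

Working in the S-channel is what makes the split easy: glass background is nearly white and therefore close to zero saturation regardless of brightness.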
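The sliding-window step amounts to enumerating top-left coordinates on a regular grid and keeping those that fall on tissue. The sketch below assumes a non-overlapping stride equal to the 512-pixel patch size (the paper does not state the stride) and takes the tissue test as an optional callback; `patch_coordinates` and `is_tissue` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple


def patch_coordinates(
    width: int,
    height: int,
    patch: int = 512,
    stride: int = 512,
    is_tissue: Optional[Callable[[int, int], bool]] = None,
) -> List[Tuple[int, int]]:
    """Top-left (x, y) of every full patch, optionally filtered by a tissue test."""
    coords = []
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            if is_tissue is None or is_tissue(x, y):
                coords.append((x, y))
    return coords
```

Because only coordinates are stored, each patch can be read on demand and the slide-level result can later be mapped back onto the WSI from the same list.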
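A 7:1:2 split can be sketched as a seeded shuffle followed by slicing. The paper does not say whether the split was stratified by class or how fractional counts were rounded, so the floor-based rounding, the seed, and the function name `split_dataset` are all assumptions of this example.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence, ratios: Tuple[int, int, int] = (7, 1, 2), seed: int = 0
) -> Tuple[List, List, List]:
    """Shuffle once with a fixed seed, then slice into train/validation/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```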

Ablation experiment
To validate the effectiveness of handcrafted feature constraints, three sets of ablation experiments were conducted. The classification result based on slide-level features is denoted as ablation1 (macro feature), patch-level deep learning feature constraints are characterized as ablation2 (meso feature), and nuclei-level handcrafted feature constraints are indicated as ablation3 (micro feature). For simplicity, the model inference probability is denoted as PrePro (Predicted Probability).
As shown in Table 3, for WSIs (WSI-1, WSI-2) characterized by clear imaging and distinct features, both NPKC-MIL and the ablation experiments consistently achieve accurate classification with high probability.
For WSI-3, whose subclass is flat epithelial atypia (FEA) and which appears cancerous at the macro view (Figure 9A), ablation1 misclassified it as cancerous with a very high probability (0.9988). However, ablation2 significantly reduced the misclassification probability to 0.6992 by introducing the 1024-dimensional meso feature constraint. Although ablation3 is constrained by 16-dimensional micro features, its misclassification probability remains high (0.8867). This suggests that micro features play a restrictive role, but their impact is not as pronounced as that of the 1024-dimensional meso features, which provide more effective constraint information.
For WSI-4, whose subclass is invasive carcinoma (IC), ablation1 misclassified it as normal based on macro features (Figure 9B). However, upon introducing meso feature constraints (ablation2), the misclassification probability significantly decreased to 0.6487, indicating the effectiveness of the 1024-dimensional meso feature constraints. In contrast, relying solely on the 16-dimensional micro handcrafted feature constraint led to a high probability of misclassification as normal (0.9487). This emphasizes the challenge of focusing exclusively on micro features, which runs into the dilemma of 'seeing the trees but not the forest.' Therefore, a combination of meso and micro characteristics is essential to constrain macro features correctly. Notably, NPKC-MIL classified it as cancerous with a probability of 0.5220.
The attention score heatmap in Figure 10 reveals that regions with high attention predominantly consist of nuclei-rich areas, while regions with low attention are primarily composed of muscle or fat tissue. Hence, a more comprehensive analysis at the nuclear level is crucial for the areas exhibiting high attention.
Ultimately, as demonstrated in Table 4, the precision and accuracy of ablation2 and ablation3 surpass those of ablation1, underscoring the effectiveness of patch-level or nuclei-level characteristics in constraining slide-level features. However, the sensitivity (SE) of ablation3 is lower than that of ablation1, indicating that handcrafted constraints at the nuclear level alone are insufficient for detecting cancerous WSIs.
Ablation2 outperforms ablation3 in terms of accuracy (AC), sensitivity (SE), and precision (PC), suggesting that patch-level deep learning features (1024 dimensions) offer more constraint information than nuclei handcrafted features (16 dimensions), albeit with marginal improvement. This analysis also illustrates that although deep learning methods can theoretically extract features of any dimension, a higher feature dimension is not necessarily better; what matters is extracting features with discriminative power. Finally, the amalgamation of nuclei-level and patch-level features enhances the constraint on slide-level features, thereby improving the accuracy of cancerous WSI classification.

Comparative analysis of different methods
The performance of various comparison methods on four test samples is presented in Table 5. For WSIs with clear imaging and distinct features (WSI-1, WSI-2), except for TransMIL's misclassification of WSI-1 as cancerous, other methods correctly classify WSI-1 with high probability. Figure 11C reveals that areas with high attention scores exhibit a complex texture comprising red blood cells, muscle cells, lymphocytes, epithelial cells, etc. Aside from DTFD-MIL, other methods assign higher attention to these areas, and the attention heatmaps of FRMIL and ReMix are identical, indicating the shared use of the same attention framework.
NPKC-MIL accurately classifies WSI-3 and WSI-4. TransMIL misclassifies WSI-3 as normal but correctly classifies WSI-4. FRMIL incorrectly classifies WSI-3 as a cancerous WSI, aligning with its assumption that "cancerous WSIs have more information than normal ones." CLAM-SB misclassifies WSI-3 and WSI-4, suggesting that for WSIs with complex textures, meso features alone may not sufficiently constrain classification results based on macro features. In comparison to NPKC-MIL, ReMix correctly classifies WSI-3 with a higher probability (0.9466), and DTFD-MIL accurately classifies WSI-4 with a higher probability (0.9022), highlighting the simplicity and effectiveness of data enhancement for small sample data.
4][35] The results highlight that TransMIL, by treating patches as relevant entities and considering their spatial topological information through the Transformer framework, achieves an accuracy of 88.75% and a sensitivity of 90.00%. While FRMIL successfully adjusted the distribution of instances, its underlying assumption that "cancerous WSIs contain more information than normal ones" led the model to prioritize the features of cancerous WSIs, consequently diminishing its ability to recognize features in normal WSIs. This bias resulted in a lower specificity (67.5%) and a higher sensitivity (85%). Within the MIL framework, CLAM-SB employs a clustering concept and adds patch-level constraints with a few representative instances, achieving a sensitivity of 90.00%.
The generalization ability of deep learning models is significantly influenced by the diversity of training samples. ReMix proposed the "mix-the-bag" data enhancement method, which is simple but effective, achieving an accuracy of 91.43%. DTFD-MIL uses pseudo-bag technology for data enhancement, but the child bags derived from positive parent bags do not always contain positive instances, which introduces bags with wrong labels. The feature distillation branch used in that study cannot fully eliminate these artificially introduced errors, leading to a low SE of 80%.
It is noteworthy that all the aforementioned methods rely on slide-level and patch-level deep learning features from transfer learning models, neglecting micro-level prior knowledge (handcrafted features). In contrast, NPKC-MIL maximizes the potential of deep learning in feature extraction while incorporating meaningful handcrafted features. This approach not only builds interpretable deep learning models but also achieves higher accuracy, specificity, and sensitivity.
PrePro is an abbreviation for "predict probability"; PreCls is an abbreviation for "predict class".
The image classification process involves a mathematical exploration to identify the mapping relationship between image features and target types. Traditional methods, rooted in feature engineering, rely on effective and comprehensive handcrafted features. In simpler scenarios, the objective function can be determined through sparse regression, and parameters can be solved using least squares. However, for the challenging task of WSI classification, low-level handcrafted features (including color features, texture features, morphological features, etc.) prove insufficient in capturing the complexity of medical images. As illustrated in Table 4, the impact of 16-dimensional handcrafted feature constraints is inferior to that of deep learning features, underscoring the richer constraint information provided by deep learning features. Additionally, for WSIs exhibiting strong heterogeneity, modeling complex scenes through symbolic mathematical methods poses challenges due to the optimization space's intricate structures and coefficients, making it arduous for optimization algorithms, such as simulated annealing and genetic algorithms, to find solutions. Lastly, many traditional feature-engineering algorithms necessitate loading all data into memory simultaneously, which is impractical for gigapixel WSIs.
Why should we pay more attention to prior knowledge at the nucleus level?
Different from parametric models that can be explicitly expressed through mathematical formulas, deep learning operates as a statistical modeling method with unknown structures and coefficients in its equations. It is a data-driven approach optimized by gradient descent algorithms. While the convolution operation can theoretically extract features of any dimension, the lack of interpretability has hindered widespread acceptance among pathologists. As depicted in Figure 12, WSIs exhibit strong heterogeneity and diversity. Various factors, such as different image acquisition devices and variations in the H&E staining process before scanning, can contribute to distribution drift in WSIs. Consequently, extracting nuanced prior knowledge from macro and meso perspectives becomes challenging.

As shown in Figure 13, the nucleus is a concrete object that can be quantitatively described. We can extract various attributes of the nucleus itself and analyze the interactions between nuclei. As indicated in Figure 10, the regions densely populated with nuclei coincide with areas of higher attention scores. Therefore, to enhance the interpretability of the model and broaden the dimension of feature analysis, it is necessary to extract multidimensional prior knowledge of nuclei and conduct quantitative analysis.

How to incorporate prior knowledge into the AI model?
In the construction process of deep learning modeling, the integration of prior knowledge can occur in three key phases: (1) During the initial design of the model structure, the relationship between differentiation and convolution can be harnessed to shape the convolutional kernel. The network structure can then be tailored using principles from the finite element method. (2) Handcrafted features can be introduced to the deep learning features at the bottom layer to augment feature dimensionality, followed by forward propagation layer by layer. The addition of tens of dimensions of handcrafted features to low-level deep learning features with thousands of dimensions may not necessarily enhance model performance. However, incorporating handcrafted features at the bottom layer becomes viable when the dimensionality of these features is sufficiently high. (3) At the top level of the model, the constraints of handcrafted features (prior knowledge) on the model are realized by adding penalty terms to the loss function. In this paper, we realize the constraints of handcrafted features on the classification results by adding a penalty term $L_{nuclei}$ to the model loss function.
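The third option, adding penalty terms to the loss, reduces to a weighted sum of the three losses. The sketch below uses the weights reported in the methods (0.7, 0.2, 0.1) and assumes they apply to the slide-level, patch-level, and nuclei-level losses in that order; the mapping and the function name `total_loss` are assumptions of this example.

```python
def total_loss(
    l_slide: float,
    l_patch: float,
    l_nuclei: float,
    weights: tuple = (0.7, 0.2, 0.1),
) -> float:
    """Slide-level loss plus patch-level and nuclei-level penalty terms."""
    w1, w2, w3 = weights
    return w1 * l_slide + w2 * l_patch + w3 * l_nuclei
```

Because the weights sum to 1, the total stays on the same scale as the individual losses, and the nuclei-level term acts as a light regularizer rather than the dominant objective.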

Conclusion
Recognizing the limitations of current breast histopathology WSI classification algorithms, which rely solely on slide-level and patch-level features while overlooking nuclei-level features, this paper introduces NPKC-MIL for breast histopathology image classification. NPKC-MIL leverages the advantages of convolutional neural networks in model fitting while also imbuing the model with clear physical significance through constraints based on handcrafted features. Experimental results demonstrate that, compared to state-of-the-art methods, the proposed approach exhibits superior specificity, sensitivity, accuracy, and precision in the classification of breast histopathology WSIs. This offers valuable insights for expanding the analytical scope of current WSI classification tasks and incorporating prior knowledge (handcrafted features) into deep learning models.
$$v_j \in \{\, w \mid \mathrm{dist}(v_i, w) \le d_K \wedge \mathrm{dist}(v_i, w) < d_{\min},\ \forall v_i, w \in V \,\} \quad \text{(Equation 20)}$$
where $\mathrm{dist}(v, w)$ is the distance function, taken as the Euclidean distance in this paper; $d_K$ is the distance to the K-th nearest neighbor under $\mathrm{dist}(v, w)$; and $d_{\min}$ is the nearest neighbor distance threshold. In this experiment, K is set to 5 and $d_{\min}$ to 50.
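The neighbor rule of Equation 20 can be sketched as follows: each nucleus is linked to those of its K nearest neighbors that also lie closer than the distance threshold. The example assumes nuclei are represented by centroid coordinates, treats edges as undirected, and uses a brute-force search; these choices and the function name `build_nuclei_graph` are assumptions, not details given in the paper.

```python
import math
from typing import List, Set, Tuple

Point = Tuple[float, float]


def build_nuclei_graph(
    centroids: List[Point], k: int = 5, d_min: float = 50.0
) -> Set[Tuple[int, int]]:
    """Undirected edges (i, j), i < j, between nuclei that are K-NN and within d_min."""
    edges: Set[Tuple[int, int]] = set()
    for i, vi in enumerate(centroids):
        # Distances from nucleus i to every other nucleus, nearest first.
        neighbours = sorted(
            (math.dist(vi, w), j) for j, w in enumerate(centroids) if j != i
        )
        for dist, j in neighbours[:k]:
            if dist < d_min:
                edges.add((min(i, j), max(i, j)))
    return edges
```

The distance threshold prevents isolated nuclei from being linked across empty stroma simply because they happen to be each other's nearest neighbors.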
After constructing the node topology graph, 16-dimensional handcrafted features of the nodes, such as geometric and texture features, are extracted according to the method described in the principle of nuclei-level feature design section. Finally, according to graph convolution theory 50 and the node message passing mechanism, 51 node messages are aggregated and updated according to Equation 21.
(Equation 22)
where $\otimes$ is the tensor product, $I$ is the identity matrix, $D_{r \times r}$ is the degree matrix of nodes, and $r$ is the number of nodes in a patch; $\alpha$ is the scaling factor, negative for attenuation and positive for amplification; $\mu$, $\sigma$, max, and min are the mean value, standard deviation, maximum value, and minimum value of node features, respectively; $S$ is the degree scaling matrix and satisfies:
$$S(D_{r \times r}, \alpha) = \left( \frac{\log(D_{r \times r} + I)}{\delta} \right)^{\alpha} \quad \text{(Equation 23)}$$
where $\delta$ is the normalization index calculated from the training data, and satisfies:
$$\delta = \frac{1}{T_n} \sum_{i,j} \log\left( D_{r \times r}(i, j) + 1 \right) \quad \text{(Equation 24)}$$
where $T_n$ is the number of nodes for each input of training data.

Patch-level feature extraction and slide-level feature construction
The ResNet-50 network model, 52 pre-trained on ImageNet, serves as the patch-level feature extractor in our experiment. Following the third residual module of the network, we apply global average pooling to transform each 512 × 512 input patch into a 1 × 1024 feature vector; these vectors are then aggregated into slide-level features. Initially, all patches are assigned weights based on the multi-head attention mechanism, as per Equation 25. Subsequently, we select the c samples (c = 8) with the highest attention values as the training samples for the nuclei classifier. Finally, the attention pooling method is employed to aggregate patch-level features into slide-level features, as indicated by Equation 26.

$f_{slide} = \sum_{i=1}^{N} s_i p_i$ (Equation 26)

where $s_i$ is the weight of the i-th patch (i = 1, 2, ..., N); $p_i$ is the feature vector of the i-th patch; $f_{slide}$ is the slide-level feature vector. $SA_a$ and $SA_b$ denote the multi-head attention, which satisfies:

$SA(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d_K}} \right) V$ (Equation 27)

$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$ (Equation 28)

where softmax is the normalization function; $QK^{\top} / \sqrt{d_K}$ is the self-attention matrix; Q (queries), K (keys), and V (values) are obtained by applying the linear transformations $K = W_K X$, $Q = W_Q X$, $V = W_V X$ to the input image X; $W_Q$, $W_K$, and $W_V$ are learnable transformation matrices; $\tau(x)$ and $\xi(x)$ are the activation functions, which satisfy:

$\tau(x) = \frac{1}{1 + e^{-x}}; \quad \xi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (Equation 29)

Classify WSIs based on nuclei-level handcrafted feature constraint
The attention mechanism serves to identify potential regions of interest (ROIs). However, a region with a high attention score is not necessarily a diseased area, and conversely, a region with a low attention score is not always normal. This implies that selecting the Top-K ROIs based on attention scores introduces noise. 56 To address this, this paper employs the noise-robust APL loss function 53 for both the patch level and the nuclei level, as per Equation 30, and utilizes the cross-entropy error function of Equation 31 for the slide-level loss. The slide-level loss is then constrained by incorporating the patch-level and nuclei-level losses as penalty terms to formulate the optimization objective function in Equation 32. Finally, a WSI classification model is constructed by approximating the minimum of Equation 32 using the Adam optimizer. 54

$L_{patch,nuclei} = \log_{\prod_{k=1}^{K} p(k|x)} p(y|x) - \sum_{k=1}^{K} p(k|x) \log\left(q(k|x)\right)$ (Equation 30)

$L_{slide} = -\sum_{k=1}^{K} q(k|x) \log p(k|x)$ (Equation 31)

$L_{total} = w_1 L_{slide} + w_2 L_{patch} + w_3 L_{nuclei}$ (Equation 32)

where $q(k|x)$ is the truth distribution; $p(k|x)$ is the predicted distribution; $L_{slide}$, $L_{patch}$, $L_{nuclei}$, and $L_{total}$ are the slide-level, patch-level, nuclei-level, and total losses, respectively. In this paper, $w_1 = 0.7$, $w_2 = 0.2$, $w_3 = 0.1$.
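To make the attention pooling step concrete, the sketch below implements the softmax normalization (Equation 28), scaled dot-product attention (Equation 27), and the weighted sum that forms the slide-level feature (Equation 26), and then picks the c highest-weighted patches. It is a plain-Python illustration under assumptions: the raw attention scores are passed in directly because the scoring network of Equation 25 is not reproduced here, identity projections are used in place of the learnable matrices, and all function names and toy values are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_K)) V, with a row-wise softmax."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def attention_pool(raw_scores, patch_feats, c):
    """Return (slide-level feature, indices of the c most attended patches)."""
    s = softmax(raw_scores)  # patch weights s_i, summing to 1
    dim = len(patch_feats[0])
    f_slide = [
        sum(s[i] * patch_feats[i][d] for i in range(len(s))) for d in range(dim)
    ]
    top_c = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:c]
    return f_slide, top_c


# Toy example: 4 patches with 2-dimensional features (the paper uses 1024).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
f_slide, top_c = attention_pool([2.0, 0.5, 3.0, -1.0], patches, c=2)
print(top_c)  # indices of the two highest-scoring patches: [2, 0]

# Self-attention over the same patches with identity projections (Q = K = V = X).
attended = scaled_dot_product_attention(patches, patches, patches)
```

With c = 8, as in the text, the returned indices would identify the patches whose nuclei are passed to the nuclei classifier.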
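The weighted three-level objective can also be sketched in a few lines. The block below assumes that the APL loss of Equation 30 is the sum of a normalized cross-entropy term and a reverse cross-entropy term, that log 0 in the reverse term is replaced by a finite constant (here -4, a hypothetical choice), and that w1, w2, and w3 weight the slide-level, patch-level, and nuclei-level losses in that order. The function names and probabilities are illustrative only, and a single sample per level stands in for a training batch.

```python
import math

LOG_ZERO = -4.0  # hypothetical finite stand-in for log(0) on one-hot labels


def normalized_cross_entropy(p, y):
    """log p(y|x) divided by the sum of log p(k|x) over all classes."""
    return math.log(p[y]) / sum(math.log(pk) for pk in p)


def reverse_cross_entropy(p, y):
    """-sum_k p(k|x) log q(k|x) for a one-hot label q with true class y."""
    return -sum(pk * (0.0 if k == y else LOG_ZERO) for k, pk in enumerate(p))


def apl_loss(p, y):
    """Noise-robust loss for the patch and nuclei levels (sketch of Equation 30)."""
    return normalized_cross_entropy(p, y) + reverse_cross_entropy(p, y)


def cross_entropy(p, y):
    """Standard cross-entropy for a one-hot label (slide level)."""
    return -math.log(p[y])


def total_loss(p_slide, p_patch, p_nuclei, y, w=(0.7, 0.2, 0.1)):
    """Weighted sum of slide-, patch-, and nuclei-level losses."""
    return (
        w[0] * cross_entropy(p_slide, y)
        + w[1] * apl_loss(p_patch, y)
        + w[2] * apl_loss(p_nuclei, y)
    )


# Toy predicted distributions over (normal, cancerous); the true class is 1.
# All probabilities must lie strictly between 0 and 1 for the logarithms.
loss = total_loss([0.2, 0.8], [0.3, 0.7], [0.4, 0.6], y=1)
print(round(loss, 4))
```

Because the slide-level term carries the largest weight, the patch-level and nuclei-level terms act as penalties that adjust, rather than dominate, the slide-level prediction.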

Evaluation criteria
The commonly used classification evaluation criteria are accuracy (AC); specificity (SP), which measures an algorithm's ability to recognize normal WSIs; sensitivity (SE), also known as recall, which measures the ability of the algorithm to identify cancerous WSIs; and precision (PC). Given GT (ground truth) as the true label, denote a normal WSI as GT = 0 and a cancerous one as GT = 1. Let PreCls (predicted class) be the predicted result of the algorithm, then:

$AC = \frac{TP + TN}{TP + TN + FP + FN}, \quad SP = \frac{TN}{TN + FP}, \quad SE = \frac{TP}{TP + FN}, \quad PC = \frac{TP}{TP + FP}$
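The four criteria follow the standard confusion-matrix definitions, which the short sketch below computes from ground-truth and predicted labels (1 for cancerous, 0 for normal). The function names and the toy label lists are hypothetical, and the sketch assumes every denominator is nonzero.

```python
def confusion_counts(gt, pred):
    """Count TP, TN, FP, FN with GT = 1 for cancerous and GT = 0 for normal."""
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(gt, pred):
    """Return accuracy, specificity, sensitivity, and precision."""
    tp, tn, fp, fn = confusion_counts(gt, pred)
    return {
        "AC": (tp + tn) / (tp + tn + fp + fn),
        "SP": tn / (tn + fp),  # share of normal WSIs recognized as normal
        "SE": tp / (tp + fn),  # share of cancerous WSIs recognized as cancerous
        "PC": tp / (tp + fp),  # share of predicted-cancerous WSIs that are cancerous
    }


# Toy example: 4 cancerous and 6 normal slides.
gt = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = evaluate(gt, pred)
print(metrics)
```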

QUANTIFICATION AND STATISTICAL ANALYSIS
All statistical details, such as accuracy and sensitivity, can be located in the results, methods, and/or figure legends.

Figure 1. Flowchart of NPKC-MIL (A) Extracting patch features by transfer learning. (B) Ranking the importance of patches by attention score. (C) Calculating slide-level loss. (D) Calculating patch-level loss. (F) Constructing total loss function.

Figure 5. Schematic diagram of the geometric features of the nuclei in part

Figure 7. Mean nuclei contrast and entropy of five randomized trials (A) The contrast difference. (B) The entropy difference.

Figure 8. Tissue extraction and patching (the green line indicates the tissue contour, and the small rectangular boxes in step B are sliding windows of 512 × 512 pixels used to extract patches for training) (A) Extracting tissue. (B) Slicing WSI into patches.

Figure 9. The misclassified test samples and their macro, meso, and micro characteristics diagram (the yellow circle is the nuclei centroid, the black circle is the nuclei contour, and the blue lines represent the adjacency relationship of nuclei)

Figure 10. Attention heatmaps of the four test samples (yellow boxes for high attention score areas, red boxes for low attention score areas)

Figure 11. Attention score heatmaps of different comparison methods on WSI-3 and WSI-4 (yellow boxes are areas of high attention score) (A) Attention score heatmaps of different methods on WSI-3. (B) Attention score heatmaps of different methods on WSI-4. (C) Zoomed diagram of the yellow boxes in (A) and (B).

Figure 13. Nuclei in micro perspectives (the yellow circle is the nuclei centroid, the black circle is the nuclei contour, and the blue lines represent the adjacency relationship of nuclei)

In Equation 21, $h^{(t)}(v)$ and $h^{(t+1)}(v)$ are the feature vectors of node v at the current layer and the next layer, respectively; $h^{(t)}(u)$, $u \in N(v)$, are the feature vectors of the nodes adjacent to node v; U and M are multi-layer perceptrons; $N(v)$ is the set of nodes adjacent to node v; $\oplus$ is a multi-scale scaling and aggregation operator that combines the identity scaler $I$ with the degree scalers $S(D_{r \times r}, \alpha = 1)$ and $S(D_{r \times r}, \alpha = -1)$, as given in Equation 22.

In the evaluation criteria, TP (true positive) means GT = 1 and PreCls = 1; TN (true negative) means GT = 0 and PreCls = 0; FN (false negative) means GT = 1 and PreCls = 0; FP (false positive) means GT = 0 and PreCls = 1.

Table 1. Ratio of handcrafted features between cancerous nuclei and normal ones

Table 2. Construction of experimental datasets

Table 3. Performance of different ablation experiments on four test samples

Table 4. Comparison table of ablation experiments

Table 5. The performance of different comparison methods on four test samples

Table 6. Comparison of evaluation criteria of different algorithms

Figure 12. Strong heterogeneity of WSI from macro and meso perspectives