1 Introduction

Anomaly detection (AD) [1, 2] aims to detect anomalous samples that deviate from a set of normal samples predefined during training. Traditional image anomaly detection adopts a semantic AD setting [3,4,5,6], where anomalous samples come from unknown semantic classes different from the one the normal samples belong to. Recently, detecting and localizing subtle image anomalies has become an important task in computer vision with various applications, such as anomaly or defect detection in industrial optical inspection [7, 8], anomaly detection and localization in video surveillance [9,10,11], and anomaly detection in medical images [12, 13]. In this setting, anomaly detection determines whether an image contains any anomaly, while anomaly localization, aka anomaly segmentation, localizes the anomalies at the pixel level. This paper focuses on the second setting, especially industrial anomaly detection and localization. Some examples from the MVTec AD dataset [8], along with predictions by our method, are shown in Fig. 1.

Fig. 1

Examples from the MVTec AD benchmark dataset. From top to bottom: anomalous samples, anomaly masks, and anomaly score maps predicted by our method

In the above applications, anomalous samples are scarce and hard to collect. Therefore, image anomaly detection and localization are often solved with only normal samples. In addition, anomalous regions within images are often subtle (see Fig. 1), making image anomaly localization a more challenging task that has been studied less thoroughly than image anomaly detection. Recent anomaly localization methods can be roughly categorized into two classes: reconstruction-based methods and OOD-based (out-of-distribution based) methods.

Reconstruction-based methods are mainly based on the assumption that a model trained only on normal images cannot reconstruct anomalous images accurately. They reconstruct the image as a whole [8, 12, 14,15,16,17,18,19,20,21] or reconstruct in the feature space [22,23,24]. Anomaly detection and localization can then be performed by measuring the difference between the reconstructed and original images or features. These methods typically require cumbersome network training.

OOD-based methods evaluate the degree of abnormality of a patch feature by measuring its deviation from a set of normal patch features, which is intrinsically a patch-wise OOD detection task. Some methods, such as PatchSVDD [25] and CutPaste [26], learn feature representations by self-supervised learning. In contrast, other methods [27,28,29,30] simply extract features with deep networks pre-trained on natural image datasets such as ImageNet [31] and achieve promising, often better, performance. Since the number of training patches is much larger than that of training images, the inference time and storage grow remarkably. Different strategies have been proposed to tackle this problem. Napoletano et al. [27] used k-means to learn a dictionary of prototypes for normal patch features, but they evaluated each test patch independently, resulting in high inference time. SPADE [28] selects the k-nearest normal images for patch-wise evaluation based on global image features, limiting anomaly localization performance. PaDiM [29] models the normal patches at each position by a multidimensional Gaussian distribution and measures the anomaly by the Mahalanobis distance between a test patch feature and the Gaussian at the same position. However, both SPADE [28] and PaDiM [29] rely on image alignment. The current state-of-the-art method, PatchCore [30], uses greedy coreset subsampling to significantly reduce the inference time and storage.

This paper proposes ProtoAD, a prototype-based neural network for image anomaly detection and localization, to improve the inference speed of OOD-based methods. We assume that all normal patch features can be grouped into a set of prototypes, and that abnormal patch features cannot be properly assigned to any of them. Therefore, image anomaly localization can be performed by measuring the deviation of test patch features from the prototypes of normal patch features. First, the patch features of normal images are extracted by a deep network pre-trained on natural images and are L2-normalized. Then the prototypes of the normalized normal patch features are learned by a non-parametric clustering algorithm. The cosine similarity between two L2-normalized vectors is equivalent to the dot product between them. Therefore, the cosine similarity between a normalized patch feature and a prototype can be implemented by a \(1\times 1\) convolution. Based on this equivalence, we construct an image anomaly localization network (ProtoAD) by appending to the feature extraction network the L2 feature normalization, a \(1\times 1\) convolutional layer, a channel max-pooling, and a subtraction operation. We use the prototypes as the kernels of the \(1\times 1\) convolutional layer; therefore, our neural network does not need a training phase. Compared with previous OOD-based methods [27,28,29,30], ProtoAD can perform anomaly detection and localization in an end-to-end manner, which is more elegant and efficient. Extensive experiments on two challenging industrial anomaly detection datasets, MVTec AD [8] and BTAD [32], demonstrate that ProtoAD achieves competitive performance compared to the state-of-the-art methods with a higher inference speed. This advantage makes ProtoAD a better match for the needs of real-world industrial applications.

2 Related Works

2.1 Image Anomaly Localization

Anomaly detection is an image-level task that determines whether an image contains any anomaly. Anomaly localization is the more complex task of locating anomalies at the pixel level. Here, we only introduce methods that can be directly applied to image anomaly localization and roughly categorize current methods into two types: reconstruction-based and OOD-based.

Reconstruction-based methods are mainly based on the assumption that a model trained only on normal images cannot reconstruct anomalous images accurately, so anomaly detection and localization can be performed by measuring the difference between the reconstructed and original images. Early reconstruction-based methods [8, 12, 14, 15, 17] reconstruct images with auto-encoders (AE), variational auto-encoders (VAE), or generative adversarial networks (GAN). However, these neural networks have high generalization capacity and can often reconstruct anomalies well, too. Different strategies have since been proposed to tackle this problem. Memory-based auto-encoders [16, 18, 20] reconstruct images with features from a memory bank to limit the generalization ability. Student-teacher models [22, 23] have been used to reconstruct pre-trained deep features. RIAD [19] randomly removes partial image regions and reconstructs the image by in-painting. Glance [24] trains a Global-Net to regress the deep features of cropped patches based on their context. DRAEM [21] combines a reconstructive sub-network and a discriminative network and trains them in an end-to-end manner on synthetically generated just-out-of-distribution images.

OOD-based methods evaluate the degree of abnormality of a patch feature by measuring its deviation from a set of normal patch features, which is intrinsically a patch-wise OOD detection task. Some methods, such as PatchSVDD [25] and CutPaste [26], learn feature representations by self-supervised learning. In contrast, other methods [27,28,29,30] simply extract features with deep networks pre-trained on natural image datasets such as ImageNet [31] and achieve promising, often better, performance. Since the number of training patches is much larger than that of training images, the inference time and storage grow remarkably. Different strategies, such as clustering, density estimation, and sampling, have been proposed to tackle this problem. Napoletano et al. [27] learned a dictionary of normal patches from the training set by k-means and evaluated each patch of a test image by measuring its visual similarity with the k-nearest neighbors in the dictionary. SPADE [28] compares the patch features of a test image with the patch features at the same positions of the k-nearest normal images selected based on global image features. However, this oversimplified pre-selection strategy limits the localization performance. PaDiM [29] models the normal patches at each position by a multidimensional Gaussian distribution and detects anomalies by the Mahalanobis distance between a test patch feature and the Gaussian at the same position. Both SPADE [28] and PaDiM [29] rely on image alignment. Recently, PatchCore [30] constructs a memory bank of locally aware patch features by greedy coreset subsampling and localizes anomalies by measuring the distances of test patch features to their nearest normal patch features in the bank. As a result, PatchCore achieves a new state of the art and significantly reduces the inference time and storage.

Our method is also an OOD-based method with pre-trained deep features, but it differs from previous works in several respects. It uses non-parametric clustering instead of the k-means in [27] to learn the prototypes for normal patch features. More importantly, it performs anomaly detection and localization with a single network in an end-to-end manner, which is more elegant and efficient than the previous methods. Compared to reconstruction-based methods, our network does not need a cumbersome training phase.

2.2 Clustering Algorithms

Clustering is an unsupervised learning task that divides a set of unlabeled data points into groups such that points in the same group are more similar to each other than to points in other groups. Clustering provides an abstraction from data points to clusters, and each cluster can be characterized by a cluster prototype, such as its centroid, for further analysis. Clustering algorithms can be roughly divided into four categories: partition-based clustering, density-based clustering, spectral clustering, and hierarchical clustering.

Partition-based clustering algorithms divide the data into k groups, where k is the predefined number of clusters. The classical algorithms are k-means [33] and its variants. Although these algorithms are very fast, they require the number of clusters as a parameter and are sensitive to the selection of the initial k centroids.

Density-based clustering defines a cluster as the largest set of densely connected points and can find clusters of arbitrary shapes. DBSCAN [34] is the most representative algorithm of this class. It has two parameters, a radius \(\epsilon \) and a minimum number of points MinPts: a point with at least MinPts points within its \(\epsilon \)-radius is regarded as a high-density point.

Spectral clustering [35] has recently attracted much attention. However, most spectral clustering algorithms need to compute the Laplacian matrix of the full similarity graph and have quadratic complexity, which severely restricts their application to large data sets.

Hierarchical clustering [36] comes in two types: bottom-up and top-down approaches. In the bottom-up approach (aka agglomerative clustering), each data point starts as its own cluster, and the most similar cluster pairs are iteratively merged according to the chosen similarity measure until some stopping criterion is met. In the top-down approach (aka divisive clustering), clustering begins with one large cluster containing all the data, which is recursively broken down into smaller clusters. Hierarchical clustering produces a clustering tree that provides meaningful ways to interpret data at different levels of granularity. Recently, Sarfraz et al. [37] proposed FINCH, a high-speed, scalable, and fully parameter-free hierarchical agglomerative clustering algorithm.

Fig. 2

An overview of the proposed method. First, the patch features of normal images are extracted by a deep network pre-trained on natural images. Then, the prototypes of the normal patch features are learned by FINCH clustering. For inference, an image anomaly localization network (ProtoAD) is constructed by appending to the feature extraction network the L2 feature normalization, a \(1\times 1\) convolutional layer, a channel max-pooling (CMP), and a subtraction operation, and anomaly localization is performed in an end-to-end manner

In [27], k-means is used to learn the prototypes from normal patch features. To avoid choosing the number of clusters in advance, we adopt FINCH to learn the prototypes for normal patch features.

3 Method

Our method consists of three steps: patch feature extraction, prototype learning, and anomaly detection and localization. An overview of our method is given in Fig. 2. We describe the three steps sequentially in the following subsections.

3.1 Patch Feature Extraction

Since features extracted by pre-trained networks have shown their effectiveness for various visual applications, including anomaly detection [22, 23, 27,28,29,30], we also adopt a deep network pre-trained on the ImageNet dataset [31] as the feature extractor, choosing a Wide-ResNet [38] backbone following previous works [28,29,30].

ResNet-like deep networks [38, 39] consist of several convolutional stages. The features become more abstract as the stages go deeper, but their resolution gets lower. Thus, the feature maps from different stages form a feature hierarchy for an input image. Each spatial position of a feature map has a receptive field and corresponds to a patch/region of the input image; therefore, the feature vector at a spatial position of the feature maps can be considered a feature representation of the corresponding image patch. If the feature maps of a stage have a resolution of \(H\times W\), they contain \(H\times W\) patch features. The deep and abstract features from ImageNet pre-trained networks are biased towards the ImageNet classification task and are less relevant to the anomaly detection and localization task. Therefore, we adopt the low- and mid-level (stage 1–3) feature representations and combine them as the patch features. Concretely, the feature maps at the higher levels are bilinearly re-scaled to the resolution of the lowest level, and the feature maps at the different levels are then concatenated to handle multi-scale anomalies. The extracted features are finally L2-normalized, i.e., each feature vector is divided by its L2 norm.
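
The following is a minimal sketch of this feature extraction step, assuming torchvision's wide_resnet50_2 as the backbone with forward hooks on its stage-1 to stage-3 blocks (layer1–layer3); function and variable names such as extract_patch_features are illustrative, not the paper's actual implementation.

```python
# Sketch of multi-level patch feature extraction (assumptions noted above).
import torch
import torch.nn.functional as F
from torchvision.models import wide_resnet50_2

backbone = wide_resnet50_2(pretrained=True).eval()
features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output
    return hook

# Stages 1-3 of the ResNet-like hierarchy (low- and mid-level features).
for name in ("layer1", "layer2", "layer3"):
    getattr(backbone, name).register_forward_hook(save_output(name))

@torch.no_grad()
def extract_patch_features(images):           # images: (B, 3, 224, 224)
    features.clear()
    backbone(images)
    f1, f2, f3 = features["layer1"], features["layer2"], features["layer3"]
    h, w = f1.shape[-2:]                      # resolution of the lowest stage
    # Bilinearly re-scale deeper maps to the lowest-stage resolution, then concatenate.
    f2 = F.interpolate(f2, size=(h, w), mode="bilinear", align_corners=False)
    f3 = F.interpolate(f3, size=(h, w), mode="bilinear", align_corners=False)
    patches = torch.cat([f1, f2, f3], dim=1)  # (B, C, H, W)
    # L2-normalize each patch feature vector along the channel dimension.
    return F.normalize(patches, p=2, dim=1)
```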

3.2 Prototype Learning

After feature extraction, the prototypes of the L2-normalized patch features are learned by a clustering algorithm. The prototypes are then used for anomaly detection and localization instead of all the normal patch features, reducing the inference time and storage. There are two main concerns in choosing a clustering algorithm. First, the number of patch features is much larger than the number of training images. For example, each category of the MVTec AD dataset has several hundred images but several hundred thousand patch features in our implementation. Therefore, the clustering algorithm should be efficient and scalable to large-scale data. Second, most clustering algorithms have parameters, e.g., the number of clusters or distance thresholds, which cannot be set well without a priori knowledge of the data distribution. These algorithms thus demand a tedious parameter tuning process to achieve good performance. To meet the requirements of real applications, we adopt FINCH [37], a high-speed, scalable, and fully parameter-free hierarchical agglomerative clustering algorithm.

The core idea of FINCH is to use the nearest neighbor information of each data point for clustering, which does not need to specify any parameters and has a low computational overhead. Given the integer indices of the first neighbor of each data point, an adjacency matrix is defined according to the following rules:

$$\begin{aligned} A(i,j)=\left\{ \begin{aligned} 1,&\text { if } j=\kappa _i^1 \text { or } \kappa _j^1=i \text { or } \kappa _i^1=\kappa _j^1 \\ 0,&\text { otherwise} \end{aligned} \right. \end{aligned}$$
(1)

where \(\kappa _i^1\) denotes the first neighbor of data point i. This sparse adjacency matrix specifies a graph in which connected data points form clusters. It directly provides clusters without solving a graph segmentation problem. After computing the first partition, FINCH merges the clusters recursively by using the cluster means to compute the first neighbor of each cluster, until all data points are included in a single cluster or some stopping criterion is met. In this work, we define the stopping criterion as the number of clusters dropping below a threshold, which we set to 10,000 to get good results in our experiments. We choose the last partition as the clustering result and use the mean vectors of the clusters as the prototypes of the normal patch features.
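
The sketch below illustrates how the first FINCH partition of Eq. (1) could be computed: each point is linked to its first neighbor, and the connected components of the resulting graph form the clusters. It uses brute-force cosine similarity on L2-normalized features; the recursive merging over cluster means and the efficiency optimizations of the official implementation are omitted, and re-normalizing the prototypes is our assumption so that the dot product used later is a true cosine similarity.

```python
# Sketch of the first FINCH partition (Eq. 1); see assumptions above.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def finch_first_partition(x):
    """x: (N, D) L2-normalized patch features. Returns labels and prototypes."""
    n = x.shape[0]
    sim = x @ x.T                               # cosine similarity = dot product
    np.fill_diagonal(sim, -np.inf)              # exclude self-matches
    first_nn = sim.argmax(axis=1)               # kappa_i^1 for every point
    # Link i <-> kappa_i^1; points sharing the same first neighbor
    # (kappa_i^1 == kappa_j^1) end up in one component via that neighbor.
    rows = np.arange(n)
    adj = csr_matrix((np.ones(n), (rows, first_nn)), shape=(n, n))
    n_clusters, labels = connected_components(adj + adj.T, directed=False)
    # Cluster means (re-normalized) serve as the prototypes.
    prototypes = np.stack([x[labels == k].mean(axis=0) for k in range(n_clusters)])
    prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
    return labels, prototypes
```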

When the features are L2-normalized (i.e., each vector has unit length), cosine similarity and Euclidean distance between the normalized features are equivalent in the sense of nearest neighbor search:

$$\begin{aligned} \begin{aligned} \frac{1}{2}{L_2(\textbf{x}_a,\textbf{x}_b)}^2 = \frac{1}{2}(\textbf{x}_a-\textbf{x}_b)\cdot {(\textbf{x}_a-\textbf{x}_b)} = 1-\textbf{x}_a\cdot \textbf{x}_b=1-\cos {(\textbf{x}_a,\textbf{x}_b)} \end{aligned} \end{aligned}$$
(2)

where \(L_2()\) is the Euclidean distance, \(\textbf{x}_a\) and \(\textbf{x}_b\) are two L2-normalized feature vectors, and \(\cos \) is the cosine similarity. Therefore, we use cosine similarity for clustering and for measuring the deviation of test patch features from the normal patch features in the next subsection.
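
A quick numeric check of Eq. (2) with arbitrary unit-norm vectors (illustrative only):

```python
# Verify: half the squared Euclidean distance equals one minus the cosine
# similarity for unit-norm vectors.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 128))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

lhs = 0.5 * np.linalg.norm(a - b) ** 2
rhs = 1.0 - a @ b
assert np.isclose(lhs, rhs)   # nearest in L2 == most similar in cosine
```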

3.3 Neural Network for Anomaly Detection and Localization

When a test image passes through the feature extraction network, \(H\times W\) patch features are extracted. The anomaly score of each patch feature can be computed by measuring its deviation from the prototypes of the normal patch features. We compute the anomaly score of a test patch as one minus the cosine similarity between the normalized test patch feature and its nearest prototype. Formally, the anomaly score for the patch at position \((i,j)\) is

$$\begin{aligned} s_{ij} = 1 - \max \limits _{1\leqslant k \leqslant K}\cos {(\textbf{x}_{ij}, \textbf{m}_k)} \end{aligned}$$
(3)

where \(\textbf{x}_{ij}\) is the normalized patch feature at position \((i,j)\), \(\textbf{m}_k\) is the k-th prototype, and \(\cos \) is the cosine similarity. In addition, the image-level anomaly score for a test image can simply be computed as the maximum of the anomaly scores of all its patch features:

$$\begin{aligned} S = \max \limits _{1\leqslant i \leqslant H, 1\leqslant j \leqslant W} s_{ij} \end{aligned}$$
(4)

The cosine similarities between a normalized patch feature and the prototypes can be computed by \(1\times 1\) convolutions (dot products). Based on this equivalence, we construct a neural network (ProtoAD) for anomaly detection and localization. First, the L2 feature normalization and a \(1\times 1\) convolutional layer are appended to the feature extraction network, producing feature maps of size \(H\times W \times K\) that contain the cosine similarities between the \(H\times W\) normalized patch features and all K prototypes. Then, channel max-pooling (CMP) is applied to these feature maps to obtain the \(H\times W\) normal score map, containing the cosine similarities between the \(H\times W\) normalized patch features and their nearest prototypes. The anomaly score map is obtained by subtracting the normal score map from one. This process is illustrated in Fig. 3. Since the spatial resolution of the feature maps is lower than that of the input image, we resize the anomaly score map to the input resolution and smooth it with a Gaussian filter. Finally, anomaly localization is achieved by thresholding the anomaly score map, and the anomaly score for the test image is obtained as the maximum of the anomaly score map.

We use the prototypes of the normal patch features as the kernels of the \(1\times 1\) convolutional layer. Therefore, the proposed neural network does not need a training phase. Compared to previous works [27,28,29,30], our method can perform anomaly detection and localization in an end-to-end manner, which is more elegant and efficient.
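
A minimal PyTorch sketch of such an inference head is given below, assuming prototypes is a \((K, C)\) tensor of (normalized) cluster means from the clustering step and extract_patch_features is a feature extractor as sketched in Sect. 3.1; the module name and the placement of the resizing step are illustrative, not the exact implementation.

```python
# Sketch of the ProtoAD head: L2 normalization, 1x1 convolution with prototype
# kernels, channel max-pooling, and subtraction (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoADHead(nn.Module):
    def __init__(self, prototypes):            # prototypes: (K, C)
        super().__init__()
        k, c = prototypes.shape
        # 1x1 convolution whose kernels are the prototypes: its output at each
        # position is the dot product (cosine similarity) with every prototype.
        self.proj = nn.Conv2d(c, k, kernel_size=1, bias=False)
        self.proj.weight.data = prototypes.view(k, c, 1, 1)

    def forward(self, patch_features, out_size):
        x = F.normalize(patch_features, p=2, dim=1)      # idempotent if already normalized
        sims = self.proj(x)                              # (B, K, H, W)
        normal_score, _ = sims.max(dim=1, keepdim=True)  # channel max-pooling (CMP)
        score_map = 1.0 - normal_score                   # Eq. (3) at every position
        return F.interpolate(score_map, size=out_size,   # resize to input resolution
                             mode="bilinear", align_corners=False)

# Image-level score, Eq. (4): maximum over the (Gaussian-smoothed) score map, e.g.
# image_score = smoothed_map.amax(dim=(-2, -1)).
```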

Fig. 3

Anomaly detection and localization process of ProtoAD

4 Experiments

4.1 Datasets and Metrics

4.1.1 Dataset

The MVTec AD dataset [8] is a real-world industrial defect detection dataset that has become a standard benchmark for evaluating image anomaly detection and localization methods. It has 5354 high-resolution images belonging to 10 object and 5 texture categories. The images of each category are split into a training set and a testing set. In total, the training set has 3629 normal images, and the test set has 1725 images, both normal and anomalous with various defects. The ground truth of the test set contains anomaly labels for image-level evaluation and anomaly masks for pixel-level evaluation.

BTAD (BeanTech Anomaly Detection dataset) is a real-world industrial dataset recently released by [32]. It contains a total of 2830 real-world images of 3 industrial products. The images of each category are split into a defect-free training set and a testing set, supporting evaluation of both anomaly detection and localization.

We follow the official training/testing splits of both datasets.

4.1.2 Evaluation Metrics

AUROC (Area Under the Receiver Operating Characteristic curve) is the most commonly used metric for anomaly detection and is independent of the threshold. We use image-level AUROC to evaluate anomaly detection and pixel-level AUROC to evaluate anomaly localization. Since pixel-level AUROC is biased in favor of large anomalies, we also use the PRO-score (per-region overlap) [22] to evaluate anomaly localization, which weights ground-truth regions of different sizes equally.
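
For illustration, the two AUROC metrics could be computed as in the sketch below, assuming scikit-learn and numpy arrays of ground-truth labels, pixel masks, and predicted score maps; the PRO-score, which additionally averages per-region overlaps over thresholds, is omitted for brevity.

```python
# Sketch of the image-level and pixel-level AUROC metrics (assumptions above).
import numpy as np
from sklearn.metrics import roc_auc_score

def image_level_auroc(labels, score_maps):
    # labels: (N,) 0/1 image labels; score_maps: (N, H, W) anomaly score maps.
    # The image score is the maximum of the score map (Eq. 4).
    image_scores = score_maps.reshape(len(score_maps), -1).max(axis=1)
    return roc_auc_score(labels, image_scores)

def pixel_level_auroc(masks, score_maps):
    # masks: (N, H, W) binary ground-truth anomaly masks.
    return roc_auc_score(masks.ravel(), score_maps.ravel())
```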

4.2 Experimental Setup

We resize the images from all categories of the MVTec AD and BTAD datasets to \(256\times 256\), center-crop them to \(224\times 224\), and do not apply any data augmentation. The backbone of a Wide-ResNet50 pre-trained on ImageNet is employed as the feature extractor, as in [28,29,30]. We define the stopping criterion for the FINCH clustering algorithm as the number of clusters dropping below 10,000 and choose the last generated partition as the clustering result. For inference, we up-sample the anomaly score map to the image size using bilinear interpolation and smooth it with a Gaussian filter with parameter \(\delta =4\), as in [29]. We implemented our models in Python 3.7 [40] and PyTorch [41] and ran the experiments on an NVIDIA GeForce RTX 2080 Ti.
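
The following sketch shows one possible implementation of this pre-processing and score-map smoothing, assuming torchvision and SciPy; the ImageNet normalization statistics are a standard assumption for a pre-trained backbone and are not stated explicitly above.

```python
# Sketch of the pre-processing pipeline and Gaussian smoothing (assumptions above).
from torchvision import transforms
from scipy.ndimage import gaussian_filter

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),        # resize to 256x256
    transforms.CenterCrop(224),           # center crop to 224x224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet stats
                         std=[0.229, 0.224, 0.225]),   # (assumed, not stated)
])

def smooth_score_map(score_map, sigma=4):
    # Gaussian smoothing of the up-sampled anomaly score map, as in PaDiM [29].
    return gaussian_filter(score_map, sigma=sigma)
```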

4.3 Results on MVTec AD

4.3.1 Comparison with the State-of-the-art

We compare ProtoAD with state-of-the-art methods, including both reconstruction-based and OOD-based methods. The compared reconstruction-based methods are Uninformed Students (U-Student) [22], RIAD [19], MKD [23], Glance [24], DAAD [20], and DRAEM [21]. The compared OOD-based methods are SPADE [28], PatchSVDD (P-SVDD) [25], CutPaste [26], PaDiM [29], and PatchCore (P-Core) [30]. We directly use their reported evaluation results where available.

Table 1 Anomaly localization performance on MVTec AD (Pixel-level AUROC)
Table 2 Anomaly localization performance on MVTec AD (PRO-score)

We report the evaluation results (pixel-level AUROC and PRO-score) for pixel-level anomaly localization on the MVTec AD dataset in Tables 1 and 2, respectively. From Table 1, we can see that the OOD-based methods generally achieve better pixel-level AUROC than the reconstruction-based methods. Among the OOD-based methods, those using pre-trained deep features achieve better pixel-level AUROC than those based on self-supervised learning. PatchCore achieves the best pixel-level AUROC, PaDiM the second, and the reconstruction-based method DRAEM the third. The pixel-level AUROC of our method is very close to those of PaDiM and DRAEM. We also notice that our method is more effective on the texture categories, where it achieves the second-best AUROC. Table 2 gives the PRO-score results for the methods that have reported this metric. Among them, Glance achieves the best result; our method is the second best and outperforms the other OOD-based methods. Overall, our method achieves anomaly localization performance competitive with the state-of-the-art methods.

Figure 4 gives qualitative anomaly localization results of our method on MVTec AD dataset. We can see that our method can give accurate pixel-level localization regardless of anomaly region size and type (see supplementary for more qualitative results).

We also report the image-level AUROC results for anomaly detection in Table 3. PatchCore again achieves the best AUROC, and DRAEM the second. Our method remains competitive and achieves the third-best AUROC, which is very close to that of DRAEM.

4.3.2 Inference Efficiency

Anomaly detection and localization algorithms need both high precision and high inference speed to match the requirements of real-world applications. Thus, we also report the inference speed of our method and of previous OOD-based methods using pre-trained deep features [28,29,30] in Table 4. In these experiments, all methods adopt a Wide-ResNet50 pre-trained on ImageNet as the feature extractor, take center-cropped \(224\times 224\) images as input, and run on the same machine with an NVIDIA GeForce RTX 2080 Ti. For PatchCore, we use the implementation provided by the authors, which downsamples the normal patch features via greedy coreset subsampling (PatchCore-\(x\%\) denotes that \(x\%\) of the normal patch features are used in inference) and uses faiss [42] for nearest neighbor retrieval and distance computations. For PaDiM, we extensively optimize the implementation via GPU acceleration. Compared with the previous methods, our model achieves the highest speed, which is 1.2x, 2.7x, and 9.5x faster than PaDiM, PatchCore, and SPADE, respectively. The high inference speed is mainly because our model performs inference in an end-to-end manner, and the main computation added to the feature extraction network is the \(1\times 1\) convolutional layer. Compared to the reconstruction-based methods, our method does not need a cumbersome network training process.

Table 3 Anomaly detection performance on MVTec AD (Image-level AUROC)
Table 4 Comparison of inference speed. Scores include image-level AUROC, pixel-level AUROC, and PRO-score
Fig. 4

Qualitative anomaly localization results of our method. From top to bottom: abnormal images, ground-truth, and anomaly score maps produced by our method

4.4 Ablation Study

We report ablation studies on the MVTec AD dataset to evaluate the impact of the different components of our method on its performance.

Table 5 Anomaly detection and localization performance of ProtoAD with features at different levels. Each tuple shows image-level AUROC and pixel-level AUROC

4.4.1 Feature Layer Selection

ResNet-like deep networks [38, 39] consist of several convolutional stages. The feature maps from different stages compose a feature hierarchy for an image. Since the deepest feature maps in the hierarchy are biased towards the ImageNet classification task, we only adopt the features at the low and middle hierarchy levels (stages 1–3) for anomaly detection and localization. Table 5 gives the performance achieved with the features from the different levels and their combination. The features from hierarchy level 2 achieve the best performance among the first three levels, and combining the three levels further improves the performance. Therefore, our method uses the combination of the first three feature levels as the patch feature.

4.4.2 Partition Selection from Clustering Hierarchy

FINCH is a hierarchical agglomerative clustering algorithm. It recursively merges clusters from the bottom up and provides a set of partitions organized in a hierarchy. Each successive partition merges clusters of the preceding one, so its number of clusters is smaller. Thus, we need to select one partition from the clustering hierarchy as the clustering result.

We report the performance of our method with different partitions, from the second (P2) to the sixth (P6) partition of FINCH, in Table 6 (see Table 1 in the supplementary material for more detailed results). We do not include the first partition because it has a huge number of clusters. The results in Table 6 indicate that the average performance decreases along the merging process. This may be because, as the number of clusters gets smaller, the clusters become less compact and less suitable for anomaly detection. On the other hand, if the number of clusters is too large, there are too many prototypes, and the inference time and storage increase rapidly. We also give the "Best" performance, which FINCH can achieve by selecting the best partition for each category separately; this is the upper bound our method can achieve. However, selecting the partition based on the average performance (from P2 to P6) or on the per-category performance (Best) is time-consuming and not suitable for real applications. In our method, we stop FINCH when the number of clusters drops below 10,000, use the final partition as the clustering result, and report its results in the last row of Table 6. This partition selection rule achieves performance very close to the best one with only a tenth of the clusters. Therefore, our method reaches a good trade-off between effectiveness and efficiency.
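
The selection rule can be sketched as follows, assuming the hierarchy of partitions is available as a list of (labels, n_clusters) pairs ordered from finest to coarsest; this interface is hypothetical.

```python
# Sketch of the partition-selection rule: keep the first (finest-to-coarsest)
# partition whose cluster count drops below the threshold of 10,000.
def select_partition(partitions, threshold=10_000):
    for labels, n_clusters in partitions:
        if n_clusters < threshold:
            return labels, n_clusters
    # Fallback: the coarsest partition if none drops below the threshold.
    return partitions[-1]
```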

Table 6 Anomaly detection and localization performance of ProtoAD with different FINCH partitions. Each tuple shows image-level AUROC, pixel-level AUROC, and average cluster numbers
Table 7 Anomaly detection and localization performance of ProtoAD with different clustering methods. Each tuple shows image-level AUROC and pixel-level AUROC
Table 8 Anomaly detection and localization performance on BTAD (Image-level and Pixel-level AUROC). The best results are bold-faced

4.4.3 FINCH vs. K-Means

We compare the FINCH clustering algorithm with k-means for prototype-based anomaly detection. In our method, we choose the partition at which FINCH stops, i.e., the first partition with fewer than 10,000 clusters, as the clustering result. For a fair comparison, we set k to 10,000 for k-means. The results in Table 7 indicate that the method based on FINCH (the third column) achieves better performance than the one based on k-means (the first column). Although k-means might achieve better performance by tuning k, doing so is time-consuming and not feasible for real applications.

4.4.4 Feature Normalization and Cosine Similarity

We also explore the importance of feature normalization for prototype-based anomaly detection. As shown in Table 7, k-means with Euclidean distance on the L2-normalized features (Norm L2) outperforms k-means with Euclidean distance on the original features (L2) in both anomaly detection and anomaly localization, with larger improvements in anomaly detection.

When the features are L2-normalized, cosine similarity and Euclidean distance are equivalent in the sense of nearest neighbor search. Therefore, we use cosine similarity both for clustering and for measuring the deviation of test patch features from the normal patch features. We further implement the cosine similarity with a \(1\times 1\) convolution and append it to the feature extraction network, so that inference can be performed in an end-to-end manner.

4.5 Results on BTAD

In Table 8, we report the results of our method on the BTAD dataset and compare them with those of the SOTA OOD-based methods (SPADE and PaDiM) and the approaches adopted in [32]. In [32], three reconstruction-based methods were evaluated: an auto-encoder (AE) with MSE loss, an auto-encoder with MSE and SSIM loss, and Vision-Transformer-based image anomaly detection and localization (VT-ADL). We report the image-level and pixel-level AUROC for each category and their average over all categories. For anomaly detection, ProtoAD achieves the best image-level AUROC. For anomaly localization, ProtoAD achieves the second-best pixel-level AUROC (97.0), very close to the best one (97.4) achieved by PaDiM. These results show our method's potential to generalize to new anomalous scenarios.

5 Conclusion

We propose ProtoAD, a new OOD-based image anomaly detection and localization method. First, a pre-trained neural network is used to extract features for image patches. Then, a non-parametric clustering algorithm learns the prototypes of the normal patch features. Finally, an image anomaly detection and localization network is constructed by appending to the feature extraction network the L2 feature normalization, a \(1\times 1\) convolutional layer, a channel max-pooling, and a subtraction operation. As a result, ProtoAD does not need a network training process and conducts anomaly detection and localization in an end-to-end manner. Experimental results on the MVTec AD and BTAD datasets show that ProtoAD achieves competitive performance compared to state-of-the-art methods. Furthermore, compared to other OOD-based methods, ProtoAD is more elegant and efficient, and compared to reconstruction-based methods, it does not need a cumbersome network training process. Therefore, it can better meet the requirements of real applications.