Mode Combinability: Exploring Convex Combinations of Permutation Aligned Models

We explore element-wise convex combinations of two permutation-aligned neural network parameter vectors $\Theta_A$ and $\Theta_B$ of size $d$. We conduct extensive experiments by examining various distributions of such model combinations parametrized by elements of the hypercube $[0, 1]^d$ and its vicinity. Our findings reveal that broad regions of the hypercube form surfaces of low loss values, indicating that the notion of linear mode connectivity extends to a more general phenomenon which we call mode combinability. We also make several novel observations regarding linear mode connectivity and model re-basin. We demonstrate a transitivity property: two models re-based to a common third model are also linear mode connected, and a robustness property: even with significant perturbations of the neuron matchings the resulting combinations continue to form a working model. Moreover, we analyze the functional and weight similarity of model combinations and show that such combinations are non-vacuous in the sense that there are significant functional differences between the resulting models.


Introduction
Linear mode connectivity (LMC) is an extensively studied phenomenon associated with deep neural networks. The possibility of interpolating trained models in the weight space while maintaining a relatively low loss value is an intriguing property, whose understanding is of both theoretical and practical interest. It offers many valuable insights into the structure of the loss surfaces and the underlying regularities of deep learning models (Entezari et al., 2022; ...).
The remainder of the paper is organised as follows. Section 2 presents an overview of related work. Section 3 provides the necessary preliminaries and formally introduces the element-wise combinations and the tools for our investigation. Section 4 presents the experimental exploration of element-wise convex combinations. In Section 5, we go beyond convex combinations and study parametrizations passing the boundary of the hypercube. This section also covers the transitivity property of LMC, a study of combining three models, and demonstrates model combinations that outperform the originals. In Section 6, we investigate functional and weight differences of model combinations, and explore the robustness of model alignment to perturbations. Finally, conclusions are presented in Section 7.

Linear mode connectivity and model re-basing
The first mode connectivity observations (Draxler et al., 2018; Garipov et al., 2018) concerned piece-wise linear mode connecting paths. Linear mode connectivity was first demonstrated by Frankle et al. (2020) in the context of the Lottery Ticket Hypothesis, focusing specifically on networks that were trained from the same initialization. Entezari et al. (2022) conjectured that SGD solutions are linearly mode connected even for networks that have been trained from different initializations. A crucial ingredient of their approach is that one should take into account the permutation symmetries of neural networks. The work of Ainsworth et al. (2023) then realized such permutations layer-wise with a simple greedy neuron matching algorithm family dubbed Git Re-Basin. The results in Ainsworth et al. (2023) were demonstrated only on networks using Layer Normalization (Ba et al., 2016) and on sufficiently wide networks (16x or 32x wider than the regular baseline). It was Jordan et al. (2022) who extended their results to Batch Normalized (Ioffe and Szegedy, 2015) networks and to networks that have been trained without normalization, thus significantly weakening the conditions concerning normalization that are necessary for the Git Re-Basin algorithm to work.

Model merging and knowledge aggregation in foundation models
An important related area is model merging (McMahan et al., 2017; Singh and Jaggi, 2020; Matena and Raffel, 2022), where the objective is to amalgamate the knowledge captured in distinct models by operating on the network weights. The findings related to LMC suggest an approach to this task that is, at first glance, unorthodox and surprisingly simple: taking the average of the model weights. The strangeness of this approach stems from the contrast between the simplicity of the addition operation performed in the weight space and the potential complexity of the task of aligning and extending the generalization capabilities of deep learning models. Recent advancements regarding changing, fine-tuning, or merging knowledge in foundation models (Hu et al., 2022; Dettmers et al., 2023; Meng et al., 2022) can also be interpreted as additive operations in the weight space. These approaches add low-rank residuals with the aim of maintaining or extending functionality.

Preliminaries and notation
Let $A$ and $B$ denote two models which have identical architectures with $d \in \mathbb{N}$ parameters but were trained from different random initializations and with different batch orderings. Let $\Theta_A \in \mathbb{R}^d$ and $\Theta_B \in \mathbb{R}^d$ denote the weights of the two models after being trained to convergence. With a slight abuse of notation, we identify models with their weight vectors and, e.g., write model $\Theta_A$ and model $\Theta_B$ for the models $A$ and $B$ with their parameters set to $\Theta_A$ and $\Theta_B$, respectively.
Linear mode connectivity and the approximate convexity of the loss basin are captured through the notion of the loss barrier.
Definition 3.1 (Loss barrier (Frankle et al., 2020)). Given two models $\Theta_A$ and $\Theta_B$ such that $L(\Theta_A) \approx L(\Theta_B)$, the loss barrier is defined as
$$\max_{\lambda \in [0, 1]} L\big((1 - \lambda)\,\Theta_A + \lambda\,\Theta_B\big) - \frac{1}{2}\big(L(\Theta_A) + L(\Theta_B)\big),$$
where $L$ is the utilized loss function.
Two networks are considered linear mode connected if the loss barrier between them is small (zero, or near zero). Ainsworth et al. (2023) demonstrate zero-barrier linear mode connectivity between $\Theta_A$ and $\pi(\Theta_B)$, where $\pi(\Theta_B)$ is a weight vector obtained by permuting the neurons of $\Theta_B$. The permutation leaves the network $\pi(\Theta_B)$ functionally identical to $\Theta_B$, while 're-basing' it to the basin of $\Theta_A$. This can be perceived as establishing an alignment between the two models; thus, we use the term 'aligned' for such model pairs $\Theta_A$ and $\pi(\Theta_B)$. The permutation is obtained by an iterative greedy matching algorithm of which they present three variants (matching activations, matching weights, or learning with a straight-through estimator). We treat these algorithms as black boxes. For completeness, we reiterate the weight matching algorithm in Appendix C, but refer to (Ainsworth et al., 2023, Section 3) for further details.
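As an illustration of the loss barrier, the following minimal sketch evaluates it on a grid of interpolation coefficients. It assumes the common form of the barrier (worst loss along the linear path minus the mean loss of the two endpoints); the names `interpolate`, `loss_barrier`, and `toy_loss` are hypothetical, and plain Python lists stand in for flattened weight vectors.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], lam: float) -> List[float]:
    """Point (1 - lam) * theta_a + lam * theta_b on the linear path."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    loss: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    num_points: int = 25,
) -> float:
    """Worst loss on an equidistant grid of the path minus the mean endpoint loss."""
    lams = [i / (num_points - 1) for i in range(num_points)]
    worst = max(loss(interpolate(theta_a, theta_b, lam)) for lam in lams)
    return worst - 0.5 * (loss(theta_a) + loss(theta_b))


def toy_loss(theta: Sequence[float]) -> float:
    """Toy double-well loss (an assumption for illustration): minima at +1 and -1."""
    return sum((x * x - 1.0) ** 2 for x in theta) / len(theta)


theta_a = [1.0, 1.0]    # sits in one minimum
theta_b = [-1.0, -1.0]  # sits in the other minimum
barrier = loss_barrier(toy_loss, theta_a, theta_b)  # the path crosses the ridge at 0
```

In this toy setting the two endpoints are not linearly mode connected: the midpoint of the path lies on the ridge between the two wells, so the barrier is large relative to the endpoint losses.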
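In the present work, we extend the space of possible model combinations from convex combinations to element-wise convex combinations.
Definition 3.2 (Element-wise convex combination). Given two models $\Theta_A$ and $\Theta_B$ and a coefficient vector $v \in [0, 1]^d$, the element-wise convex combination of the two models is
$$\Theta_v = (\mathbf{1} - v) \odot \Theta_A + v \odot \Theta_B,$$
where $\mathbf{1}$ is the all-ones vector, and $\odot$ is the element-wise product.
So for every pair of corresponding elements $(\Theta_A^{(i)}, \Theta_B^{(i)})$, we can choose the coefficient $v_i$ of the convex combination independently. (With this notation, the experiments in Ainsworth et al. (2023) belong to the special case when $v = \boldsymbol{\alpha} = (\alpha, \ldots, \alpha)$ for an $\alpha \in [0, 1]$, and $\Theta_B$ is replaced by $\pi(\Theta_B)$, where $\pi$ is the permutation found by the model aligning procedure.) The element-wise convex combinations of the model parameters $\Theta_A$ and $\Theta_B$ form a hyperrectangle, which is thus parametrized by the hypercube.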
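A minimal sketch of an element-wise convex combination follows. It assumes the convention that coefficient 0 selects the weight of model A and coefficient 1 selects the weight of model B, which matches how the later sections use the coefficients; `combine` and the example vectors are hypothetical.

```python
import random
from typing import List, Sequence


def combine(theta_a: Sequence[float], theta_b: Sequence[float], v: Sequence[float]) -> List[float]:
    """Element-wise combination (1 - v_i) * a_i + v_i * b_i with one coefficient per weight."""
    return [(1.0 - vi) * a + vi * b for a, b, vi in zip(theta_a, theta_b, v)]


theta_a = [0.2, -1.0, 0.5, 3.0]
theta_b = [0.4, 1.0, 0.5, -1.0]  # assumed to be already permutation-aligned to theta_a

rng = random.Random(0)
v = [rng.random() for _ in theta_a]   # an independent coefficient for every weight
theta_v = combine(theta_a, theta_b, v)

# Ordinary linear interpolation is the special case of a constant coefficient vector.
theta_mid = combine(theta_a, theta_b, [0.5] * len(theta_a))
```

Every coordinate of `theta_v` lies between the two corresponding original weights, which is why the set of all such combinations is a hyperrectangle.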
We often deal with samples from model distributions supported on this hyperrectangle (or beyond), so we extend the definition of the loss barrier accordingly. For a sample from a model combination distribution, we define the empirical loss barrier by comparing the performance of the worst-performing model in the sample with the average performance of the original models from which the samples were derived.
Definition 3.3 (Empirical loss barrier on a sample). Given two models $\Theta_A$ and $\Theta_B$ such that $L(\Theta_A) \approx L(\Theta_B)$, and a sample $\{v_1, v_2, \ldots, v_n\}$ of size $n \in \mathbb{N}$ from a distribution $p(v)$ of model combination parametrizations supported on $\mathbb{R}^d$, the empirical loss barrier of this sample is defined as
$$\max_{i \in \{1, \ldots, n\}} L(\Theta_{v_i}) - \frac{1}{2}\big(L(\Theta_A) + L(\Theta_B)\big),$$
where $L$ is the utilized loss function, and $\Theta_{v_i}$ is the combined model formed with the element-wise combination coefficient vector $v_i$.
We define the empirical accuracy barrier on a sample of model combinations analogously, by subtracting the accuracy of the worst-performing model in the sample from the average accuracy of the two original models.
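The empirical loss barrier on a sample can be sketched as follows, assuming the 'worst sampled model minus the average of the two originals' form described above. The helper names and the toy quadratic loss are hypothetical; the example only illustrates that a sample of hypercube vertices can have a larger barrier than a sample along the diagonal.

```python
from typing import Callable, List, Sequence


def combine(theta_a: Sequence[float], theta_b: Sequence[float], v: Sequence[float]) -> List[float]:
    """Element-wise combination; coefficient 0 picks theta_a, 1 picks theta_b (assumed convention)."""
    return [(1.0 - vi) * a + vi * b for a, b, vi in zip(theta_a, theta_b, v)]


def empirical_loss_barrier(
    loss: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    coefficient_sample: Sequence[Sequence[float]],
) -> float:
    """Loss of the worst sampled combination minus the mean loss of the two originals."""
    worst = max(loss(combine(theta_a, theta_b, v)) for v in coefficient_sample)
    return worst - 0.5 * (loss(theta_a) + loss(theta_b))


def toy_loss(theta: Sequence[float]) -> float:
    """Toy quadratic bowl, used only to make the example runnable."""
    return sum(x * x for x in theta)


theta_a = [1.0, 0.0]
theta_b = [0.0, 1.0]

diagonal_sample = [[i / 4, i / 4] for i in range(5)]                  # linear interpolation
vertex_sample = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]      # hypercube vertices

barrier_diagonal = empirical_loss_barrier(toy_loss, theta_a, theta_b, diagonal_sample)
barrier_vertices = empirical_loss_barrier(toy_loss, theta_a, theta_b, vertex_sample)
```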

Model combinations in the hyperrectangle
A complete empirical exploration of element-wise convex combinations is infeasible even for small networks. Thus, we instead select notable yet tractable distributions of model combinations and investigate how the combined models behave under them.

The general experiment setup
We conduct our experiments with two different vision architectures: ResNet-20 (He et al., 2016) and a simple non-residual convolutional network called Tiny-10 (Kornblith et al., 2019). We train both networks on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), starting from different initializations and using different batch orders. We follow rather standard training methodologies which (along with the precise architectural descriptions) are detailed in Appendix A. For a more compact exposition, the figures in the main text of this paper mostly correspond to ResNet-20, and the results for the Tiny-10 network can be found in Appendix B.
We follow the filter weight matching algorithm presented in Ainsworth et al. (2023) to obtain an aligning permutation $\pi$. (The three algorithms presented in their paper are very close in performance.) The baseline model width is set in such a way that the first convolutional layer has 16 filters. For our experiments, we use models with a width multiplier of 32 if not noted otherwise, a width where model re-basing already works in a stable manner.
For the sake of comparison we often show the results for combinations between original ('naïve') models, not just between aligned ('permuted') models.

Sampling from the unit hypercube
In this section, we look at different types of distributions on the unit cube $[0, 1]^d$. We use samples from these distributions as parametrizations of element-wise convex model combinations. Most of the distributions we use are designed to have a real-valued parameter that serves a role similar to that of an 'interpolation coefficient'. Figure 1 schematically illustrates some of the model combinations presented in this section.

Sampling from the uniform distribution on the unit hypercube
The first distribution we experiment with is the uniform distribution on the unit cube (corresponding to model combinations where each weight gets its own interpolation coefficient independently and uniformly). In Figure 2 we depict the results for coefficients sampled uniformly from $[0.5 - s, 0.5 + s]^d$ for various $s \in [0, 0.5]$ (note that $s = 0.5$ corresponds to uniform sampling from $[0, 1]^d$).
We observe that all the sampled model combinations attain high accuracy and low loss values, with empirical loss barrier 0.044 and empirical accuracy barrier 0.012.
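The sampling scheme of this subsection can be sketched as below; `sample_uniform_coefficients` is a hypothetical helper, and the dimension 1000 is chosen only for illustration.

```python
import random
from typing import List


def sample_uniform_coefficients(d: int, s: float, rng: random.Random) -> List[float]:
    """Draw d independent coefficients uniformly from [0.5 - s, 0.5 + s]."""
    if not 0.0 <= s <= 0.5:
        raise ValueError("s must lie in [0, 0.5]")
    return [rng.uniform(0.5 - s, 0.5 + s) for _ in range(d)]


rng = random.Random(0)
v_narrow = sample_uniform_coefficients(1000, 0.1, rng)  # small cube around the midpoint
v_full = sample_uniform_coefficients(1000, 0.5, rng)    # the whole unit hypercube
v_midpoint = sample_uniform_coefficients(5, 0.0, rng)   # degenerates to plain averaging
```

With $s = 0$ every coefficient equals 0.5, so the combination is the plain average of the two models; increasing $s$ gradually decouples the coefficients of the individual weights.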

Extending the interpolation: uniform distribution on a smaller cube
We now extend the single-parameter linear interpolation with an approach that still has a single parameter and retains the two original models as endpoints; however, it differs in that it samples from smaller cubes when interpolating. Here, we again have a parameter $\lambda \in [0, 1]$, which controls which subcube of $[0, 1]^d$ we sample from. We plot the loss and accuracy of the combined models in Figure 3. We observe low barrier values and high accuracy, with empirical loss barrier 0.041 and empirical accuracy barrier 0.011.

Uniform distribution on the intersection of the unit cube and a hyperplane
Here we present another natural method of interpolating between the two models, by sweeping through the hyperrectangle with a hyperplane.
We want to sample uniformly from the intersection of the hyperplane $\{x \in \mathbb{R}^d : \frac{1}{d}\sum_{i=1}^{d} x_i = \alpha\}$ and $[0, 1]^d$, for some parameter $\alpha \in [0, 1]$. This polytope, denoted by $P(d, \alpha)$, is hard to sample from. Hence, we instead sample from another distribution that is a good approximation of it.
Taking $d$ i.i.d. instances of any one-dimensional exponential distribution $f(x; \lambda)$, the density function of the joint distribution is constant on any hyperplane $\frac{1}{d}\sum_{i=1}^{d} x_i = \alpha$. Hence, over $[0, 1]^d$, the same is true for $\tilde{f}(x; \lambda)$, the same distribution truncated to $[0, 1]$. This suggests the following procedure: when given the task of uniformly sampling from $P(d, \alpha)$, we instead choose a $\lambda$ such that $\mathbb{E}[\tilde{f}(x; \lambda)] = \alpha$, and pick $d$ i.i.d. samples from $\tilde{f}(x; \lambda)$. This procedure gives identical results to "mishearing" $\alpha$, and sampling from $P(d, \alpha + \varepsilon)$ instead of the required $P(d, \alpha)$, for some random noise $\varepsilon$. The central limit theorem guarantees that the magnitude of this noise $\varepsilon$ is $O(1/\sqrt{d})$ with high probability. The appropriate $\tilde{f}(x; \lambda)$ can be found by numerical optimization; incidentally, it is the maximum entropy distribution with support $[0, 1]$ and mean $\alpha$.
The results are shown in Figure 4. Again, we see model combinations that expand the range of well-performing models (empirical loss barrier: 0.037, empirical accuracy barrier: 0.012).
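The approximate hyperplane sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the truncated density is parametrized as proportional to $\exp(\text{rate} \cdot x)$ on $[0, 1]$ with a rate of either sign, the rate is found by bisection (one possible choice of numerical optimization), and samples are drawn by inverting the CDF. All function names are hypothetical.

```python
import math
import random
from typing import List


def truncated_exp_mean(rate: float) -> float:
    """Mean of the density proportional to exp(rate * x) on [0, 1]."""
    if abs(rate) < 1e-6:
        return 0.5 + rate / 12.0  # first-order Taylor expansion around rate = 0
    return 1.0 / (-math.expm1(-rate)) - 1.0 / rate


def solve_rate(alpha: float, bound: float = 500.0) -> float:
    """Find the rate whose truncated density has mean alpha (the mean is increasing in rate)."""
    if not truncated_exp_mean(-bound) < alpha < truncated_exp_mean(bound):
        raise ValueError("alpha is too close to 0 or 1 for this sketch")
    lo, hi = -bound, bound
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if truncated_exp_mean(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_coefficients(d: int, alpha: float, rng: random.Random) -> List[float]:
    """Draw d i.i.d. coefficients in [0, 1] whose expected mean is alpha."""
    rate = solve_rate(alpha)
    if abs(rate) < 1e-12:
        return [rng.random() for _ in range(d)]  # rate = 0 is the uniform distribution
    scale = math.expm1(rate)
    # Inverse CDF: F(x) = expm1(rate * x) / expm1(rate).
    return [math.log1p(rng.random() * scale) / rate for _ in range(d)]


rng = random.Random(0)
v = sample_coefficients(20000, 0.3, rng)
sample_mean = sum(v) / len(v)  # close to 0.3, with fluctuations of order 1 / sqrt(d)
```

The realized mean of the coefficients deviates from $\alpha$ only by a small random amount, which corresponds to the "misheard" $\alpha + \varepsilon$ in the text.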

Sampling the vertices: Bernoulli distribution
Next, we sample the coordinates of the vector $v$ independently from a Bernoulli distribution with parameter $p \in [0, 1]$. This corresponds to model combinations where each weight of the combined network is equal to the corresponding weight of either $\Theta_A$ or $\Theta_B$. Figure 5 presents the results for different values of the Bernoulli parameter, shown on the horizontal axis. We again observe that all model combinations result in high-performing models (empirical loss barrier: 0.168, empirical accuracy barrier: 0.042).
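A minimal sketch of the Bernoulli vertex sampling follows. It assumes that coefficient 1 selects the weight of the (aligned) model B, so that $p$ is the probability of taking a weight from B; the helper names and example vectors are hypothetical.

```python
import random
from typing import List, Sequence


def sample_bernoulli_coefficients(d: int, p: float, rng: random.Random) -> List[float]:
    """Draw d independent 0/1 coefficients, each equal to 1 with probability p."""
    return [1.0 if rng.random() < p else 0.0 for _ in range(d)]


def combine(theta_a: Sequence[float], theta_b: Sequence[float], v: Sequence[float]) -> List[float]:
    """Element-wise combination; coefficient 0 picks theta_a, 1 picks theta_b (assumed convention)."""
    return [(1.0 - vi) * a + vi * b for a, b, vi in zip(theta_a, theta_b, v)]


theta_a = [0.2, -1.0, 0.5, 3.0]
theta_b = [0.4, 1.0, -0.5, -1.0]  # assumed to be already aligned to theta_a

rng = random.Random(0)
v = sample_bernoulli_coefficients(len(theta_a), 0.5, rng)
theta_v = combine(theta_a, theta_b, v)  # every weight is copied from one of the two models
```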

Aligned models are stitchable with identity stitching
Another notable set of hypercube vertices corresponds to model stitching (Lenc and Vedaldi, 2019; Csiszárik et al., 2021; Bansal et al., 2021), where one wants to combine the lower part of model $\Theta_A$ with the upper part of model $\Theta_B$, with a stitching map making the conversion between the two representation spaces. The stitching map is usually constrained to have low complexity, e.g., to be an affine transformation. In our parametrization, for a given layer $l \in \{1, \ldots, L\}$ in an $L$-layered network, we set the coefficients to 0 for each layer $i \le l$ and to 1 for layers $i > l$. Such coefficients correspond to model stitching of model $\pi(\Theta_B)$ and $\Theta_A$ with the identity stitching map. Figure 6 depicts the results. We observe that all resulting configurations attain high accuracy and low loss. Therefore, while it is necessary to have networks wide enough for the re-basin algorithm to work, under that premise a successful stitching between $\Theta_A$ and $\Theta_B$ (note the absence of $\pi$) can be achieved using only permutation matrices, which are functions of much lower complexity than arbitrary affine transformations.
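The coefficient vector for identity stitching at a given layer can be sketched as follows. The function `stitching_coefficients` and the representation of a network as a list of per-layer parameter counts are assumptions made for illustration.

```python
from typing import List, Sequence


def stitching_coefficients(layer_sizes: Sequence[int], stitch_layer: int) -> List[float]:
    """Coefficient 0 for all weights in layers 1..stitch_layer, 1 for all later layers.

    layer_sizes[i - 1] is the number of parameters in layer i (layers are 1-indexed).
    """
    coefficients: List[float] = []
    for index, size in enumerate(layer_sizes, start=1):
        value = 0.0 if index <= stitch_layer else 1.0
        coefficients.extend([value] * size)
    return coefficients


layer_sizes = [2, 3, 1]                       # a toy three-layer network
v_stitch = stitching_coefficients(layer_sizes, 1)
# v_stitch keeps layer 1 from model A and takes layers 2 and 3 from the aligned model B.
```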

Discussion
Figure 7 presents the role of network width for several of the above sampling schemes. (Note that while each combination method presented in this figure involves a real parameter that we use to interpolate, the underlying sampling mechanisms differ.) We find that for networks that are wide enough, the mixed models of $\pi(\Theta_B)$ and $\Theta_A$ have high accuracy and low loss values. We also observe that sufficient network width remains a necessary condition for model combination to succeed.
In conclusion, our results reveal that the implied volume where the combination 'works' is vastly larger than the line segment connecting $\pi(\Theta_B)$ and $\Theta_A$, the main interest of former linear mode connectivity research.

Searching for parts of the hyperrectangle that 'do not work'
All our model combination efforts so far resulted in 'working' models. We now aim to purposely construct points in the hypercube which correspond to model combinations that we expect to have high loss values, i.e., that 'do not work'. We intend to design convex combinations that disrupt the network functionality, keeping in mind either the inner structure of convolutional filter pairs, or their supposed interplay or covariance.

Always choosing the smaller or the larger weight
We find that when we take, for each pair of weights $(w_a, w_b)$, the one whose absolute value is smaller, the resulting 'Min' model has low accuracy. Figure 8 shows linear interpolation between the 'Min' model and the 'Max' model, where for each weight pair we choose the one with the larger absolute value.
This serves as a counterexample to the conjecture that any point of the hypercube leads to a well-performing combination.
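The 'Min' and 'Max' models are vertices of the hypercube and can be sketched as below. The helper names are hypothetical, and ties in absolute value are broken in favour of model A for 'Min' and model B for 'Max', which is an arbitrary choice of this sketch.

```python
from typing import List, Sequence, Tuple


def min_max_coefficients(theta_a: Sequence[float], theta_b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """0/1 coefficient vectors selecting the smaller ('Min') or larger ('Max') magnitude weight."""
    v_min = [0.0 if abs(a) <= abs(b) else 1.0 for a, b in zip(theta_a, theta_b)]
    v_max = [1.0 - c for c in v_min]
    return v_min, v_max


def combine(theta_a: Sequence[float], theta_b: Sequence[float], v: Sequence[float]) -> List[float]:
    """Element-wise combination; coefficient 0 picks theta_a, 1 picks theta_b (assumed convention)."""
    return [(1.0 - vi) * a + vi * b for a, b, vi in zip(theta_a, theta_b, v)]


theta_a = [0.2, -1.0, 0.5]
theta_b = [-0.4, 0.3, -2.0]  # assumed to be already aligned to theta_a

v_min, v_max = min_max_coefficients(theta_a, theta_b)
theta_min = combine(theta_a, theta_b, v_min)
theta_max = combine(theta_a, theta_b, v_max)
```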

Model combinations beyond the hyperrectangle
While the hyperrectangle spanned by $\Theta_A$ and $\pi(\Theta_B)$ provides a natural space of possibilities for model combinations, we do not suppose that the faces of this box serve as a hard boundary for the loss basin. On the contrary, it is reasonable to anticipate that the low loss region expands beyond the extremal points. In this section, we test this hypothesis by exploring such combinations.

Linear extrapolation
First, as perhaps the simplest approach, we extend the range of the linear interpolation coefficient $\lambda \in \mathbb{R}$ from the unit interval to $[-1, 2]$ (note that the subinterval $[0, 1]$ corresponds to the original linear interpolation between model A and model B). Figure 9 depicts the results for ResNet-20. We can observe that the extrapolated models perform well within certain segments of the extended range. Specifically, within the intervals $\lambda \in [-0.25, 0]$ and $\lambda \in [1.0, 1.25]$, the train loss and accuracy of the resulting model combinations are virtually identical to those of the originals. On the test set, the performance starts to degrade slowly as we pass the boundaries of the unit interval, still resulting in considerable segments of well-generalizing models. We can also observe that with permuted models the performance degrades significantly more slowly than with their naïve counterparts. This suggests that aligned weight pairs result in a less disruptive perturbation regarding functionality.

Sampling uniformly from Θ A centered hyperrectangles
We now turn our attention to element-wise model combinations which extrapolate. As an instance of such an experiment, we shift the uniform sampling from the hyperrectangle spanned by $\Theta_A$ and $\pi(\Theta_B)$ to be centered at $\Theta_A$. This corresponds to model combinations where negative coefficients are also allowed, i.e., we choose our combining coefficients from $[-s, s]^d$ for a given $s \in \mathbb{R}$. This way, we obtain a box that corresponds to perturbed versions of $\Theta_A$ obtained by applying an additive uniform noise scaled according to the weight differences resulting from the alignment process. (Informally, if a weight is matched to a weight with a similar value, that corresponds to a short edge of the box, and thus results in a small uniform perturbation of that weight; the case of large differences plays out analogously.) In this way, we connect the magnitude of the applied noise to the 'degree of determination' of weight values resulting from the matching procedure. Figure 10 depicts the results for various $s \in [0, 0.5]$. We observe that the model combinations of aligned models work well for $s \in [0, 0.25]$.
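Sampling from the box centered at model A can be sketched as follows, assuming the element-wise combination is written as $a_i + v_i (b_i - a_i)$ with $v_i$ drawn uniformly from $[-s, s]$. The function name and example vectors are hypothetical.

```python
import random
from typing import List, Sequence


def sample_centered_box(
    theta_a: Sequence[float],
    theta_b_aligned: Sequence[float],
    s: float,
    rng: random.Random,
) -> List[float]:
    """Perturb each weight of theta_a by uniform noise scaled by the aligned weight difference."""
    return [a + rng.uniform(-s, s) * (b - a) for a, b in zip(theta_a, theta_b_aligned)]


theta_a = [0.2, -1.0, 0.5, 3.0]
theta_b_aligned = [0.25, 1.0, 0.5, -1.0]  # hypothetical aligned weights

rng = random.Random(0)
s = 0.25
theta_perturbed = sample_centered_box(theta_a, theta_b_aligned, s, rng)
# A well-matched pair (0.2 vs 0.25) is barely perturbed; an identical pair is not perturbed at all.
```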

Transitivity of linear mode connectivity
The question naturally arises whether two models $\pi_B(\Theta_B)$ and $\pi_C(\Theta_C)$ re-based to a third model $\Theta_A$ are also linearly interpolable. We answer this question affirmatively. Figure 11a depicts such configurations, and we can observe high accuracy and low loss values for these combinations.

Combining three models
We also investigate the convex combinations of the three models. Figure 11b depicts such convex combinations written in the form
$$\Theta_A + \lambda_B\big(\pi_B(\Theta_B) - \Theta_A\big) + \lambda_C\big(\pi_C(\Theta_C) - \Theta_A\big),$$
where $\lambda_B, \lambda_C \in \mathbb{R}$ are the interpolation coefficients controlling the extent to which we step from $\Theta_A$ in the direction of $\pi_B(\Theta_B)$ or $\pi_C(\Theta_C)$. The figure shows that the test accuracy of even the worst-performing model is still above 0.91, which we can consider a well-performing model.
There are more observations to be made in Figure 11b. First, combined models can surpass the performance of the originals; second, in the depicted example the best-performing model is at an interior point of the depicted triangle.
Similar observations of better-performing interpolated models have been made in Ainsworth et al. (2023, see Figure 5), but only in the case of split-data training (merging two models trained on two disjoint subsets of CIFAR-10), or in the case of the much simpler MNIST dataset. In our case, the training dataset is the same for all source models, and the CIFAR-10 classification task is much more complex than that of MNIST. We observed the same behavior for ResNet-20 models of different width multipliers ranging from 10 to 40. However, in many cases, the best-performing model was on an edge of the triangle. Overall, the above observations suggest that even slight model differences, originating only from the different weight initializations, can be exploited to obtain better-performing combined models, and that it is worth pursuing the combination of multiple models.
Also note that in this case, we interpolate using only two real parameters. This naturally leads to the question of whether more sophisticated (e.g., well-chosen element-wise) model combinations could result in even higher performance. In light of the experiments presented in this paper, we find this an intriguing research direction. However, this question is beyond the scope of this article, and we leave it for future work.
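The two-parameter combinations of three models discussed in this section can be sketched as below, assuming the form "start at model A and step towards each aligned model". The function names and the grid over the triangle $\lambda_B, \lambda_C \ge 0$, $\lambda_B + \lambda_C \le 1$ are hypothetical choices for illustration.

```python
from typing import List, Sequence, Tuple


def combine_three(
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    theta_c: Sequence[float],
    lam_b: float,
    lam_c: float,
) -> List[float]:
    """theta_a + lam_b * (theta_b - theta_a) + lam_c * (theta_c - theta_a), element by element."""
    return [a + lam_b * (b - a) + lam_c * (c - a) for a, b, c in zip(theta_a, theta_b, theta_c)]


def triangle_grid(steps: int) -> List[Tuple[float, float]]:
    """All (lam_b, lam_c) on a regular grid with lam_b, lam_c >= 0 and lam_b + lam_c <= 1."""
    return [(i / steps, j / steps) for i in range(steps + 1) for j in range(steps + 1 - i)]


theta_a = [0.0, 1.0]
theta_b = [1.0, 1.0]   # assumed to be aligned to theta_a
theta_c = [0.0, -2.0]  # assumed to be aligned to theta_a

grid = triangle_grid(4)
models = [combine_three(theta_a, theta_b, theta_c, lb, lc) for lb, lc in grid]
# Each entry of `models` would be evaluated on the test set to fill one cell of a heatmap.
```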

Functional and weight dissimilarity of model combinations
Given the abundance of well-performing model combinations, it is reasonable to ask whether these combinations are vacuous in the sense that the original models $\Theta_A$ and $\pi(\Theta_B)$ could be nearly identical, rendering the space of combinations effectively empty. While the above examples of model combinations surpassing the performance of the originals already suggest the contrary, we are not aware of any work in the literature that looks 'under the hood' to highlight such differences in a more detailed manner. In this section, we conduct several experiments to demonstrate that there are indeed significant functional differences, both between the original models and between their combinations.

Functional difference of model combinations
Do the original models or their combinations output the same labels? First, we investigate network-level functional similarity by comparing the predictions of different model combinations. We use the predicted labels as the basis of comparison, as these serve as an easily interpretable aggregation of model functionality. More precisely, to test a model combination for its similarity to the original models $\Theta_A$ and $\Theta_B$, we take the test set of the CIFAR-10 dataset and count how many of the data points fall into each of the following four categories: the label predicted by the combined model matches the prediction of $\Theta_A$ only, $\Theta_B$ only, neither, or both. In each category, the lighter and darker shades depict whether the prediction of the model matches the true label ('correct') or not ('wrong').
Figure 12 depicts the results for different model combinations. For each category we also show whether the predicted label matches the true label. We can observe that in the case of the linear interpolation, Bernoulli, cube intersect, and plane intersect combinations, the two endpoints (corresponding precisely to $\Theta_A$ and $\pi(\Theta_B)$) show a significant functional difference, i.e., there is a substantial fraction of the test set for which they give differing labels. By tracking the variations along the horizontal axis in these figures, we can observe that these differences change gradually when interpolating between the endpoints. Naturally, we see a different picture for uniform combinations; in this case, both endpoints correspond to an already mixed model.
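The four-category comparison described above can be sketched as follows. The function name, the category strings, and the tiny example predictions are hypothetical; in practice the inputs would be the predicted labels on the CIFAR-10 test set.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_breakdown(
    pred_combined: Sequence[int],
    pred_a: Sequence[int],
    pred_b: Sequence[int],
    labels: Sequence[int],
) -> "Counter[Tuple[str, str]]":
    """Count data points by (agreement category, correctness of the combined model)."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for p, pa, pb, y in zip(pred_combined, pred_a, pred_b, labels):
        if p == pa and p == pb:
            category = "both"
        elif p == pa:
            category = "A only"
        elif p == pb:
            category = "B only"
        else:
            category = "neither"
        counts[(category, "correct" if p == y else "wrong")] += 1
    return counts


pred_combined = [0, 1, 2, 3, 4, 5]
pred_a = [0, 1, 9, 3, 7, 5]
pred_b = [0, 8, 2, 3, 6, 5]
labels = [0, 1, 2, 9, 4, 5]

breakdown = agreement_breakdown(pred_combined, pred_a, pred_b, labels)
```

Stacking these counts for a range of interpolation parameters would give one column per combined model, in the spirit of the stacked plots in Figure 12.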

Edge lengths of the hyperrectangles
How much do the weights of the two original models differ? We now consider the element-wise weight differences of aligned model pairs $\Theta_A$ and $\pi(\Theta_B)$. Figure 13 shows the distribution of edge lengths (i.e., absolute differences of aligned weight pairs) for the hyperrectangles spanned by $\Theta_A$ and $\pi(\Theta_B)$. On the one hand, we can observe that there is a wide variety of lengths, and a significant portion of them differ from zero. On the other hand, there is a large portion of near-zero lengths. We attribute this fact to the general sparsity property of deep neural networks, which implies a large portion of low weight values. (Also, we used weight decay in our training procedure, which might further reinforce this feature.)

Robustness to alignment perturbations
Are all the filter pairings important? To measure the robustness (or conversely, the vulnerability) of filter pair matchings regarding model performance, we deliberately corrupt parts of the alignment and measure how much this affects the accuracy of the combined model. First, for a given layer we order the filter pairs by their activation correlation in decreasing order. Then, we randomly re-match the $k$ lowest-scoring pairs by taking a derangement $\pi_r$ (a random permutation where no element remains in its original position) on the $k$ filters. A specific property of this perturbation is that it maintains the overall magnitude and the distribution of activations of a layer while disturbing the filter pairings for the interpolation. To measure combination performance we interpolate between the resulting $\pi_r(\pi(\Theta_B))$ and $\Theta_A$ for a given layer and report accuracy barriers. (Note that permutations of a single model preserve perfect functional similarity, as the subsequent layer's input ordering is adjusted appropriately when re-basing. However, when we combine the perturbed model with another one, it is appropriate to measure how the perturbation affects model combination.) Figure 14 depicts the loss and accuracy barrier for the resulting model pairs with values of $k$ ranging between 2 and the number of filters $N$ in the given layer. We observe strong robustness to the alignment perturbations. For each layer, a large portion of the alignment can be randomly interchanged. Remarkably, for Layers 1, 6, 7, and 8, even a complete derangement ($k = N$) leaves the combined networks well-performing. For the other layers, we can observe mild degradation of performance for up to half of the filters, beyond which the deterioration starts to accelerate.
With a derangement $\pi_r(\Theta_B)$, we replace an aligned filter with a randomly chosen other one. When interpolating, the filters of $\Theta_A$ are thus exposed to a random noise distributed as the corresponding filter weights of $\Theta_B$. (We hypothesize that this distribution closely mirrors the distribution of filter weights in $\Theta_A$ itself.) We attribute the above remarkable observations of robustness to the fact that this distribution has a large probability mass on near-zero weights, or on weights whose difference is smaller than the supposed extent of noise resilience of filter functionality.
In this experiment, we ordered the filter pairs according to activation correlations. We speculate that filters that are more crucial for the overall network functionality might appear in both networks more unambiguously, can be identified more clearly in the process of re-basing, and, as a pair, exhibit larger activation correlations. The disruption of these important filters is what we might observe on the right-hand side of these figures. One might consider including further filter importance measures and model pruning in an analysis, but such explorations extend beyond the scope of the present paper, and we leave them for future work.
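The re-matching perturbation described above can be sketched as follows. This is a minimal illustration under assumptions: a layer's alignment is represented as a list `matching`, where `matching[i]` is the index of the filter of model B paired with filter `i` of model A, and the derangement is drawn by rejection sampling. All names and example values are hypothetical.

```python
import random
from typing import List, Sequence


def random_derangement(n: int, rng: random.Random) -> List[int]:
    """Uniformly random permutation of range(n) with no fixed point (rejection sampling)."""
    if n < 2:
        raise ValueError("a derangement needs at least two elements")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def perturb_matching(
    matching: Sequence[int],
    correlations: Sequence[float],
    k: int,
    rng: random.Random,
) -> List[int]:
    """Randomly re-match the k filter pairs with the lowest activation correlation."""
    order = sorted(range(len(matching)), key=lambda i: correlations[i])
    lowest = order[:k]                      # positions of the k lowest-scoring pairs
    derangement = random_derangement(k, rng)
    perturbed = list(matching)
    for position, source in enumerate(derangement):
        perturbed[lowest[position]] = matching[lowest[source]]
    return perturbed


matching = [2, 0, 3, 1, 4]                 # hypothetical alignment of five filters
correlations = [0.9, 0.1, 0.5, 0.2, 0.8]   # hypothetical activation correlations per pair

rng = random.Random(0)
perturbed = perturb_matching(matching, correlations, 3, rng)
```

The perturbed matching is still a valid permutation: only the partners of the selected pairs are exchanged among themselves, so the set of filters used from model B is unchanged.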

Conclusion
We showcased numerous model combination methods, indicating that the loss basin studied by previous LMC research is much broader than formerly identified. This demonstrates that LMC is a special case of a more general model combinability phenomenon. We investigated model combinations in terms of performance. We examined weight and functional similarities on several granularities and identified a general robustness and noise resilience property.
Model combinations in a more general sense, going back to, e.g., classical ensemble learning, have always been an important aspect of practical and theoretical machine learning. Deep learning models exhibit immense complexity, yet model combination in the weight space is developing as a promising direction for knowledge aggregation and for extending generalization capabilities. Uncovering and exploring novel ways of model combination, and extending work on the symmetries and regularities in deep learning models, help tame such complexities and foster the development of methods that exploit model combinability.

Figure 1: Schematic figures of combinations of a 'blue' and an 'orange' model. Colored rectangles depict the layers of the combined networks, in which each square cell represents a weight obtained by mixing the corresponding weights of the two models. The weight value is represented by the brightness of the cell, and the mixing is illustrated by interpolating between the colors of the blue and the orange models. For this illustration, the weights are distributed uniformly.

Figure 2: Performance of ResNet-20 model combinations corresponding to uniform sampling of element-wise coefficients between $[0.5 - s, 0.5 + s]$. Horizontal axes denote $s$ (with 25 values chosen equidistantly), vertical axes denote performance in terms of (a) loss and (b) accuracy.

Figure 5: Performance of ResNet-20 model combinations for the Bernoulli distribution on the cube. Horizontal axes denote the Bernoulli parameter $p$ (with 25 values chosen equidistantly).

Figure 6: ResNet-20 identity stitching. Horizontal axes denote the layer where the stitching is realised, vertical axes denote performance in terms of (a) loss and (b) accuracy.

Figure 7: Different model combinations and their performance with different network widths. The network is ResNet-20. Performance is plotted in terms of (a) loss barrier and (b) accuracy barrier (i.e., the worst-performing model is depicted for each sampling scheme and width multiplier).

Figure 8: Linear interpolation between the 'Max' model and the 'Min' model. The model was ResNet-20. Horizontal axes denote the $\lambda$ interpolation coefficient, vertical axes denote performance in terms of (a) loss and (b) accuracy.

Figure 9: Linear extrapolation with models trained on ResNet-20. Horizontal axes denote the $\lambda$ interpolation coefficient, vertical axes denote performance in terms of (a) loss and (b) accuracy.

Figure 11: (a) Interpolating between two models $\pi_B(\Theta_B)$ and $\pi_C(\Theta_C)$, where the permutations $\pi_B$ and $\pi_C$ re-basin models $\Theta_B$ and $\Theta_C$ to the basin of a common model $\Theta_A$. (b) Test accuracy heatmap for linear combinations of three models: $\Theta_A$, $\pi_B(\Theta_B)$, and $\pi_C(\Theta_C)$. For visual convenience, the coefficients $\lambda_B$ and $\lambda_C$ are transformed to form an equilateral triangle. $\Theta^\star$ denotes the best-performing model. The model is ResNet-20 with a width multiplier of 40 to highlight an example where the best-performing model is an interior point.

Figure 12: Network level functional comparison of ResNet-20 model combinations. The stacked plots illustrate the distribution of the 10,000 CIFAR-10 test datapoints for different model combinations across four categories: the predicted labels match with Model A only, Model B only, neither, or both.

Figure 14: Measuring filter pair importance by randomly re-matching the lowest scoring $k$ pairs. Horizontal axes denote the number of re-matched filters, vertical axes denote the loss barrier interpolating between the resulting models.
Figure B.26: Network level functional comparison of ResNet-20 and Tiny-10 linear interpolation. The stacked plots depict how, for different model combinations, the 10,000 datapoints of the CIFAR-10 test set are distributed among the following four categories: the predicted labels match with Model A only, Model B only, neither, or both.