Estimating the perceived dimension of psychophysical stimuli using triplet accuracy and hypothesis testing

Abstract Vision researchers are interested in mapping complex physical stimuli to perceptual dimensions. Such a mapping can be constructed using multidimensional psychophysical scaling or ordinal embedding methods. Both methods infer coordinates that agree as much as possible with the observer’s judgments so that perceived similarity corresponds with distance in the inferred space. However, a fundamental problem of all methods that construct scalings in multiple dimensions is that the inferred representation can only reflect perception if the scale has the correct dimension. Here we propose a statistical procedure to overcome this limitation. The critical elements of our procedure are i) measuring the scale’s quality by the number of correctly predicted triplets and ii) performing a statistical test to assess if adding another dimension to the scale improves triplet accuracy significantly. We validate our procedure through extensive simulations. In addition, we study the properties and limitations of our procedure using “real” data from various behavioral datasets from psychophysical experiments. We conclude that our procedure can reliably identify (a lower bound on) the number of perceptual dimensions for a given dataset.

Table 1 summarises possible conversions from various comparison-based tasks to triplets, such that these responses are usable with our proposed procedure. The converted triplets are partially dependent, so scaling performance might not be comparable between equal numbers of sampled and converted triplets. From a psychological perspective, one should also be cautious when comparing responses from conversions, because the actual task, the instructions, and the context (i.e. fewer or more presented stimuli) probably influence the responses. Mathematically, these conversions are sound as long as the triangle inequality holds in the perceptual space, which is a reasonable assumption.

Table 1. Conversions from other comparison-based tasks to triplets. Triplets denote the response by the order of the stimulus indices (anchor, chosen, other), and curly brackets are a short notation for repeating the same triplet with all index variants in the bracket, e.g.
In the examples below, we denote duplicated triplets with exchanges in one position by curly braces.
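The curly-brace notation can be read as a small expansion rule. The following is a minimal sketch, assuming that a position written in braces is passed as a Python set; the function name `expand_triplets` is hypothetical and not part of the original procedure.

```python
from itertools import product


def expand_triplets(anchor, chosen, other):
    """Expand the short notation into explicit (anchor, chosen, other) triplets.

    Each argument is either a single stimulus index or a set of indices,
    which stands for the curly braces in the short notation (an assumption
    made for this sketch).
    """
    def variants(position):
        # A braced position contributes every index it contains.
        if isinstance(position, (set, frozenset)):
            return sorted(position)
        return [position]

    return list(product(variants(anchor), variants(chosen), variants(other)))


# (0, 1, {2, 3}) stands for the two triplets (0, 1, 2) and (0, 1, 3).
expanded = expand_triplets(0, 1, {2, 3})
```

Because the expanded triplets share an anchor and a chosen stimulus, they are not independent samples, which matches the caveat on partially dependent converted triplets.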

B Normal distribution of accuracy samples
The procedure proposed in this work assumes normally distributed test accuracies from repeated scale estimates, an assumption that the main part of this paper motivates theoretically. Here we support it empirically with a brief simulation experiment and a statistical test for normality.
We looked at two different accuracy samples: accuracies from 100 independently simulated triplet datasets of the same ground-truth scale, the so-called hand-off accuracies, which approximate the actual accuracy distribution, and cross-validation accuracies of a single simulated dataset, as used in our procedure. We simulated every dataset of 2,000 triplets with a 3D-normal scale (medium noise) as described in the paper's simulation section. In the hand-off setting, we estimated the scale with 1,800 triplets and calculated the accuracy with the remaining 200; in the cross-validation setting, we used ten repetitions of 10-fold cross-validation.
The histograms of both accuracy samples are shown in Figure 11 along with the sample means (vertical line) and intervals of two standard deviations (dashed line). The hand-off accuracy mean, used as a proxy of the actual accuracy, is included in the corrected interval of the cross-validation samples, although this interval overestimates the accuracy spread. The normality of both accuracy samples was tested with a combined skew and kurtosis test (D'Agostino & Pearson, 1973). Neither test rejected the null hypothesis, so both samples are consistent with a normal distribution (CV: s² + k² = 2.46, p > .05; test set: s² + k² = 0.47, p > .05). Uncorrected cross-validation samples underestimate this variation, which is why it is corrected in statistical tests (Nadeau & Bengio, 2003).
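The hand-off setting can be sketched as follows. This is a minimal illustration, not the simulation code of the paper: the number of objects (60), the noise level (`noise_sd=0.5`), and the noise model (Gaussian noise added to the distance difference) are assumptions, and the accuracy is computed with the ground-truth scale because no embedding algorithm is included here.

```python
import math
import random


def simulate_triplets(scale, n_triplets, noise_sd, rng):
    """Simulate (anchor, chosen, other) responses with Gaussian judgment noise."""
    triplets = []
    for _ in range(n_triplets):
        anchor, a, b = rng.sample(range(len(scale)), 3)
        d_a = math.dist(scale[anchor], scale[a])
        d_b = math.dist(scale[anchor], scale[b])
        # The simulated observer picks the stimulus that appears closer
        # to the anchor after noise is added (assumed noise model).
        if d_a - d_b + rng.gauss(0.0, noise_sd) < 0:
            triplets.append((anchor, a, b))
        else:
            triplets.append((anchor, b, a))
    return triplets


def triplet_accuracy(scale, triplets):
    """Fraction of triplets whose chosen stimulus is closer to the anchor."""
    correct = sum(
        math.dist(scale[i], scale[j]) < math.dist(scale[i], scale[k])
        for i, j, k in triplets
    )
    return correct / len(triplets)


rng = random.Random(0)
# 3D-normal ground-truth scale with 60 objects (object count is an assumption).
scale = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(60)]
triplets = simulate_triplets(scale, 2000, noise_sd=0.5, rng=rng)
# Hand-off split: 1,800 triplets for estimation, 200 for evaluation.
train, test = triplets[:1800], triplets[1800:]
accuracy = triplet_accuracy(scale, test)
```

In the actual experiment, the scale passed to `triplet_accuracy` would be the one estimated from `train` with an ordinal embedding method, and the whole simulation would be repeated 100 times to obtain the hand-off accuracy sample.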
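As a worked check of the reported statistics: under the null hypothesis, the combined skew and kurtosis statistic is asymptotically chi-squared distributed with two degrees of freedom, and for two degrees of freedom the upper tail probability has the closed form exp(-x / 2). The sketch below applies this to the two reported values; the resulting p-values are an illustration derived under that asymptotic assumption and are not taken from the original analysis.

```python
import math


def omnibus_p_value(statistic):
    """Upper tail of a chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


p_cv = omnibus_p_value(2.46)        # cross-validation sample
p_test_set = omnibus_p_value(0.47)  # hand-off (test set) sample
```

Both values are well above .05, which agrees with the reported failure to reject normality.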

C Noise visualization
The influence of the number of triplets and the judgment noise on scale estimates is illustrated in Figure

D Algorithm details
Algorithm 1 shows, in pseudo-code, the details of the repeated cross-validation and the test corrections used in our dimension testing procedure.
Algorithm 1 Procedures for repeated CV and corrected t-test.
  ...
  for all j ∈ 1..r do                        ▷ Repeat CV r times.
    ...
    for all (s_train, s_test) ∈ S do         ▷ k-fold cross-validation.
      ...
      Evaluate on test triplets.
  ...
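The two ingredients of Algorithm 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the original implementation: the function names are hypothetical, and the correction shown is the commonly used variance factor 1/(k·r) + n_test/n_train attributed to Nadeau & Bengio (2003), which may differ in detail from the steps missing in the pseudo-code above.

```python
import math
import random
import statistics


def repeated_kfold(n_triplets, k=10, r=10, seed=0):
    """Yield (train_idx, test_idx) for r repetitions of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(r):
        idx = list(range(n_triplets))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [t for j, fold in enumerate(folds) if j != i for t in fold]
            yield train_idx, test_idx


def corrected_t_statistic(differences, n_train, n_test):
    """Corrected resampled t statistic for paired accuracy differences.

    `differences` holds one accuracy difference per CV split, for example
    accuracy with d + 1 dimensions minus accuracy with d dimensions.
    Returns the statistic and its degrees of freedom.
    """
    m = len(differences)
    mean = statistics.fmean(differences)
    variance = statistics.variance(differences)  # sample variance
    # The extra n_test / n_train term inflates the variance to account
    # for overlapping training sets across splits.
    t = mean / math.sqrt((1.0 / m + n_test / n_train) * variance)
    return t, m - 1


splits = list(repeated_kfold(2000, k=10, r=10))
t_corrected, dof = corrected_t_statistic([0.01, 0.03, 0.02, 0.02], 1800, 200)
```

The p-value would then follow from a Student t distribution with `dof` degrees of freedom. The corrected statistic is always smaller in magnitude than the uncorrected paired t statistic, which makes the test more conservative.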

E Simulating a psychophysical experiment
Here we show simulation results where the ground-truth scale is inspired by actual psychophysical scales: the idealized hue and pitch perception as a wheel and a helix.

E.1 The hue perception wheel
This experiment used a two-dimensional hue circle as a realistic example of multi-dimensional ground-truth scales (Figure 1). While we simulate triplets from this ground-truth scale, the ground-truth scale is not artificial but estimated from psychological data. The scale was estimated from pairwise hue dissimilarities (Ekman, 1954) with the multi-dimensional scaling algorithm (Shepard, 1962).
Our procedure correctly estimated two dimensions in most settings, as shown in Figure 13.

E.2 The pitch perception helix
The ground-truth scale used in this section is a three-dimensional helix (Figure 14), which is inspired by models of pitch perception (Shepard, 1965) but not based on behavioural data. As in the simulation experiments, we created a ground-truth scale and simulated responses, including normally distributed judgment noise. The ground-truth helix has three rotations with 12 tones (an octave) each, where the height of a rotation equals the helix's diameter.
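The helix geometry described above can be written down directly. The following sketch assumes a unit diameter and equally spaced tones along each rotation; the function name `pitch_helix` and the absolute scale are illustrative choices, not taken from the original simulation code.

```python
import math


def pitch_helix(n_rotations=3, tones_per_rotation=12, diameter=1.0):
    """Return 3D coordinates of tones on a helix.

    One rotation spans an octave, and the height gained per rotation
    equals the helix's diameter.
    """
    radius = diameter / 2.0
    points = []
    for tone in range(n_rotations * tones_per_rotation):
        angle = 2.0 * math.pi * tone / tones_per_rotation
        height = diameter * tone / tones_per_rotation
        points.append((radius * math.cos(angle), radius * math.sin(angle), height))
    return points


helix = pitch_helix()
octave_distance = math.dist(helix[0], helix[12])
```

Tones an octave apart lie directly above each other at a distance of one diameter, the same distance that separates opposite tones on the circle. Changing the height per rotation relative to the diameter shifts how strongly octave similarity competes with the ordering along the unrolled pitch axis.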
Our procedure reconstructs the three-dimensional structure in the setting with high noise and low accuracy (Figure 15).
Surprisingly, the noisier setting shows another reasonably accurate representation with a single dimension, an unrolled version of the helix. This more straightforward, unrolled representation is preferred if less data is available. This trade-off between the unrolled 1D and the helix 3D representation should depend on the helix's diameter-to-height ratio. The p-values suggest two different interpretations: from noisy or small datasets, just the one-dimensional perceived pitch is reconstructed.
The p-values for the large dataset show that, provided enough data, a three-dimensional representation can capture additional nuances (the helix-like similarities between octaves).

F Overview of simulation results
Figure 16 shows an overview of dimension estimates across all the normal datasets. No simulation overestimated the ground-truth dimension. While most simulations predicted the correct dimensionality, some underestimated it, especially at high noise and large ground-truth dimensions. The noise might mask distinctive distance information of additional dimensions, so we can interpret these dimension estimates as lower bounds. In psychological practice, such conservative or lower-bound dimension estimates are beneficial as they provide the simplest model that explains the collected data, given the inherent noise in the data.

Figure 16. Overview of difference between estimated and ground-truth dimension (residual dimension) for the datasets 1d-Normal (n = 60), 2d-Normal (n = 20), 2d-Normal (n = 60), 3d-Normal (n = 20), 3d-Normal (n = 60), 3d-Normal (n = 100), 8d-Normal (n = 60), and 8d-Normal (n = 100) at low, medium, and high noise. Overestimating dimension occurred just for one-dimensional datasets, while underestimation occurred more often for higher dimensions and larger noise.