A predictive coding approach to modelling the perceived complexity of popular music drum patterns

This study presents a method to estimate the complexity of popular music drum patterns based on a core idea from predictive coding. Specifically, it postulates that the complexity of a drum pattern depends on the quantity of surprisal it causes in the listener. Surprisal, according to predictive coding theory, is a numerical measure that takes large values when the perceiver's internal model of the surrounding world fails to predict the actual stream of sensory data (i.e. when the perception surprises the perceiver), and low values when model predictions and sensory data agree. The proposed new method first approximates a listener's internal model of a popular music drum pattern (using ideas on enculturation and a Bayesian learning process). It then quantifies the listener's surprisal by evaluating the discrepancies between the predictions of the internal model and the actual drum pattern. It finally estimates drum pattern complexity from surprisal. The method was optimised and tested using a set of forty popular music drum patterns, for which empirical perceived complexity measurements are available. The new method provided complexity estimates that had a good fit with the empirical measurements (R2=.852). The method was implemented as an R script that can be used to estimate the complexity of popular music drum patterns in the future. Simulations indicate that we can expect the method to predict perceived complexity with a good fit (R2≥.709) in 99% of drum pattern sets randomly drawn from the Western popular music repertoire. These results suggest that surprisal indeed captures essential aspects of complexity, and that it may serve as a basis for a general theory of perceived complexity.


Introduction
The complexity of art or music has long been understood as an essential aesthetic dimension ([1][2][3][4]), and there is an abundant body of research that investigates the relationship between the complexity of artifacts and how much observers appreciate them (for an overview, see Refs. [5,6]). Researchers seem to agree that the complexity of an artifact can be expressed as a quantity (and thus measured), and that it is sensible to state that one artifact is more complex than another. In spite of its frequent use, however, the concept of complexity in art is difficult to grasp. It seems to have an objective and a subjective component: on the one hand, complexity is rooted in the artifact itself, in its objective properties, in the relationships between its parts and, as Pressing described it, in the "existence of structure on many scales or levels" ([7], p. 1). On the other hand, complexity is also experienced subjectively by the person perceiving the artifact: understanding the structure of the artifact can be seen as a problem that needs to be solved, and the complexity of the artifact has been "defined as the minimal cost of computing an approximate solution to the problem" ([7], p. 2).
This study proposes a new approach to measuring complexity, based on ideas from predictive coding (PC), that combines objective and subjective aspects. PC is a theory that aims at explaining cognitive functions such as perception, learning, and action ([16,17]); it may potentially provide a unifying framework for the cognitive sciences ([18]). PC is based on the idea that an organism's brain is essentially a prediction machine which compares incoming sensory data with the predictions of an internal model. The organism aims at avoiding discrepancies between the sensory data and the predictions, because these discrepancies may indicate an unexpected danger from the outside world. The organism has two possibilities to harmonise the sensory data with the predictions: it can either update the internal model in order to better predict the sensory data in the future (learning), or it can influence the outside world and thus suppress discrepant sensory data (action, see Ref. [16], p. 2). During the last two decades, PC theory was formulated in rigorous mathematical terms ([19,20]), and it was successfully applied to model cognitive functions in a variety of contexts and disciplines (for a comprehensive overview, see the preprint by Millidge et al. [21]).
This study aims at modelling the complexity of drum patterns from Western popular music using the PC concept of surprisal. Surprisal is a non-negative metric that, at each moment in time, quantifies the surprise triggered by the discrepancy between sensory data and the predictions of the internal model. This study hypothesises that the subjectively perceived complexity of a drum pattern depends on the total quantity of surprisal it causes in the listener. The quantity of surprisal will be calculated depending on the objective properties of the pattern (which represents the sensory data), on the expectations of the listener determined by their experience and cultural background (the listener's internal model), and on a mechanism that detects the regularities of the pattern as it unfolds and adapts the internal model to reduce surprisal in the future (learning).

Table 1
Perceived complexity (Bradley-Terry coefficients β_i, see Ref. [22], p. 13) and estimated complexity (γ_i) for each of the 40 drum pattern stimuli.

The modelling work in this study relies on a set of forty drum patterns from the Western popular music repertoire that was recently presented in a preprint by Senn et al. ([22]) and that was drawn from a larger corpus of drum patterns (Lucerne Groove Research Library, www.grooveresearch.ch). The forty drum patterns are paired with reliable perceived complexity measures, established empirically in a listening experiment. The success of the modelling effort, and thus the merit of the PC-related ideas to define complexity, will be judged based on the agreement between estimated complexity and the empirical measures of perceived complexity. This agreement is measured by the R2 effect size (the square of the Pearson product-moment correlation). The model will be considered successful if it outperforms the benchmark of R2 = .491 set by the Index of Syncopation developed by Witek et al. ([9]), an objective measure that is frequently used to estimate drum pattern complexity, and the benchmark of R2 = .560 achieved by the Revised Index of Syncopation by Hoesl and Senn ([23]), which is derived from Witek et al.'s measure (for the fit of these existing measures with the empirical measurements, see Ref. [22], p. 18).
This study is not the first to use the PC framework in order to explain phenomena relevant to music cognition and rhythm. A series of neuroscientific studies ([24][25][26][27]) used electroencephalography to investigate how single surprising rhythmic events are processed in the brain. Vuust et al. [24] showed that a surprising event triggered the mismatch negativity reaction (MMNm, ca. 110-130 ms after stimulus onset) and a P3am response (ca. 80 ms after the MMNm). They interpreted the MMNm reaction as the brain's detection of a discrepancy (which can be understood as a neural equivalent of surprisal), and the P3am as an evaluation of this discrepancy. Vuust et al. developed their ideas into a coherent theory called Predictive Coding of Rhythmic Incongruity or PCRI ([25,26]), to which Lumaca et al. ([27]) added further evidence and nuance. The current study shares some fundamental ideas on PC with these previous studies, but uses the ideas for different purposes. The neuroscientific studies focused on the implementation of PC-related physiological processes in the central nervous system, based on single surprising events. In the current study, the emphasis lies on the development of a mathematically sound measure of pattern complexity based on surprisal, without considering the neural substrate. The neuroscientific studies operate on the lower implementational and algorithmic levels of Marr's hierarchy (see Refs. [28,29], p. 185), whereas the current effort can be located on the hierarchy's upper computational level.

Materials and methods: Perceived complexity of forty popular music drum pattern stimuli
All popular music drum patterns used in this study were taken from a corpus called the Lucerne Groove Research Library (www.grooveresearch.ch). The library currently consists of 251 popular music drum patterns in 4/4 common time that have a duration of 8 bars. Each pattern was originally played by a renowned drummer on a full-band recording in one of the major Western popular music styles (rock, pop, funk, soul, heavy metal, rock'n'roll, disco and others). The drum patterns were carefully reconstructed by Lorenz Kilchenmann as audio stimuli using drum samples, based on transcriptions and exact timing measurements by Toni Bechtold and Florian Hoesl. The majority of these stimuli were first used in a 2018 groove study by Senn et al. [30], which also provides more information about the corpus and stimuli preparation (see also published data and meta-data on the corpus in Ref. [31]).
In their preprint, Senn et al. ( [22]) used forty of the library's drum patterns (see Table 1) to create a stimuli set with perceived complexity measures. They selected patterns that cover a wide complexity range (as measured by the Index of Syncopation, see Ref. [9]) and that have a parsimonious instrumentation with bass drum, snare drum, and cymbals. Senn et al. ( [22]) shortened the patterns to 4 bars (plus the first beat of bar five). These forty patterns and the complexity estimates will be used in this study to train and test the model. The remaining 211 patterns of the library will be used to inform the model about the expectations of an enculturated listener with respect to the structure of popular music drum patterns.
Senn et al. ( [22]) estimated the complexity of the forty drum patterns on the basis of a listening experiment with an incomplete pairwise comparison design, in which 220 participants judged the relative complexity of the stimuli in a total of 4400 pairwise comparison trials. The data from the experiment were analysed using the Bradley-Terry probability model. This resulted in the perceived complexity (β i ) estimates listed in Table 1.
The Bradley-Terry estimates β_i have a clear probabilistic interpretation: they allow us to calculate the probability that one drum pattern is considered to be more complex than another drum pattern when judged in a pairwise comparison trial by a random member of the listener population. The estimated success probability Π_ij that stimulus i is perceived as being more complex than stimulus j is:

Π_ij = exp(β_i) / (exp(β_i) + exp(β_j))    (1)

where the perceived complexity measures β_i and β_j are equal to the Bradley-Terry coefficients corresponding to stimuli i and j ([32], p. 265). To make an example: the probability Π_3,23 that stimulus 3 ("Smells Like Teen Spirit") is heard as being more complex than stimulus 23 ("Jungle Man") is estimated at:

Π_3,23 = exp(0.476) / (exp(0.476) + exp(2.752)) ≈ 0.093

Here, we used the coefficients β_3 = 0.476 and β_23 = 2.752 from Table 1. We can expect the "Smells Like Teen Spirit" drum pattern (stimulus 3) to be considered more complex than the "Jungle Man" drum pattern (stimulus 23) in only 9.3% of pairwise comparison trials. Conversely, the probability that "Jungle Man" is considered more complex than "Smells Like Teen Spirit" is:

Π_23,3 = exp(2.752) / (exp(2.752) + exp(0.476)) ≈ 0.907

"Jungle Man" can be expected to win 90.7% of the trials against "Smells Like Teen Spirit". Since no ties are allowed in the pairwise contests, it is generally true that Π_ij + Π_ji = 1. Solving equation (1) for β_i − β_j yields:

β_i − β_j = log(Π_ij / Π_ji)    (2)

The difference β_i − β_j is the log-odds or logit of the probability Π_ij that stimulus i is judged to be more complex than stimulus j by a random member of the listener population.
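The Bradley-Terry win probability of equation (1) and the worked "Smells Like Teen Spirit" vs. "Jungle Man" example can be sketched in a few lines. The study's published implementation is an R script; this is an illustrative Python sketch using the β values quoted from Table 1.

```python
import math

def bt_prob(beta_i, beta_j):
    """Probability that stimulus i beats stimulus j in a pairwise
    complexity comparison under the Bradley-Terry model (eq. 1)."""
    return math.exp(beta_i) / (math.exp(beta_i) + math.exp(beta_j))

# Worked example from the text: "Smells Like Teen Spirit" (beta_3 = 0.476)
# versus "Jungle Man" (beta_23 = 2.752).
p_3_23 = bt_prob(0.476, 2.752)   # ~0.093
p_23_3 = bt_prob(2.752, 0.476)   # ~0.907

# Eq. (2): the difference of the coefficients is the logit of p_3_23.
logit = math.log(p_3_23 / p_23_3)
```

Note that `logit` recovers β_3 − β_23 = 0.476 − 2.752 exactly, illustrating equation (2).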

Surprisal and effort
During its lifetime, a sentient organism is subject to a stream of sensory information s(t) that varies over time t. According to predictive coding theory, the organism has an internal probabilistic generative model m(t), which expects s(t) with a certain probability. The intensity of the organism's surprise, or surprisal, at any point in time can be defined as follows (see also the definitions of surprisal in Refs. [16], p. 2, and [33], p. 64):

f(t) = −ln p(s(t) | m(t))    (3)

Surprisal is the negative natural logarithm of the probability of experiencing the sensation s(t) at time t, given the internal model m(t). Let us assume that stimulus i triggers the surprisal function f_i in a listener, as visualised in Fig. 1 (a). We define the effort E_i it takes a listener to cognitively process stimulus i as the sum of the surprisal values from the beginning of the stimulus at time t_1 until it ends at t_K:

E_i = ∑_{k=1}^{K} f_i(t_k)    (4)

Note that, here and throughout this entire study, the time domain is understood as consisting of K discrete, equidistant time points t_1, t_2, …, t_K at which the values of the surprisal function are evaluated. For stimulus i in Fig. 1 (a), the surprisal values are relatively large, whereas for stimulus j, Fig. 1 (b) shows lower surprisal values. Consequently, stimulus i is associated with a greater effort than j. Overall, surprising stimuli are associated with great effort.
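The surprisal of equation (3) and the effort of equation (4) can be sketched directly. This is an illustrative Python sketch (the study's implementation is in R); the model probabilities below are made-up values for two hypothetical short stimuli, not data from the study.

```python
import math

def surprisal(p):
    """Surprisal of a sensation to which the internal model assigns
    probability p (eq. 3): the negative natural logarithm of p."""
    return -math.log(p)

# Effort (eq. 4): sum of surprisal over the K discrete time points.
# Hypothetical model probabilities assigned to the events of two stimuli:
probs_i = [0.2, 0.3, 0.25, 0.2]   # poorly predicted -> high surprisal
probs_j = [0.9, 0.85, 0.95, 0.9]  # well predicted   -> low surprisal

effort_i = sum(surprisal(p) for p in probs_i)
effort_j = sum(surprisal(p) for p in probs_j)
```

As in Fig. 1, the poorly predicted stimulus accumulates the greater effort.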

Estimated complexity
In this section, we develop the estimated complexity values γ_i in several steps, using concepts such as the effort E_i and the success probability Π_ij that have been introduced earlier. The goal is to define estimated complexity γ_i in such a way that it is equivalent to perceived complexity β_i (empirically measured in Ref. [22]), and that, consequently, γ_i and β_i have the same probabilistic interpretation.
First, we define an expression for the success probability Π_ij that stimulus i wins a pairwise comparison trial against stimulus j (and, conversely, the probability Π_ji that stimulus j wins against i) as follows:

Π_ij = E_i^C / (E_i^C + E_j^C),  Π_ji = E_j^C / (E_i^C + E_j^C)    (5)

where C > 0 is a constant that needs to be estimated, and E_i, E_j are the efforts associated with the two stimuli. Note that Π_ij and Π_ji are necessarily in the range [0, 1] adequate for probabilities, because E_i, E_j, C > 0. Also:

Π_ij + Π_ji = (E_i^C + E_j^C) / (E_i^C + E_j^C) = 1

Since there are no ties in the trials, the two probabilities must add up to 1. Also, if E_i = E_j, then:

Π_ij = Π_ji = E_i^C / (2 E_i^C) = 0.5

If the efforts associated with stimuli i and j are equal, then the success probabilities are equal at Π_ij = Π_ji = 0.5 as well. We define the estimated complexities γ_i and γ_j of stimuli i and j to be:

Estimated Complexity_i = γ_i = log(E_i^C) + D    (6)
Estimated Complexity_j = γ_j = log(E_j^C) + D    (7)

where D ∈ R is another constant. The difference γ_i − γ_j can then be expressed as follows, using (6), (7), and (5):

γ_i − γ_j = log(E_i^C) − log(E_j^C) = log(E_i^C / E_j^C) = log(Π_ij / Π_ji)    (8)

From equation (8) follows that the γ_i have the same interpretation as the Bradley-Terry coefficients β_i shown in equation (2):

γ_i − γ_j = log(Π_ij / Π_ji) = β_i − β_j    (9)

This implies that perceived complexity β_i and estimated complexity γ_i are equivalent. Consequently, we can calculate the probability Π_ij in analogy to equation (1):

Π_ij = exp(γ_i) / (exp(γ_i) + exp(γ_j))

The constant C scales the efforts E_i in order to allow for this probabilistic interpretation. Strictly speaking, the constant D is not necessary, since it cancels in the difference γ_i − γ_j. However, it allows us to transpose the γ_i in such a way that they take values in the same range as the Bradley-Terry β_i and are thus easier to interpret. With the definition of estimated complexity in equations (6) and (7), two goals have been achieved: firstly, the estimated complexity γ_i of stimulus i is entirely based on the effort E_i of perceiving stimulus i (and thus on the surprisal function associated with stimulus i) plus two constants.
Secondly, equation (9) shows that the estimated complexity γ i has the same interpretation as the perceived complexity of stimulus i expressed by the Bradley-Terry coefficients β i .
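The logit identity that links effort-based success probabilities to estimated complexity can be checked numerically. This is an illustrative Python sketch assuming the power form Π_ij = E_i^C / (E_i^C + E_j^C) of equation (5); the efforts and constants below are made-up values, not the study's data.

```python
import math

def success_prob(E_i, E_j, C):
    """Eq. (5): probability that stimulus i wins a complexity
    comparison against stimulus j, given efforts E_i, E_j and C > 0."""
    return E_i**C / (E_i**C + E_j**C)

def estimated_complexity(E, C, D):
    """Eqs. (6)/(7): gamma = log(E^C) + D = C*log(E) + D."""
    return C * math.log(E) + D

# Hypothetical constants and efforts:
C, D = 1.3, -0.5
E_i, E_j = 40.0, 25.0

gamma_i = estimated_complexity(E_i, C, D)
gamma_j = estimated_complexity(E_j, C, D)
pi_ij = success_prob(E_i, E_j, C)

# Eq. (8): gamma_i - gamma_j should equal log(pi_ij / pi_ji).
diff = gamma_i - gamma_j
```

The assertion that `diff` equals the log-odds of `pi_ij` holds for any choice of positive efforts and C, which is the point of equations (8) and (9).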

Table 2
Matrix S_13 representing the sensory data of the "Bravado" drum pattern, played by Neil Peart (stimulus 13). For a transcription of this pattern, see Fig. 2 (c).

Modelling drum pattern complexity
We now move on to modelling complexity. In order to simplify, we make a few assumptions: Firstly, we consider the time dimension to be a grid of discrete 16th-note positions. In any 4-bar pattern, there are 16 sixteenth notes in each of the 4 bars, and the pattern ends on the first beat of bar 5. This results in K = 65 metric positions on which events may happen. Secondly, we understand each drum pattern to consist of L = 3 instrumental layers: bass drum (l = 1), snare drum (l = 2), and cymbals (l = 3). The forty drum patterns that are used for modelling in this study satisfy this second assumption, since they only feature bass drum, snare drum and different cymbals. Toms and other percussion instruments were explicitly excluded in the selection of these patterns (for pattern selection, see Ref. [22]). Thirdly, information on event loudness is discarded.
The discrete time and instrumentation dimensions allow us to encode all relevant information in matrix form. The modelling process can then be carried out using methods of linear algebra. It consists of the following five steps:

1. Encoding the drum pattern matrix S_i: In this step, the drum pattern i is encoded as a matrix in binary form. An onset is encoded as 1 and the absence of an onset as 0 for each time point k and each instrumental layer l. This information represents the sensory data, and it is collected in the 65 × 3 matrix S_i (see an example in Table 2), which encodes the time structure of the pattern in its rows and the instrumentation in its columns.

2. Deriving the internal model M_0 of an enculturated listener from a corpus of drum patterns: This step generates the internal model (or expectations) of an enculturated listener based on previous experience. It specifies the probability of a note onset at each time point k in each instrumental layer l based on a corpus of drum patterns, and it represents the probabilities in the 65 × 3 matrix M_0 (see Table 3).

3. Learning the pattern in a Bayesian process and updating the internal model M_i: In this step, a learning process is implemented, which simulates how the listener adapts the internal model as the drum pattern unfolds in time and is being perceived. This process starts with the internal model M_0 of the enculturated listener. Then, the probabilities for every time point k and every instrumental layer l are updated in a Bayesian way on the basis of the events in the drum pattern S_i previous to k. This leads to the 65 × 3 matrix M_i, which represents the internal model as the drum pattern i unfolds and is being learned (see an example of M_i in Table 6). In this Bayesian process, M_0 represents the prior distribution, S_i is the new data, and M_i is the posterior distribution.

4. Estimating complexity: In this step, the value of the surprisal function is estimated on the basis of the sensory data S_i and the internal model M_i for each time point k and each instrumental layer l. Large discrepancies between S_i and M_i result in high surprisal. The surprisal values are stored in the 65 × 3 matrix Σ_i, which allows us to calculate the effort E_i and the estimated complexity γ_i for each stimulus. An example of Σ_i can be seen in Table 7.

5. Determining the strength λ of the prior model M_0: In this step, the strength λ of the prior distribution (the enculturated internal model M_0) is determined such that the coefficient of determination R2 between the perceived (β_i) and estimated (γ_i) complexity values is maximised. The parameter λ determines the inertia of M_0 in light of new data and can be understood as the inverse of the learning speed.

Table 3
Matrix M_0: Probabilities that an event happens in an instrumental layer on a metric position, based on a corpus of n = 211 popular music drum patterns. The probabilities repeat every bar. Events of the generic backbeat pattern are in bold print, see also Fig. 2.

Fig. 2 shows the generic backbeat pattern (a) and three transcribed examples of the 40 drum patterns used in this study (b-d). We understand the time dimension of these 4-bar drum patterns as a grid of 65 sixteenth-note positions on which a note onset may happen or not. Each onset is associated with one of three instrumental categories: either the onset is a kick of the bass drum (notated above the bottom line of the staff), or it is a stroke on the snare drum (notated above the middle line), or it is an event involving one of the cymbals (most often notated above the top line). The drum patterns used in this study do not contain events played on any other drum (e.g. toms) or percussion instruments.

Encoding the drum pattern matrix S i
The 65 × 3 drum pattern matrix S_i encodes the pattern information for drum pattern i. For an example, see Table 2, which shows the drum pattern matrix S_13 of the "Bravado" drum pattern, Fig. 2 (c). The K = 65 rows of the matrix encode the time information, and the L = 3 matrix columns encode the instrument information with bass drum (l = 1), snare drum (l = 2) and cymbals (l = 3). So, if in stimulus i there is an onset at time k in instrument category l, then S_i(kl) = 1; otherwise S_i(kl) = 0. Note that the transcribed note length is irrelevant: for an event notated as a quarter note in Fig. 2 (c), worth four 16th notes, only the first 16th-note position (where the onset actually happens) is represented as a 1; the remaining three 16th-note positions are notated as zeros. The binary encoding in matrix form reduces the drum pattern information substantially relative to the transcription, and even more relative to the audio stimulus. Aspects of timbre (e.g. which cymbal is being played, playing techniques), dynamics, microtiming and tempo are not represented: as outlined in the simplifying assumptions above, we only require the information whether, in drum pattern i, there is a note onset (1) or not (0) in instrument category l at a certain point in time k.
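The binary encoding can be sketched as follows. This is an illustrative Python sketch of the encoding scheme for a single hypothetical bar of the generic backbeat (the study itself uses 65 × 3 matrices covering 4 bars plus one beat).

```python
# One bar on a 16th-note grid, columns: bass drum (l=1), snare (l=2),
# cymbals (l=3). Only onsets count; note lengths are discarded.
K, L = 16, 3
S = [[0] * L for _ in range(K)]

bass_onsets = [1, 9]              # beats 1 and 3 (1-indexed 16th positions)
snare_onsets = [5, 13]            # beats 2 and 4 (the backbeat)
cymbal_onsets = range(1, 16, 2)   # eighth-note cymbal: positions 1, 3, ..., 15

for k in bass_onsets:
    S[k - 1][0] = 1               # shift to 0-indexed rows
for k in snare_onsets:
    S[k - 1][1] = 1
for k in cymbal_onsets:
    S[k - 1][2] = 1
```

Each row of `S` corresponds to one metric position k, mirroring the layout of Table 2.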

Deriving the internal model M 0 of an enculturated listener from a corpus of drum patterns
People with intact hearing who have access to audio-visual media (such as TV, radio, YouTube, etc.) and consume media content on a daily basis are very likely to come into frequent contact with Western popular music, even more so if they live in a Western country. In a majority of popular music songs, the drumset is an essential part of the instrumentation. So, adult listeners are likely to have heard thousands of popular music songs with drum patterns during their lifetime. They are likely to have at least an implicit knowledge of how such drum patterns are usually constructed and how they relate to the metric organisation of the music. We define this implicit knowledge as the internal model of a typical drum pattern that an enculturated Western listener has formed in their past life.
We do not have direct access to anybody's past experience or to their particular internal model. However, we can try to approximate the internal model of an enculturated listener by estimating probabilities based on the frequencies encountered in a corpus of drum patterns. In this study, we use 211 drum patterns from the Lucerne Groove Research Library as a corpus ([30]). The entire corpus consists of 251 Western popular music drum patterns in 4/4 common time. Forty of these patterns were used to form this study's stimuli set (see also [22]), listed in Table 1. These patterns were excluded from the corpus, and only the remaining 211 patterns were used to estimate the probabilities of the internal model. The 211 patterns contain 7,196 bass drum kicks, 5,348 snare drum strokes, and 10,912 cymbal strokes. The probability that an event happens in one of the three instrumental layers on any particular metric position within a bar was estimated as the proportion of times the event happens on this metric position in the corpus of 211 patterns.
The proportions are listed in the 65 × 3 matrix M_0 (Table 3), whose rows repeat the bar profile of expectations for all 4 bars (plus a fifth repetition of the first downbeat at the very end). On the basis of M_0, it is very probable that a bass drum event happens on the first downbeat of a pattern (m_0(1,1) = 0.914), but it is unlikely that the snare drum plays on this position (m_0(1,2) = 0.039). In contrast, a snare drum event is strongly expected on beat two (m_0(5,2) = 0.887). We observe in Table 3 that all matrix cells that indicate high probabilities (m_0(kl) > 0.5) correspond to events that are compatible with the generic backbeat pattern or "archetypical rock beat" ([35], p. 11, see also Fig. 2 (a)). M_0 suggests that an enculturated listener expects to hear the boom-tchak pattern of the generic backbeat. M_0 represents a set of trainable parameters of the model that are empirically estimated from a corpus of drum patterns. These parameters are trainable, because the probabilities would take different values if they were based (or trained) on a different corpus.
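The estimation of M_0 as relative onset frequencies can be sketched as follows. This is an illustrative Python sketch on a made-up toy corpus (a 4-position, 2-layer grid); the study estimates the proportions from the 211 library patterns.

```python
# M0[k][l]: proportion of corpus patterns with an onset at metric
# position k in instrumental layer l.
corpus = [
    # each pattern: K x L binary grid (toy example, K=4, L=2)
    [[1, 0], [0, 1], [1, 0], [0, 1]],
    [[1, 0], [0, 1], [0, 0], [0, 1]],
    [[1, 1], [0, 1], [1, 0], [0, 0]],
]
n = len(corpus)
K, L = len(corpus[0]), len(corpus[0][0])

M0 = [[sum(pat[k][l] for pat in corpus) / n for l in range(L)]
      for k in range(K)]
```

As in Table 3, cells with high proportions mark the positions where an enculturated listener expects an onset.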

Learning the pattern as a Bayesian process and updating the internal model M i
According to predictive coding theory ( [16]), the organism's overall goal is to stay in healthy conditions. It tries to achieve this by keeping surprisal low and thus avoiding potentially dangerous surprises. The organism may improve the prediction capabilities of the internal model m(t) through a process of learning.
When a drum pattern starts to play, the listener interprets it using their internal model, which we approximated as M_0 and which is based on previous experience. As the pattern unfolds, the listener gets to know the pattern's specific characteristics and adapts the internal model accordingly. So, listening is a learning process which consists of forming a new model M_i that is more appropriate for pattern i than the generic model M_0. We may understand this learning process as an adaptation of expectations that takes into account the pattern up to time k, the 4/4 time signature and the tendency of many drum patterns to repeat every two quarter notes (8 sixteenths), every bar (16 sixteenths), every 2 bars (32 sixteenths) or every 4 bars (64 sixteenths). At every time point k in instrument layer l, the probability of an event will increase when there already were events in this layer at times k − 8, k − 16, k − 32, and/or k − 64, and decrease otherwise.
We model this learning experience as a Bayesian process for which the internal model M_0 serves as the prior, listening to the pattern i (using S_i and a method to update the expectancies of the internal model) provides the new data, and the updated internal model M_i is the posterior. This process is implemented as a conjugate analysis using beta distributions (see Ref. [36], p. 34ff.; [37], p. 24ff.; [38], p. 49ff.). Conjugacy implies that the prior and posterior distributions belong to the same distribution family. Beta distributions are a family of probability distributions for real numbers defined on the range [0, 1] that can be used in conjugate analysis if the parameter of interest is a proportion or a probability. This is the case here, because we want to update the prior probabilities of M_0 to the posterior probabilities of M_i at the 65 × 3 positions of a 4-bar drum pattern; consequently, a conjugate analysis using beta distributions is possible.
We operationalise the updating process by first defining 65 × 3 beta distributions such that the modes of these distributions are equal to the 65 × 3 probabilities of the prior model M_0. Each beta distribution is governed by two parameters, a and b ([39], p. 480f.). Beta distributions with large a and small b generate random numbers near 1. Conversely, beta distributions with small a and large b generate random numbers near 0. In our modelling, we use the location of the mode of the beta distribution as a point estimate of the probability that an event happens in the drum pattern. The mode m of a beta distribution is given by:

m = (a − 1) / (a + b − 2)    (10)

The strength (or inertia) of the prior distribution is given by:

λ = a + b − 2    (11)

When λ takes a large value, the prior distribution is strong and new data will only have a weak influence on the posterior distribution. Conversely, when the prior distribution is weak (low λ), the new data will strongly affect the posterior distribution. The optimal value for parameter λ will be estimated later in this study in an optimisation process. We can substitute equation (11) into (10) and solve (10) for a and b, obtaining:

a = λm + 1,  b = λ(1 − m) + 1

Expanding these formulas to calculate all 65 × 3 a and b values corresponding to the prior model M_0 simultaneously using matrix operations yields:

A_0 = λM_0 + J,  B_0 = λ(J − M_0) + J

where A_0 and B_0 are 65 × 3 matrices that collect the a and b values and J is the 65 × 3 all-ones matrix. With this, we have defined all 65 × 3 prior beta distributions that have their modes at M_0 and prior strength λ, see Tables 4 and 5.
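The round trip from a prior probability m and strength λ to the beta parameters a and b and back, via equations (10) and (11), can be checked in a few lines. This is an illustrative Python sketch; the values of λ and m below are arbitrary examples (λ is only optimised later in the study).

```python
# Prior beta distribution with mode m and strength lambda:
#   a = lambda*m + 1,  b = lambda*(1 - m) + 1
# so that mode = (a - 1)/(a + b - 2) = m and a + b - 2 = lambda.
lam = 10.0    # hypothetical prior strength
m = 0.914     # e.g. P(bass drum on the first downbeat) from Table 3

a = lam * m + 1
b = lam * (1 - m) + 1

mode = (a - 1) / (a + b - 2)     # eq. (10)
strength = a + b - 2             # eq. (11)
```

Recovering `mode == m` and `strength == lam` confirms that the parameterisation inverts equations (10) and (11) exactly.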
If an event happens in instrument l at metric position k, then the probabilities of an event happening half a bar later (k + 8), 1 bar later (k + 16), 2 bars later (k + 32) or 4 bars later (k + 64) need to be increased. This can be achieved by increasing the parameters a at these future positions (which implies that the mode of the beta distribution and thus the probability of an event increases). Conversely, if there is a rest at position k, then the future probabilities should be reduced, which can be achieved by increasing the parameters b at these future positions (decreasing the mode of the beta distribution and thus the probability of an event).
To this end, we first define the update matrix U, which is a 65 × 65 matrix of 4085 zeros and 140 ones. It takes care of projecting the expectations half a bar, 1 bar, 2 bars, or 4 bars into the future. Matrix U is provided in Fig. 6. For all ones in U, the column number indicates the position in time that affects another position, and the row number indicates the position that is being affected. To make an example: the cell u_9,1 takes value 1, and this means that what happens at k = 1 (column number) affects what happens at k = 9 (row number), which is half a bar later. U can be understood as a hyper-parameter of the model, because it encapsulates listeners' knowledge of the 4/4 metre and their experience that patterns tend to repeat with periods of half a bar, 1 bar, etc.
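The update matrix U can be constructed programmatically from the four repetition lags. This is an illustrative Python sketch (0-indexed, whereas the text uses 1-indexed positions, so u_9,1 corresponds to `U[8][0]`).

```python
# U[k][j] = 1 iff position j affects position k, i.e. the distance
# k - j is 8, 16, 32 or 64 sixteenths (half a bar, 1, 2 or 4 bars).
K = 65
LAGS = (8, 16, 32, 64)
U = [[1 if (k - j) in LAGS else 0 for j in range(K)] for k in range(K)]

n_ones = sum(sum(row) for row in U)   # 140 ones, hence 4085 zeros
```

Because every lag is positive, all ones lie strictly below the main diagonal: U is lower triangular, so only past events can affect the present.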
When a stroke is played at metric position k in instrument l, then the probabilities of the corresponding future events need to increase. This is equivalent to increasing the corresponding a parameters of the beta distributions by one, and is achieved by carrying out the following operation:

A_i = A_0 + U S_i    (12)

where A_i is the matrix of the coefficients a for stimulus i after learning has taken place, and U, S_i, and A_0 are defined as above.
When there is no stroke at metric position k in instrument l, then the probability of the corresponding future events should be decreased. This is equivalent to increasing the b parameters of the beta distributions by one, and is achieved by the following matrix operation:

B_i = B_0 + U (J − S_i)    (13)

where B_i is the matrix of the coefficients b for stimulus i after learning has taken place. U, J, S_i and B_0 are defined as above. The operation J − S_i swaps the zeros and ones of the pattern matrix such that all onsets are represented as zeros and all rests as ones. This ensures that the rests are projected forward to the b parameters. The updating process operates similarly to the conjugate Bayesian analysis for a binomial proportion, with the beta distributions specified by parameters A_0 and B_0 acting as prior distributions and those specified by A_i and B_i as the posterior distributions. The only difference is that the counts that contribute to A_i and B_i are not observed successes or failures in Bernoulli experiments, but counts of the times a future metric position is made more or less probable based on previous events.
We finally use the parameters A_i and B_i of the posterior beta distributions to calculate the modes of the model M_i, which are point estimates of the probabilities of the internal model after learning stimulus i has taken place. This is achieved using equation (10), which in matrix form becomes:

M_i = (A_i − J) ⊘ (A_i + B_i − 2J)    (14)

where ⊘ indicates the Hadamard division (element-wise division of two matrices of the same dimension, see Ref. [40]). The probabilities of the updated internal model M_13 after learning Neil Peart's pattern of "Bravado" can be studied in Table 6. It may be confusing that the matrix operations of equations (12) and (13) update the coefficients A_i and B_i at once, without reference to the temporality involved (the fact that only onsets at times k − 8, k − 16, etc. affect the probability of an onset at time k). The temporal aspect is encoded in the update matrix U (Fig. 6), which is of lower triangular form (all ones are below the main diagonal). This ensures that the probabilities at every point in time are only affected by the past. This can be verified by comparing M_0 (Table 3) and M_13 (Table 6): for the first eight sixteenth notes (k = 1 through k = 8), the two tables are identical, because it takes a temporal distance of at least eight sixteenth notes (half a bar) until the past affects the present, until learning takes place and the probabilities of matrix M_13 change.
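The full update sweep of equations (12)-(14) can be sketched on a toy grid. This is an illustrative Python sketch assuming a made-up 16-position, single-layer pattern with a flat prior and only the half-bar lag; the study uses 65 × 3 matrices, all four lags, and an optimised λ.

```python
# Eqs. (12)-(14) on a toy example:
#   A_i = A_0 + U S_i,  B_i = B_0 + U (J - S_i),
#   M_i = (A_i - J) / (A_i + B_i - 2J)   (element-wise)
K = 16
LAGS = (8,)                   # only the half-bar repetition, for brevity
U = [[1 if (k - j) in LAGS else 0 for j in range(K)] for k in range(K)]

lam = 4.0                     # hypothetical prior strength
M0 = [0.5] * K                # flat prior probabilities
S = [1 if k % 4 == 0 else 0 for k in range(K)]   # onset on every beat

A0 = [lam * m + 1 for m in M0]
B0 = [lam * (1 - m) + 1 for m in M0]

US = [sum(U[k][j] * S[j] for j in range(K)) for k in range(K)]        # U S
UR = [sum(U[k][j] * (1 - S[j]) for j in range(K)) for k in range(K)]  # U (J - S)

A = [A0[k] + US[k] for k in range(K)]
B = [B0[k] + UR[k] for k in range(K)]
M = [(A[k] - 1) / (A[k] + B[k] - 2) for k in range(K)]
```

As in the comparison of Tables 3 and 6, the first eight positions of `M` stay at the prior value 0.5; from position 9 onwards the probabilities rise where an onset occurred half a bar earlier and fall where there was a rest.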

Estimating complexity
With the probabilities of the internal model M i we can now calculate the surprisal experienced by an enculturated listener who is learning the pattern as it unfolds. Surprisal, as defined in equation (3), is operationalised as follows:

Σ i = |S i − M i |,

where the surprisal matrix Σ i is a 65 × 3 matrix that represents the estimated surprisal for all K = 65 discrete time positions in the L = 3 instrumental layers. As defined above, S i is the matrix representing the sensory data relating to pattern i in binary form; M i is the internal model resulting from short-term learning; J is the 65 × 3 all-ones matrix; and the vertical bars denote the element-wise absolute value (see the surprisal matrix Σ 13 of the "Bravado" pattern in Table 7). This operationalisation of surprisal is symmetric, meaning that the presence of an unexpected event is as surprising as the absence of an expected event. Let us for example assume that an event happens at time k in instrument l, so S i(kl) = 1, and that this event was highly expected by the internal model, say M i(kl) = 0.9. Then we get:

Σ i(kl) = |1 − 0.9| = 0.1.

Had the same highly expected event failed to occur (S i(kl) = 0), surprisal would have been |0 − 0.9| = 0.9, exactly as large as for an unexpected event (S i(kl) = 1, M i(kl) = 0.1). This exemplifies the symmetry of estimating surprisal from events and non-events. Further, the calculations for the snare drum, bass drum and cymbals are carried out identically, which means that surprises in the three instrumental layers have the same weight.

We estimate the effort E i of learning stimulus i as the grand sum of all values of the surprisal matrix Σ i , in analogy to equation (4):

E i = ∑ k ∑ l Σ i(kl) .

This leads to the effort estimates E i for each of the forty patterns listed in Table 8. Next, we need to estimate the constants C and D, and calculate the estimated complexity values γ i . We intend that the γ i have the same mean and standard deviation as the empirical perceived complexity values β i that were estimated on the basis of the listening experiment in Senn et al. ( [22]).
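The surprisal and effort computations reduce to two lines of matrix code. A minimal sketch with invented toy values (the real matrices are 65 × 3):

```python
import numpy as np

# Symmetric surprisal: Sigma = |S - M|; effort E = grand sum of Sigma.
S = np.array([[1, 0],
              [0, 1],
              [1, 1]])              # toy binary pattern (3 positions, 2 layers)
M = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.9]])          # toy model probabilities

Sigma = np.abs(S - M)  # expected event (1 vs 0.9) and expected rest (0 vs 0.1)
                       # both yield the same low surprisal of 0.1
E = Sigma.sum()        # grand sum over all positions and layers
```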
This will make the γ s and β s comparable, not only in terms of their location and spread, but also in their probabilistic interpretation specified by the Bradley-Terry model. Using the information from Table 8 we obtain:

C = sd(β) / sd(log(E)) = 2.492 and D = mean(β) − C · mean(log(E)) = −6.927,

where β is the vector of the forty β i values, and E is the vector of the forty E i values shown in Table 8. Substituting the values for C and D into equation (6), we obtain the following formula for the estimated complexity γ i of stimulus i:

γ i = 2.492 · log(E i ) − 6.927.
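The moment-matching step can be sketched as follows; the four β and E values are invented for illustration, not taken from Table 8:

```python
import numpy as np

# Choose C and D so that gamma = C * log(E) + D has the same mean and
# standard deviation as the empirical beta values (moment matching).
beta = np.array([0.5, 1.2, 2.0, 3.1])  # toy perceived-complexity estimates
E = np.array([4.0, 9.0, 20.0, 55.0])   # toy effort estimates

logE = np.log(E)
C = beta.std() / logE.std()        # scale: match the standard deviations
D = beta.mean() - C * logE.mean()  # offset: match the means
gamma = C * logE + D               # estimated complexity values
```

By construction, gamma has the same mean and standard deviation as beta, whichever consistent definition of the standard deviation is used, since the scale factors cancel.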

Table 8
Perceived complexity (Bradley-Terry estimates β i ), estimated effort (E i ), its natural logarithm (log(E i )), and the estimated complexity (γ i = C log(E i ) + D, where C = 2.492 and D = −6.927).

The resulting γ i are presented in Table 1 (and in Table 8). Fig. 3 shows a scatterplot of the estimated complexity values γ i (horizontal axis) against the perceived complexity measures β i (vertical axis). The gray diagonal line marks γ = β, the locations where estimated complexity agrees exactly with perceived complexity. For points above this diagonal, the perceived complexity of the stimulus is greater than the model estimates; conversely, for points below the diagonal, it is smaller. Note that the γ i are just a linear transformation of log(E i ). This means that the parameters C and D do not affect the relationship between estimated and perceived complexity (as measured by the Pearson correlation r or the coefficient of determination R 2 ) and therefore do not change the fit of the model.
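That C and D leave the fit untouched is easy to verify numerically: any linear map with positive slope preserves the Pearson correlation, and hence R 2. A small sketch with invented values:

```python
import numpy as np

beta = np.array([0.5, 1.2, 2.0, 3.1])            # toy perceived complexity
logE = np.log(np.array([4.0, 9.0, 20.0, 55.0]))  # toy log-efforts

r_raw = np.corrcoef(beta, logE)[0, 1]   # correlation before calibration
gamma = 2.492 * logE - 6.927            # linear map with the paper's C and D
r_cal = np.corrcoef(beta, gamma)[0, 1]  # correlation after calibration

# The correlation (and therefore R^2 = r^2) is unchanged by the linear map.
```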

Estimating the strength λ of the prior model M 0
As a last step, we need to clarify how the optimal strength λ of the prior distributions was determined. In the context of this model, λ can be understood as a hyperparameter that is an inverse measure of learning speed. We want to choose λ such that the γ i explain as much of the variance of the β i as possible. The proportion of explained variance is operationalised as the coefficient of determination (R 2 ) between β i and γ i . If we optimise λ on the entire set of 40 drum patterns, we run the risk of overfitting the data, because the hyperparameter λ and the outcome criterion R 2 are estimated on the same sample of forty patterns. In order to minimise overfitting, the following simulation approach was chosen. In each of n = 10,000 iterations:

1. The 40 drum patterns were randomly split into a training set and a test set, each with 20 drum patterns.
2. An optimal λ was estimated using the training set: the strength of the prior was varied between λ = 2.1 and λ = 10.0 in steps of 0.1, the coefficient of determination R 2 with respect to the 20 training set patterns was calculated for each value of λ, and the value of λ that maximises R 2 was selected.
3. This optimal λ was then used to estimate R 2 with respect to the 20 test set patterns.

Fig. 4 shows histograms of the 10,000 optimal λ values (resulting from step 2 of the simulation) and R 2 values (resulting from step 3). The optimal lambdas (a) had a mean of λ = 3.56 and a range of [2.1, 7.4]. The empirical distribution of λ had its .025-quantile at 2.7 and its .975-quantile at 4.8; consequently, a prediction interval that contains the optimal λ in 95% of the cases can be estimated to span (2.7, 4.8). The simulated R 2 values (b) had a mean of .835, a range of (.510, .968), and a .01-quantile of .677. So, in 99% of the cases, R 2 ≥ .677.
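The three steps above can be sketched as follows. Everything model-specific is replaced by stand-ins: `effort_for_lambda` is a hypothetical placeholder for the full surprisal model, and the data are randomly generated, so only the split/optimise/evaluate logic mirrors the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    """Coefficient of determination as the squared Pearson correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2

# Stand-in for the real model: in the paper, lam sets the prior strength and
# the efforts E_i are recomputed for the selected patterns. This is a toy.
def effort_for_lambda(lam, idx):
    return base_logE[idx] + noise[idx] / lam

n_patterns = 40
beta = rng.normal(size=n_patterns)                         # stand-in beta_i
base_logE = beta + rng.normal(scale=0.3, size=n_patterns)  # stand-in log(E_i)
noise = rng.normal(scale=0.5, size=n_patterns)

lambdas = np.arange(2.1, 10.05, 0.1)  # candidate prior strengths
test_r2 = []
for _ in range(200):  # the paper runs n = 10,000 iterations
    perm = rng.permutation(n_patterns)
    train, test = perm[:20], perm[20:]            # step 1: random 20/20 split
    best = max(lambdas,                           # step 2: optimise on train
               key=lambda lam: r_squared(beta[train],
                                         effort_for_lambda(lam, train)))
    test_r2.append(                               # step 3: evaluate on test
        r_squared(beta[test], effort_for_lambda(best, test)))

mean_fit = float(np.mean(test_r2))
```

Evaluating each optimised λ on held-out patterns is what makes the resulting R 2 distribution an honest out-of-sample estimate of model fit.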
For practical reasons, it is useful to provide a heuristic λ value for future complexity estimations. A good candidate is the mean optimal lambda value of λ = 3.56. In order to estimate the fit of the model when the strength of the prior is fixed at λ = 3.56, a second simulation was run with n = 10,000 new test sets of 20 drum patterns, randomly drawn from the 40 patterns. For each of these simulated test sets, R 2 was calculated using λ = 3.56. The resulting empirical distribution of R 2 values had a mean of .849, a range of (.471, .967), and a .01-quantile of .709. Future applications of the model are likely to show a fit with a similar distribution: we can expect the model to predict perceived complexity with a fit of R 2 ≥ .709 in 99% of new samples drawn from the same population as those used in this study (Western popular music drum patterns, 4 bars duration, 4/4 time signature, instrumentation with bass drum, snare drum, and cymbals). The estimated complexity (γ i ) values for this study's forty drum patterns (see Fig. 3, Tables 1 and 8) were calculated using the mean strength of prior belief of λ = 3.56 and achieved a model fit with the empirical perceived complexity (β i ) values of R 2 = .852.

Discussion
This study presented a model of perceived popular music drum pattern complexity, based on ideas from predictive coding (PC). In defense of the earlier Indices of Syncopation measures, it must be acknowledged that they were not specifically designed to reproduce the Bradley-Terry estimates (β i ), and that they do not consider the cymbals in their syncopation/complexity calculations. The cymbal layer is quite important for explaining perceived complexity (β i ): if we omit its contribution and estimate complexity based only on the effort associated with the bass drum and the snare drum, the fit of the model drops from R 2 = .852 to R 2 = .714. This implies that a share of R 2 = .138 is uniquely explained by the effort associated with the cymbal layer. Given these favourable properties, the model may serve as a practical method to reliably estimate the perceived complexity of popular music drum patterns. A dedicated R script that provides the complexity estimate of a drum pattern based on its input pattern matrix S is available (see Data availability statement). In 99% of drum pattern sets drawn from the repertoire, the estimated complexity (γ i ) values can be expected to be good approximations (R 2 ≥ .709) of perceived complexity.

Types of drum pattern complexity
Fig. 6. Update matrix U with ones highlighted (bold, gray background).

In addition to estimating the overall complexity of a drum pattern, the surprisal function may also serve to describe how drum
patterns can be complex in different ways. Fig. 2 shows transcriptions of four drum patterns: the first, in Fig. 2 (a), is the generic backbeat pattern, and the three others, in Fig. 2 (b-d), are examples from the Senn et al. ( [22]) stimuli set. Fig. 5 shows the corresponding surprisal plots, where each value represents the sum of the surprisals caused by the three instrumental layers at each of the 65 discrete time points (the horizontal time axes in Fig. 5 refer to beats measured in quarter notes). Fig. 2 (a) is the transcription of the backbeat pattern, the most generic pattern in the Western popular music repertoire. The corresponding surprisal plot in Fig. 5 (a) shows only low surprisal values, which diminish as the pattern is being perceived. The overall estimated complexity of this pattern is low (γ = 0.248). This agrees with intuition: firstly, the backbeat pattern is what listeners expect to hear (see the probabilities of the prior model M 0 in Table 3); secondly, since the pattern repeats literally with a lag of k = 8 discrete time units (half bars), the predictions of the inner model get better as the pattern unfolds, leading to a further decline of the surprisal function.
− Stimulus 8 (Fig. 2 (b), "Change the World") is essentially a backbeat pattern with a few surprising extra notes. In particular, drummer John Robinson added bass drum notes in bars 2 and 4, and one additional hi-hat (cymbal) note in bar 4. Fig. 5 (b) clearly marks these events as surprises, as discrete time moments where surprisal is high. These are local surprises in an otherwise generic pattern, which increase the estimated complexity of this pattern (γ 8 = 1.511, see Table 1) compared to the backbeat pattern.
− In stimulus 13 (Fig. 2 (c), "Bravado"), the complexity is not triggered by local events.
Rather, this drum pattern by Neil Peart has a more complicated basic pattern: a dense hi-hat voice with events on most 16th-note positions (whereas the backbeat pattern is based on 8th notes in the hi-hat). Also, the bass drum plays a four-to-the-floor pattern with kicks on every quarter note (compared to only beats one and three in the backbeat pattern). The unexpected notes of this basic pattern increase the surprisal function values, particularly in the first bar, as seen in Fig. 5 (c). Yet, since the pattern is strictly repeating with a period of half a bar, the model M 13 quickly learns these regularities, and surprisal declines as the pattern unfolds. The complexities of all forty drum patterns can to a certain extent be explained as a combination of the complicatedness (or unconventionality) of a basic repeating pattern, which deviates from the generic backbeat and thus from M 0 , and of local surprises (or the omission of repetition), which reduce the learnability of the pattern.

Limitations
The model development started from a variety of simplifying assumptions, which affect the realism of the model and can thus be understood as limitations of the model:
− The decision to use a discrete time domain with K = 65 fixed time positions, where an event does or does not happen, disregards the influence of tempo, microtiming, and dynamics on perceived complexity. It may well be that listeners perceive a pattern as more complex if it is played faster (or slower), or if some notes are played louder (or softer) or a few milliseconds later (or earlier). Yet, this study's model will not detect such differences.
− The model assumes that the listener is perfectly aware of meter and tempo from the very first moment when a pattern starts to play.
Within the model, the sensory data S i is automatically mapped onto the enculturated model M 0 . Yet, in most everyday listening situations, meter and tempo first need to be inferred by the listener as the pattern unfolds, and this might incur additional effort (and thus complexity) on the part of the listener (see also [41] on beat and meter perception). This additional effort is not represented in the model.
− The model treats the three instrumental voices (bass drum, snare drum, cymbals) equally when it comes to their contribution to the surprisal function. Yet, surprising events in one instrument might have a stronger effect on the listener than events in another instrument (see also [9], Supporting Material Text S2, p. 1). In order to explore this possibility, additional weight parameters were added to the model and optimised to evaluate the contributions of the instruments. However, the added parameters led to a negligible improvement of model fit (R 2 = .854) and were discarded.
− The model does not consider the properties of short-term memory: an event (or non-event) at time k affects the probabilities of the events at k + 8, k + 16, k + 32, and k + 64 in exactly the same way, without regard to limited retention times in short-term memory ( [42]). Listeners might remember events that happened half a bar ago more vividly than events that happened four bars ago.
− The model assumes an understanding of drum pattern complexity that is compatible with the above definitions of perceived complexity and estimated complexity, relating to pairwise comparison trials and the Bradley-Terry probability model. The model will potentially have a worse fit with perceived complexity measures that have been obtained using different measurement methods (e.g. Likert ratings).
Further, there are two major aspects that limit a generalised use of the model. Firstly, the modelling work is based on only 40 drum patterns. It is difficult to assess how potential idiosyncrasies of the selected patterns affect the model parameters and hence the ability of the model to predict the perceived complexity of drum patterns randomly drawn from the repertoire. It would be beneficial to replicate this study on a larger sample of drum patterns. Secondly, the model can only be applied in a limited number of cases, due to the simplifying assumptions and the specific nature of the empirical data used to fit the model. The model makes it possible to estimate the perceived complexity of popular music drum patterns of 4 bars duration (plus beat one of the fifth bar), in common 4/4 time, with binary subdivisions on all levels down to the 16th-note level, and with an instrumentation of bass drum, snare drum, and cymbals. The model also assumes that the listener has previous knowledge of popular music drum patterns (M 0 ) and is aware of the meter and of the tendency of patterns to repeat with certain periodicities (U). In all other situations, the model is not applicable. Even within these constraints, however, the model might be helpful in practice, since many drum patterns used as stimuli in music psychological studies correspond to this description or can quite easily be converted to this format, and a majority of listeners have heard this kind of drum pattern in the past.

Conclusions
This study modelled the complexity of Western popular music drum patterns with three different instrumental voices (bass drum, snare drum, cymbals) based on the ideas of predictive coding, a general theory of perception and learning. To my knowledge, implementing ideas from cognitive theory in this way is a novelty in the modelling of musical complexity. Existing complexity models in the field are usually based on domain-specific music theoretic concepts such as syncopation (see Ref. [9]), or on information theoretic concepts such as Kolmogorov complexity (see Ref. [13]), but not on cognitive theories.
The model involves ideas on enculturation and a Bayesian learning process, which, I think, is a novelty as well. It firstly derives listeners' expectations from a large corpus of drum patterns, thus simulating the enculturation of the listeners (for a similar approach to simulating enculturation in rhythm perception, see Ref. [43]). It secondly uses a short-term learning mechanism that adapts the listeners' expectations at each moment in a pattern based on previous events. Finally, it quantifies surprisal based on the prediction errors and uses surprisal to calculate an estimated complexity score that is compatible with the perceived complexity measures presented in Senn et al. ( [22]).
The successful use of predictive coding theory to model drum pattern complexity also shines a light back on predictive coding. The good fit of the model adds circumstantial evidence to the growing number of studies indicating that predictive coding is not only an elegant and parsimonious theory of the mind and of cognitive processes, but that it also explains empirical findings from behavioural and neurological studies with remarkable success (see Refs. [27,[44][45][46]). In the present study, a rigorous implementation of one basic predictive coding idea led to a very good model relying on the optimisation of only a single parameter. This is an indirect indication that predictive coding theory in general, and the surprisal function in particular, indeed grasps essential aspects of perceived drum pattern complexity.
Percepts are often complex in many different ways. Music, for example, may be complex due to its rhythmic properties, but harmony, melody, polyphony, instrumentation, or any other musical dimension may also add to complexity. The complexity of the entire percept can be conceived as an aggregate of all surprises occurring in the different dimensions, affecting one single surprisal function in the listener. The aggregation of complexities might even encompass different sensory modalities: both visual and auditory information might affect the perceiver's surprisal function simultaneously and contribute to the complexity of dance or film. Finally, a surprisal-based theory of complexity might also be able to account for enculturation (an enculturated perceiver potentially starts from a more accurate internal model, whereas a stranger to the culture might start from an inadequate model) or domain-specific expertise (experts might learn faster than non-experts). The surprisal-based theory of perceived complexity may be expanded beyond music to other arts and to further domains of perception, and it might potentially represent the nucleus of a general theory of perceived complexity.

Author contribution statement
Olivier Senn: Conceived and designed the study; Created the mathematical model; Analyzed and interpreted the data; Contributed materials, analysis tools or data; Wrote the paper.

Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability statement
Data associated with this study has been deposited at https://zenodo.org/record/7520032#.

Additional information
No additional information is available for this paper.

Declaration of competing interest
The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.