Isolated guitar transcription using a deep belief network

Music transcription involves the transformation of an audio recording to common music notation, colloquially referred to as sheet music. Manually transcribing audio recordings is a difficult and time-consuming process, even for experienced musicians. In response, several algorithms have been proposed to automatically analyze and transcribe the notes sounding in an audio recording; however, these algorithms are often general-purpose, attempting to process any number of instruments producing any number of notes sounding simultaneously. This paper presents a polyphonic transcription algorithm that is constrained to processing the audio output of a single instrument, specifically an acoustic guitar. The transcription system consists of a novel note pitch estimation algorithm that uses a deep belief network and multi-label learning techniques to generate multiple pitch estimates for each analysis frame of the input audio signal. Using a compiled dataset of synthesized guitar recordings for evaluation, the algorithm described in this work results in an 11% increase in the f-measure of note transcriptions relative to Zhou et al.'s transcription algorithm. This paper demonstrates the effectiveness of deep, multi-label learning for the task of polyphonic transcription.


Introduction
Music transcription is the process of converting an audio signal into a music score that informs a musician which notes to perform and how they are to be performed. This is accomplished through the analysis of the pitch and rhythmic properties of an acoustical waveform. In the composition or publishing process, manually transcribing each note of a musical passage to create a music score for other musicians is a labour-intensive procedure (Hainsworth and Macleod, 2003). Manual transcription is slow and error-prone: even notationally fluent and experienced musicians make mistakes, require multiple passes over the audio signal, and draw upon extensive prior knowledge to make complex decisions about the resulting transcription.

In response to the time-consuming process of manually transcribing music, researchers in the multidisciplinary field of music information retrieval (MIR) have summoned their knowledge of computing science, electrical engineering, music theory, mathematics, and statistics to develop algorithms that aim to automatically transcribe the notes sounding in an audio recording. Although the automatic transcription of monophonic (one note sounding at a time) music is considered a solved problem (Benetos et al., 2012), the automatic transcription of polyphonic (multiple notes sounding simultaneously) music "falls clearly behind skilled human musicians in accuracy and flexibility" (Klapuri, 2004). In an effort to reduce the complexity, the transcription problem can be constrained by limiting the number of notes that sound simultaneously, the genre of music being analyzed, or the number and type of instruments producing sound. A constrained domain allows the transcription system to "exploit the structure" (Martin, 1996) by leveraging known priors on observed distributions, and consequently reduce the difficulty of transcription. This parallels systems in the more mature field of speech recognition, where practical algorithms are often language, gender, or speaker dependent (Huang et al., 2001).

Automatic guitar transcription is the problem of automatic music transcription with the constraint that the audio signal being analyzed is produced by a single electric or acoustic guitar. Guitarists typically publish and share transcriptions in the form of tablature rather than common western music notation. Therefore, automatic guitar transcription algorithms should also be capable of producing tablature. Guitar tablature is a symbolic music notation system with a six-line staff representing the strings on a guitar. The top line of the system represents the highest pitched (thinnest diameter) string and the bottom line represents the lowest pitched (thickest diameter) string. A number on a line denotes the guitar fret that should be depressed on the respective string. An example of guitar tablature below its corresponding common western music notation is presented in Figure 1.

Many existing pitch estimation algorithms do not impose constraints on the polyphony of pitch estimates at any given time. In response to these points, the pitch estimation algorithm described in this work uses a deep belief network in conjunction with multi-label learning techniques to produce multiple pitch estimates for each audio analysis frame. After estimating the pitch content of the audio signal, existing algorithms in the literature are used to track the temporal properties (onset time and duration) of each note event and convert this information to guitar tablature notation.

Background
The first polyphonic transcription system, developed for duets, imposed constraints on the frequency range and timbre¹ of the two input instruments as well as the intervals between simultaneously performed notes (Moorer, 1975). This work provoked a significant amount of research on this topic, which still aims to further the accuracy of transcriptions while gradually eliminating domain constraints.

In the infancy of the problem, polyphonic transcription algorithms relied heavily on digital signal processing techniques to uncover the fundamental frequencies present in an input audio waveform. To this end, several different algorithms have been proposed: perceptually motivated models that attempt to model human audition (Klapuri, 2005); salience methods, which transform the audio signal to accentuate the underlying fundamental frequencies (Klapuri, 2006; Zhou et al., 2009); iterative estimation methods, which iteratively select a predominant fundamental from the frequency spectrum and then subtract an estimate of its harmonics from the residual spectrum until no fundamental frequency candidates remain (Klapuri, 2006); and joint estimation, which holistically selects fundamental frequency candidates that, together, best describe the observed frequency domain of the input audio signal (Yeh et al., 2010).

¹ Timbre refers to several attributes of an audio signal that allow humans to attribute a sound to its source and to differentiate, for instance, between a trumpet and a piano. Timbre is often referred to as the "colour" of a sound.

The MIR research community is gradually adopting a machine-learning-centric paradigm for many MIR tasks, including polyphonic transcription. Several innovative applications of machine learning algorithms to the task of polyphonic transcription have been proposed, including hidden Markov models (HMMs) (Raphael, 2002), non-negative matrix factorization (Smaragdis and Brown, 2003; Dessein et al., 2010), support vector machines (Poliner and Ellis, 2007), shallow artificial neural networks (Marolt, 2004), and recurrent neural networks (Boulanger-Lewandowski, 2014). Although each of these algorithms operates differently, the underlying principle involves the formation of a model that seeks to capture the harmonic, and perhaps temporal, structures of notes present in a set of training audio signals.

Some models for chord and pitch estimation attempt to produce the fingering or chords of a guitar rather than the notes themselves. Barbancho et al. developed a model that estimates which strings are pressed, and at what point they are pressed, on the guitar fretboard. This model attempts to recover fingering immediately rather than build or arrange fingering later. The authors report a frame-wise recognition rate of 77.42%.

After note pitch estimation it is necessary to perform note tracking, which involves the detection of note onsets and offsets (Benetos and Weyde, 2013). Several techniques have been proposed in the literature, including a multitude of onset estimation algorithms (Bello et al., 2005; Dixon, 2006), HMM note-duration modelling algorithms (Ryynänen and Klapuri, 2005), and an HMM frame-smoothing algorithm (Poliner and Ellis, 2007). The output of these note tracking algorithms is a sequence of note event estimates, each having a pitch, onset time, and duration.

These note events may then be digitally encoded in a symbolic music notation, such as tablature notation, for cataloguing or publishing. Arranging tablature is challenging because the guitar is capable of producing the same pitch in multiple ways. Therefore, a "good" arrangement is one that is biomechanically easy for the musician to perform, such that transitions between notes do not require excessive movement of the fretting hand.

Deep Belief Networks

A deep belief network (DBN) is pretrained in an unsupervised fashion and then fine-tuned, using backpropagation, to map input feature vectors to the desired network output (Hinton, 2007).

In order to pretrain the network weights in an unsupervised fashion, it is necessary to think of the network as a generative model rather than a discriminative model. A generative model aims to form an internal model of a set of observable data vectors, described using latent variables; the latent variables then attempt to recreate the observable data vectors with some degree of accuracy. On the other hand, a discriminative model aims to set the value of its latent variables, typically used for the task of classification or regression, without regard for recreating the input data vectors. A discriminative model does not explicitly care how the observed data was generated, but rather focuses on producing correct values of its latent variables.

Hinton et al. (2006) proposed that a deep neural network be composed of several restricted Boltzmann machines (RBMs) stacked on top of each other, such that the network can be viewed as both a generative model and a discriminative model. An RBM is an undirected bipartite graph with m visible nodes and n hidden nodes, as depicted in Figure 2. Typically, the domains of the visible and hidden nodes are binary, such that v ∈ {0, 1}^m and h ∈ {0, 1}^n, respectively, with the energy of a joint configuration given by

E(v, h) = −h^T W v,   (1)

where W ∈ R^{n×m} is the matrix of weights between the visible and hidden nodes. For simplicity, Equation 1 omits the bias terms of the visible and hidden nodes.

Figure 3. Workflow of the proposed polyphonic transcription algorithm, which converts the recording of a single instrument to a sequence of MIDI note events that are then translated to tablature notation.
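As a minimal illustration of the generative view (not the paper's implementation), the following sketch computes the RBM conditional probabilities used during block Gibbs sampling; the bias vectors b and c are assumptions, since Equation 1 omits bias terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c, rng):
    """p(h_j = 1 | v) = sigmoid(Wv + c): probabilities and a binary sample of
    the n hidden nodes. c is the hidden bias vector (an assumption here)."""
    p_h = sigmoid(W @ v + c)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, b, rng):
    """p(v_i = 1 | h) = sigmoid(W^T h + b): reconstructs the m visible nodes,
    which is the generative direction of the RBM. b is the visible bias."""
    p_v = sigmoid(W.T @ h + b)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)
```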
Supervised Fine-tuning

The unsupervised pretraining of the stacked RBMs is a relatively efficient method that sets good initial values for the network weights. However, in the case of a supervised learning task such as classification or regression, the ground-truth labels for each training data vector have not yet been considered. The supervised fine-tuning step of the DBN addresses this issue.

One method of supervised fine-tuning is to add a layer of output nodes to the network for the purposes of (logistic) regression and to perform standard backpropagation as if the DBN were a multi-layered neural network (Bengio, 2009). Rather than creating features from scratch, this fine-tuning method is responsible for modifying the latent features in order to adjust the class boundaries (Hinton, 2007).

After fine-tuning the network, a feature vector can be fed forward through the network and a result realized at the output layer. In the context of pitch estimation, the feature vector represents the frequency content of an audio analysis frame and the output layer of the network is responsible for classifying the pitches that are present.

Transcription Algorithm
The workflow of the proposed polyphonic transcription algorithm is presented in Figure 3. The algorithm consists of an audio signal preprocessing step, followed by a novel DBN pitch estimation algorithm. A note tracking step then determines when each estimated note event occurs. This sequence of note events is then translated to guitar tablature notation using a graph-search algorithm.

Audio Signal Preprocessing
The input audio signal is first preprocessed before feature extraction. If the audio signal is stereo, the channels are averaged to produce a mono audio signal. Then the audio signal is decimated to lower the sampling rate f_s by an integer multiple, k ∈ N⁺. Decimation involves low-pass filtering, with a cutoff frequency of f_s/2k, followed by downsampling. Each analysis frame of the decimated signal is then transformed into a vector of m normalized features, yielding a feature matrix Φ ∈ [0, 1]^{n×m}, such that n is the number of analysis frames spanning the input signal.
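A minimal sketch of this preprocessing, using the DFT power spectrum as the per-frame features; the decimation factor, frame size, hop size, and normalization shown are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import decimate

def preprocess(signal, k=4, frame_size=2048, hop_size=1024):
    """Downmix, decimate, and compute normalized power-spectrum features:
    one m-dimensional feature vector per analysis frame."""
    if signal.ndim == 2:                          # stereo -> mono
        signal = signal.mean(axis=1)
    signal = decimate(signal, k)                  # low-pass filter + downsample
    frames = np.array([signal[i:i + frame_size]
                       for i in range(0, len(signal) - frame_size + 1, hop_size)])
    power = np.abs(np.fft.rfft(frames * np.hanning(frame_size), axis=1)) ** 2
    return power / power.max()                    # scale features into [0, 1]
```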

Pitch Estimation
The DBN consumes these normalized audio features; hence, the input layer consists of m nodes. There can be any number of stochastic binary hidden layers, each consisting of any number of nodes. The output layer of the network consists of k + p nodes, where the first k nodes are allocated for pitch estimation and the final p nodes are allocated for polyphony estimation. The network uses a sigmoid activation as the non-linear transfer function.

The feature vectors Φ are fed forward through the network with parameters Θ, resulting in a matrix of probabilities P(Ŷ | Φ, Θ) ∈ [0, 1]^{n×(k+p)} that is then split into a matrix of pitch probabilities P(Ŷ^(pitch) | Φ, Θ) ∈ [0, 1]^{n×k} and a matrix of polyphony probabilities P(Ŷ^(poly) | Φ, Θ) ∈ [0, 1]^{n×p}.

The polyphony of the i-th analysis frame is estimated by selecting the polyphony class with the highest probability using the equation

ŷ^(poly)_i = argmax_j P(Ŷ^(poly)_ij | Φ, Θ).   (2)

Pitch estimation is performed using a multi-label learning technique similar to the MetaLabeler system (Tang et al., 2009): the pitch probabilities of each analysis frame are ranked, and the number of pitches indicated by the frame's polyphony estimate are selected. After pretraining, supervised fine-tuning with backpropagation is conducted on the network to adjust the latent features and class boundaries (Hinton, 2007).
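A sketch of this decoding step for a single analysis frame, assuming (as the design of the polyphony classes suggests) that the first polyphony class denotes silence:

```python
import numpy as np

def decode_frame(pitch_probs, poly_probs):
    """Multi-label decoding for one analysis frame: estimate the polyphony
    with Equation 2, then keep that many of the highest-probability pitches."""
    polyphony = int(np.argmax(poly_probs))        # assumed: class 0 is silence
    active = np.argsort(pitch_probs)[::-1][:polyphony]
    estimate = np.zeros(len(pitch_probs), dtype=int)
    estimate[active] = 1                          # binary pitch label vector
    return estimate
```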

The canonical error function to be minimized for a set of separate pitch and polyphony binary classifications is the cross-entropy error function, which forms the training signal used for backpropagation:

L(Θ) = −Σ_{i=1}^{n} Σ_{j=1}^{k+p} [ y_ij ln ŷ_ij + (1 − y_ij) ln(1 − ŷ_ij) ],   (3)

where y_ij is the j-th ground-truth label bit of the i-th analysis frame and ŷ_ij is the corresponding network output probability. The aim of this objective function is to adjust the network weights Θ to pull output node probabilities closer to one for ground-truth label bits that are on and pull probabilities closer to zero for bits that are off.
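A minimal sketch of this objective, assuming the labels and outputs are stored as n × (k + p) matrices:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Summed binary cross-entropy (Equation 3) between ground-truth label
    bits y and network output probabilities y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)        # guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```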

The described pitch estimation algorithm was implemented using the Theano numerical computation library.

Note Tracking

Although frame-level pitch estimates are essential for transcription, converting these estimates into note events with an onset and duration is not a trivial task. The purpose of note tracking is to process these pitch estimates and determine when a note onsets and offsets.
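The paper adopts existing note tracking algorithms from the literature; purely as a naive illustration, the sketch below converts a binary matrix of frame-level pitch estimates into note events, assuming a fixed hop duration between frames.

```python
import numpy as np

def track_notes(pitch_roll, hop_seconds):
    """Convert a binary (frames x pitches) matrix of pitch estimates into
    (pitch, onset, duration) note events: a note onsets when its pitch bit
    turns on and offsets when it turns off."""
    n, k = pitch_roll.shape
    padded = np.vstack([np.zeros((1, k)), pitch_roll, np.zeros((1, k))])
    events = []
    for pitch in range(k):
        diff = np.diff(padded[:, pitch])
        onsets = np.where(diff == 1)[0]           # frames where the bit turns on
        offsets = np.where(diff == -1)[0]         # frames where the bit turns off
        for on, off in zip(onsets, offsets):
            events.append((pitch, on * hop_seconds, (off - on) * hop_seconds))
    return sorted(events, key=lambda e: e[1])     # order by onset time
```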

Guitar Tablature Arrangement

The MIDI file output by the algorithm thus far contains the note event (pitch, onset, and duration) transcriptions of an audio recording. However, a MIDI file lacks certain information necessary to write sheet music in common western music notation, such as time signature, key signature, clef type, and the value (duration) of each note described in divisions of a whole note.
There are several robust open-source programs that derive this missing information from a MIDI file using logic and heuristics in order to generate common western music notation that is digitally encoded in the MusicXML file format. MusicXML is a standardized extensible markup language (XML) definition allowing digital symbolic music notation to be universally encoded and parsed by music applications.

In this work, the command-line tools shipped with the open-source application MuseScore are used to convert MIDI to common western music notation encoded in the MusicXML file format.²

The graph-based guitar tablature arrangement algorithm developed by Burlet and Fujinaga (2013) is used to append a guitar string and fret combination to each note event encoded in a MusicXML transcription file. The guitar tablature arrangement algorithm operates by using Dijkstra's algorithm to search for the shortest path through a directed weighted graph, in which the vertices represent candidate string and fret combinations for a note or chord, as displayed in Figure 6. The edge weights between nodes in the graph indicate the biomechanical difficulty of transitioning between fretting-hand positions. Three biomechanical complexity factors are aggregated to form each edge weight: the fret-wise distance required to transition between notes or chords, the fret-wise finger span required to perform chords, and a penalty of one if the fretting hand surpasses the seventh fret.

Figure 6. A directed acyclic graph of string and fret candidates for a note and chord followed by two more notes. Weights have been omitted for clarity. The notation for each node is (string number, fret number).
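The following sketch illustrates the spirit of this search: because the graph is a layered directed acyclic graph, dynamic programming over the layers finds the same shortest path that Dijkstra's algorithm would. The weight function shown is a simplified stand-in for the three aggregated complexity factors, not the published algorithm.

```python
def arrange(layers, weight):
    """Find the lowest-cost sequence of (string, fret) candidates, one per
    note, through a layered directed acyclic graph. `layers` is a list of
    candidate lists; `weight(a, b)` is the biomechanical transition cost."""
    costs = {c: 0 for c in layers[0]}
    back = []
    for prev_layer, layer in zip(layers, layers[1:]):
        links, new_costs = {}, {}
        for c in layer:
            best = min(prev_layer, key=lambda p: costs[p] + weight(p, c))
            new_costs[c] = costs[best] + weight(best, c)
            links[c] = best
        costs = new_costs
        back.append(links)
    node = min(costs, key=costs.get)              # cheapest final fingering
    path = [node]
    for links in reversed(back):                  # trace back the shortest path
        node = links[node]
        path.append(node)
    return path[::-1]

def weight(a, b):
    """Illustrative cost: fret-wise transition distance, plus a penalty of
    one if the fretting hand surpasses the seventh fret."""
    return abs(a[1] - b[1]) + (1 if b[1] > 7 else 0)
```

For chords, each vertex would hold a set of string and fret pairs, and the weight would also add the fret-wise finger span.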
The value of this penalty and the fret threshold were determined through subjective analysis of the resulting tablature arrangements. In the event that a note is followed by a chord, the fret-wise distance is calculated by the expression

| f − (1/|g|) Σ_i g_i |,   (4)

such that f ∈ N is the fret number used to perform the note and g is a vector of the fret numbers used to perform the chord.

Frame-Level Evaluation Metrics

The following metrics are used to evaluate the frame-level pitch estimates.

Precision:

p = Σ_{i=1}^{n} |ŷ_i ∩ y_i| / Σ_{i=1}^{n} |ŷ_i|,   (5)

such that ŷ_i is the estimated label set of the i-th analysis frame and y_i is the corresponding ground-truth label set.

Recall:

r = Σ_{i=1}^{n} |ŷ_i ∩ y_i| / Σ_{i=1}^{n} |y_i|.   (6)

f-measure:

f = 2pr / (p + r),   (7)

such that p and r are the precision and recall calculated using Equation 5 and Equation 6, respectively.

Polyphony recall:

r^(poly) = (1/n) Σ_{i=1}^{n} 1{ŷ^(poly)_i = y^(poly)_i},   (8)

such that 1{·} is an indicator function that returns 1 if the predicate is true, and n is the number of audio analysis frames being evaluated. In other words, this equation calculates the number of correct polyphony estimates across all audio analysis frames divided by the number of analysis frames.

One error: given the matrix of pitch probabilities P(Ŷ^(pitch) | Φ, Θ) ∈ [0, 1]^{n×k} output by the DBN with model parameters Θ when processing the input audio analysis frame features Φ, the predominant pitch of the i-th audio analysis frame is calculated using the equation

ŷ^(max)_i = argmax_j P(Ŷ^(pitch)_ij | Φ, Θ),   (9)

which can then be used to calculate the one error:

(1/n) Σ_{i=1}^{n} 1{ŷ^(max)_i ∉ y_i},   (10)

such that 1{·} is an indicator function that maps to 1 if the predicate is true. The one error calculates the fraction of analysis frames in which the top-ranked label is not present in the ground-truth label set.

In the context of pitch estimation, this metric provides insight into the number of audio analysis frames where the predominant pitch, often referred to as the melody, is estimated incorrectly.

Hamming loss:

(1/nk) Σ_{i=1}^{n} Σ_{j=1}^{k} 1{ŷ_ij ≠ y_ij},   (11)

such that n is the number of audio analysis frames and k is the cardinality of the label set for each analysis frame; the Hamming loss is therefore the fraction of misclassified label bits.

Note-Level Evaluation Metrics

The following metrics evaluate the note events output after note tracking.

Precision:

p = |N̂ ∩ N| / |N̂|,   (12)

such that N̂ is the set of estimated note events and N is the set of ground-truth note events.

Recall:

r = |N̂ ∩ N| / |N|,   (13)

such that N̂ is the set of estimated note events and N is the set of ground-truth note events.

f-measure:

f = 2pr / (p + r),   (14)

such that p and r are calculated using Equation 12 and Equation 13, respectively.
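Assuming binary label matrices, the frame-level metrics above can be computed directly; a minimal sketch with illustrative names:

```python
import numpy as np

def frame_metrics(y, y_hat):
    """Frame-level precision, recall, and f-measure (Equations 5-7) over
    binary label matrices of shape (frames x pitches)."""
    tp = float(np.sum(y * y_hat))                 # correctly estimated bits
    p = tp / max(float(np.sum(y_hat)), 1.0)
    r = tp / max(float(np.sum(y)), 1.0)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

def hamming_loss(y, y_hat):
    """Fraction of misclassified label bits (Equation 11)."""
    return float(np.mean(y != y_hat))
```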

The criteria for a note event being correct, as compared to a ground-truth note event, are as follows:

• The pitch name and octave number of the note event estimate and ground-truth note event must be equivalent.

• The note event estimate's onset time is within ±250 ms of the ground-truth note event's onset time.

• Only one ground-truth note event can be associated with each note event estimate.

The offset time of a note event is not considered in the evaluation process because offset times exhibit considerable variability in both performance and annotation. It remains to be seen whether machine learning transcription algorithms also succumb to a high number of octave errors.
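A minimal sketch of this note-matching procedure, with illustrative data structures (notes as (pitch, onset) pairs; durations and offsets ignored, per the criteria above):

```python
def note_metrics(estimates, truths, window=0.25):
    """Note-level precision, recall, and f-measure (Equations 12-14). An
    estimate matches an unmatched ground-truth note with equal pitch and an
    onset within +/-250 ms; each truth is associated with at most one estimate."""
    matched, hits = set(), 0
    for pitch, onset in estimates:
        for i, (t_pitch, t_onset) in enumerate(truths):
            if i not in matched and pitch == t_pitch and abs(onset - t_onset) <= window:
                matched.add(i)                    # enforce one-to-one association
                hits += 1
                break
    p = hits / len(estimates) if estimates else 0.0
    r = hits / len(truths) if truths else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f
```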

The polyphonic transcription algorithm described in this paper is evaluated on a new dataset of synthesized guitar recordings. Before processing these guitar recordings, the number of pitches k and the maximum polyphony p of the instrument must first be calculated in order to construct the DBN. Knowing that the input instrument is a guitar with six strings, the pitch estimation algorithm considers the k = 51 pitches from C2 to D6, which spans the lowest note capable of being produced by a guitar in Drop C tuning to the highest note capable of being produced by a 22-fret guitar in Standard tuning. Though a guitar with six strings is only capable of producing six notes simultaneously, a chord transition may occur within a frame and so the maximum polyphony may increase above this bound; this is a technical side effect of a sliding-window analysis of the audio signal. Therefore, the maximum frame-wise polyphony is calculated from the training dataset using the equation

p = max_i ( y_i · 1 ) + 1,   (15)

where y_i is the binary ground-truth pitch vector of the i-th analysis frame and 1 is a vector of ones. The addition of one to the maximum polyphony is to accommodate silence, where no pitches sound in an analysis frame.
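A one-function sketch of Equation 15, assuming the ground-truth pitch labels are stored as a binary matrix Y with one row per analysis frame:

```python
import numpy as np

def max_polyphony(Y):
    """Maximum frame-wise polyphony of the training set (Equation 15): the
    largest count of simultaneously active pitch bits in any analysis frame,
    plus one polyphony class to accommodate silent frames."""
    return int(Y.sum(axis=1).max()) + 1
```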

The experiments outlined in this section evaluate the accuracy of the pitch estimates output by the DBN, which is trained and tested on recordings synthesized with multiple guitar models (Table 1). The comparative evaluation, which follows the methodology used to compare algorithms from different institutions on the same dataset, is conducted using the metrics of precision, recall, and f-measure.⁵

Relative to a ground-truth note event, an estimate is considered correct if its onset time is within ±250 ms of the ground-truth onset time and its pitch is equivalent.

(Table: experiment variables, including the DFT power spectrum features and the frame-level f-measure‡. † denotes the independent variable; ‡ denotes the dependent variable.)

Results and Discussion: The hypothesis speculated that increasing the number of hidden layers, and consequently the number of model parameters, would increase frame-level pitch estimation f-measure.

Given prior work on deep neural networks, the depth of the network is often viewed as providing an advantage to the model over a shallow neural network. Thus it is reasonable to assume that increasing the number of hidden layers in the deep network will yield increasingly better results; however, the results presented in Table 5 provide evidence to the contrary.

The results invalidate the hypothesis and suggest that a more complex model, with more layers, does not necessarily yield better transcriptions. Moreover, by training on synthesized recordings we potentially overfit to a model of a guitar rather than to guitar playing.

Furthermore, synthetic examples are typically well timed, with little clock slew or swing involved.

Real recordings of guitar music will start at different times, change tempo, and have far more noise; the lack of such noise and randomness in synthesized output should be a concern. Synthesis also makes assumptions in terms of timbre and performance that real recordings do not share.

The polyphonic transcription algorithm described in this paper is capable of forming discriminative latent features of the input audio.