The Jazz Ontology: A semantic model and large-scale RDF repositories for jazz

Jazz is a musical tradition that is just over 100 years old; unlike in other Western musical traditions, improvisation plays a central role in jazz. Modelling the domain of jazz poses some ontological challenges due to specificities in musical content and performance practice, such as band lineup fluidity and the importance of short melodic patterns for improvisation. This paper presents the Jazz Ontology, a semantic model that addresses these challenges. The model also describes workflows for annotating recordings with melody transcriptions and for pattern search. The Jazz Ontology incorporates existing standards and ontologies such as FRBR and the Music Ontology. The ontology has been assessed by examining how well it supports describing and merging existing datasets and whether it facilitates novel discoveries in a music browsing application. The utility of the ontology is also demonstrated in a novel framework for managing jazz-related music information. This involves the population of the Jazz Ontology with the metadata from large-scale audio and bibliographic corpora (the Jazz Encyclopedia and the Jazz Discography). The resulting RDF datasets were merged and linked to existing Linked Open Data resources. These datasets are publicly available and are driving an online application that is being used by jazz researchers and music lovers for the systematic study of jazz.


Introduction
Jazz is an important musical tradition with a worldwide community of millions of people enjoying, playing, and studying it. While it is primarily considered a genre of Western music, it occupies a special position due to the central role of improvisation, which is not common in other Western music genres [1]. It originated as a mixture of traditional African-American styles with popular genres [2,3,4]. During its history of about 100 years [5], jazz has influenced and been influenced by other musical styles across cultures around the world [6,7,8]. With this rich tradition, jazz is a worthwhile object of semantic modelling, but it also poses challenges in the construction of musical ontologies.
Jazz is recorded, produced, sold, and consumed in a similar fashion to other Western musical genres; its discographic and bibliographic metadata are therefore to a significant extent in line with those of other genres. At the same time, its unique musical and performance characteristics break the limits of existing semantic models: e. g., musicians often play more than one instrument, the lineup of a band can change from performance to performance, and improvised solos and soloists play a special role that requires dedicated modelling. Scores are sometimes used, but in a much looser way than in Western classical music, often in the form of lead sheets which only indicate the basic melody of the tune and its harmonic structure.
For a comprehensive modelling of the jazz domain, several subdomains have to be considered, presenting a variety of data modalities and formats. Jazz as a musical sound is the subject of analysis by jazz musicians and musicologists; the LinkedJazz project, for example, combined semi-automatic analyses of musicians' interviews with the addition of new relationships [16]. Our recent project, Dig That Lick: Analysing Large-Scale Data for Melodic Patterns in Jazz Performances (DTL, http://dig-that-lick.eecs.qmul.ac.uk), was a two-year endeavour within the fourth Trans-Atlantic Program Digging into Data Challenge. It addressed jazz on a larger scale, combining various aspects of jazz and data modalities in an interdisciplinary approach. The work was concerned with the automatic analysis of melodic patterns ("licks") in jazz improvisations, aiming to trace musical influence based on the borrowing of licks [17]. The analysis workflow included automatic melody extraction from 50K audio recordings [18,19]; segmenting tracks into solo and other parts; and advanced pattern search on symbolic representations of solos [20]. This paper describes modelling, collecting, integrating, correcting, and enriching metadata and linking it to audio.
In the next section, we outline previous work in semantic modelling and dataset development relevant to music (FRBR, Music Ontology) and in particular to jazz (Weimar Jazz Database, LinkedJazz). In Section 3 we lay out the methodological framework that guided us during planning, modelling, and populating the Jazz Ontology. Section 4 presents the semantic model of the Jazz Ontology: the basic model, which represents and relates the discographic and the sessionographic information about jazz, as well as further additions modelling jazz-specific phenomena. Section 5 describes the process of ontology population for three jazz datasets, including dataset merging and enrichment, while Section 6 discusses the assessment of the ontology in terms of formal requirements and in-use validation. We wrap up with a discussion and future work suggestions.

Related work and datasets
One of the main standards for semantic modelling in cultural heritage is FRBR (Functional Requirements for Bibliographic Records). It is a conceptual model for describing entities and relationships in libraries, museums, and archives [21]. FRBR was developed by the International Federation of Library Associations and Institutions (IFLA) and is widely used by cultural institutions around the world, in particular for electronic cataloguing of physical and digital objects. As a conceptual model, it is separate from the language, cataloguing standards, or the actual implementation system, and provides the basis for interoperability between holdings, collections, and datasets [22]. FRBR Group 1 defines four main entities to represent the products of intellectual or artistic endeavour: "Work (a distinct intellectual or artistic creation) and expression (the intellectual or artistic realization of a work) reflect intellectual or artistic content. Manifestation (the physical embodiment of an expression of a work) and item (a single exemplar of a manifestation) reflect physical form." Group 2 includes persons and corporate bodies responsible for the custodianship of Group 1 intellectual or artistic endeavours (e. g., creators, consumers). Group 3 includes events and places [23] (Fig. 1). While too general in its pure form, FRBR is relevant for music [24] and in particular for jazz, and has been widely implemented or referred to in existing semantic models of music. It has been represented as an OWL ontology (http://vocab.org/frbr/core#). The Music Ontology (MO) [13,12] is among the most comprehensive ontologies for the music domain, providing a vocabulary for publishing and linking a wide range of music-related data on the Web (http://musicontology.com). It extends the FRBR model and provides an event-based conceptualisation of music creation, performance, production and consumption, including aspects such as discographic information, production processes as well as music-related events.
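As a sketch of the FRBR Group 1 chain described above, applied to a jazz tune: the property names frbr:realization, frbr:embodiment and frbr:exemplar are taken from the FRBR Core OWL vocabulary; the instance URIs and the ex: namespace are invented for illustration.

```turtle
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix ex:   <http://example.org/jazz/> .

ex:tune_body_and_soul      a frbr:Work .          # the tune as an abstract creation
ex:tune_body_and_soul
    frbr:realization       ex:performance_1939 .  # one recorded realization of the tune

ex:performance_1939        a frbr:Expression ;
    frbr:embodiment        ex:release_78rpm .     # embodied in a published release

ex:release_78rpm           a frbr:Manifestation ;
    frbr:exemplar          ex:copy_in_archive .   # a single physical disc

ex:copy_in_archive         a frbr:Item .
```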
The Music Ontology builds upon four broadly accepted domain models adapted to the music domain:
• the FRBR Ontology (http://vocab.org/frbr/core#, Fig. 1),
• the Event Ontology (http://purl.org/NET/c4dm/event.owl),
• the Timeline Ontology (http://purl.org/NET/c4dm/timeline.owl),
• FOAF, the Friend of a Friend ontology, describing relationships between persons (http://xmlns.com/foaf/spec/).
A broad range of applications [25,26] have been implemented based on the Music Ontology, from recommendation systems [27,28] to live performance [29], and numerous extensions covering music production [30,31,32,33] (http://isophonics.net/content/studio-ontology), audio effects [34,35], audio features [36] (https://w3id.org/afo/onto/), musical instruments [37], transformation and redistribution of audio content [38,39] (https://w3id.org/ac-ontology/aco), music-theoretical concepts [40,41,42], live music archives [43], smart instruments, and more generic "Musical Things" [44]. The Music Ontology has been successfully applied to a variety of musical content including many commercial pop music genres, electronic and classical music. Moreover, it was found to generalise well to a number of non-Western musical traditions [45,46].

Our model builds strongly on the Music Ontology as its main reference, yet some aspects, like frequently changing band lineups and the prevalence of multi-instrumentalists, are specific to the genre and need more detailed modelling.
Turning our attention to data sources relevant in the jazz domain, MusicBrainz is the largest crowd-sourced collection of music metadata online, mainly focused on discographic information about published CDs. It is widely used by applications and music consumers to automatically add metadata to their downloaded digital tracks [48,49]. The data is ingested according to detailed guidelines and curated by volunteers; it is therefore quite consistent, though errors cannot be excluded. An API is provided for data retrieval and sharing. The MusicBrainz data model has been expressed in OWL and the data is available in RDF format through the LinkedBrainz project (https://wiki.musicbrainz.org/LinkedBrainz). MusicBrainz is an important Linked Open Data resource for us even though its focus on releases and CD metadata does not meet the core needs of our ontology and its applications. For interoperability, we kept our data model consistent with MusicBrainz where possible. Another important aspect of MusicBrainz is its fingerprinting algorithm, which we use to identify track duplicates in our collection.
The Weimar Jazz Database is a collection of manually transcribed and richly annotated jazz solos. It was produced as part of the Jazzomat Project at the University of Music "Franz Liszt" Weimar in Germany between 2012 and 2017 and has become an important corpus for systematic jazz research [14]. The database comprises 456 instrumental jazz solos from 343 different recordings, and provides musical content annotations for meter, structural segmentation, measures, beats, chord labels, style, solo instrument and more. Discographic metadata is also included with the tracks. The Jazzomat data model has been developed with a detailed manual analysis of musical content as the main goal. Data is disseminated as an SQLite database; it has been represented in OWL as part of the JazzCats project [50]. The data released as part of Dig That Lick (see Section 1) and used to evaluate the Jazz Ontology builds on the findings and experience from the Jazzomat project, taking its approach into the realm of big data (tens of thousands of tracks) by automating manual tasks such as solo transcription.
LinkedJazz is a research project at the Pratt Semantic Lab concerned with Linked Open Data, and in particular documents and data related to the personal and professional lives of jazz artists [15,51,16]. The researchers crawled all the main Linked Open Data resources, such as DBpedia, Library of Congress authority files, MusicBrainz, and the Virtual International Authority File, to collect and link all entries related to jazz artists. The outcomes of automatic discovery (around 9,000 artists) were curated by researchers and volunteers [52] by means of a dedicated application. LinkedJazz researchers created a data model specifically describing jazz artists and relationships between them, implementing FOAF and the Music Ontology, and extending these further. We build upon the findings and the data model of this pioneering project, which enables us to easily link our ontology to existing Linked Open Data on jazz musicians. Only a small proportion of musicians in our data are currently referenced online; we hope to change that with our datasets.

The JazzCats project [53] successfully interlinked disparate jazz-related datasets by means of Semantic Technologies. The datasets in question were the Weimar Jazz Database, the LinkedJazz repository and the Body&Soul dataset, which provides a discography for over 200 performances of "Body and Soul" recorded between 1930 and 2004 [54]. A follow-up project further added connections to two other musicological datasets describing concert life in London in the 18th and 19th centuries [55,50]. The goal was to connect existing datasets, not to model the domain of jazz with its specific characteristics which differentiate it from other musical domains. Body&Soul implemented the Music Ontology for its discographic relationships, along with further relevant relationships among musicians (Fig. 3).
In contrast, the Weimar Jazz Database did not implement existing ontologies or schemas, and many of its classes were project-specific. In this case, RDF was produced from an SQLite3 database via an automated process. Given the very rich solo annotations produced in the Jazzomat project, most of them would not normally be available for other data sources, or different types of content annotations might be used.

In contrast to previous works with a stronger emphasis on interlinking data, we took a top-down approach, aiming to model the domain of jazz, first concentrating on elements which most jazz-related projects and datasets would have in common. We also focus on jazz-specific characteristics which have not been covered by more general models, such as the fluidity of band line-ups and the prevalence of multi-instrumentalists, as well as the centrality of live performance, soloing and improvisation. The ontological models and knowledge organisation approach proposed in this paper may transfer to a range of domains and applications that make use of semantic models of music data. These include frameworks that bind applications targeting different stakeholders in music production and consumption, for instance, to navigate music collections or reproduce music recordings adaptively using metadata [56,57]. Our work may also contribute to broader efforts supporting music informatics research through semantic integration of datasets [58,59,60].

Methodology
In this section, we describe the Jazz Ontology in accordance with the MIRO Guidelines for Ontology Reporting [61]. We address the compulsory sections of the Guidelines, keeping their numbering (such as A.1: Ontology owner or E.5: Entity naming convention), and, where relevant, optional ones.

The semantic model and the ontologies presented in this paper were developed to enable the systematic study of jazz, in particular queries across large, heterogeneous metadata repositories; to support metadata correction and disambiguation; to track workflows and provenance in manual and automatic content metadata creation; and to support linking audio and other media to clean metadata (B.1). The main aim of the Jazz Ontology is to provide interoperability between ontologies and datasets. It was not conceptualised as an explicit knowledge representation model.

Related work is presented in Section 2 (B.2). The target audience comprises jazz researchers and enthusiasts, libraries, archives, and jazz discographers (B.3).
In developing our ontology, we broadly followed the METHONTOLOGY methodological framework [62], which identifies six phases: 1. specification; 2. knowledge acquisition; 3. conceptualization of an informal model; 4. integration of existing ontologies; 5. implementation of the ontology (e. g., its formalization in OWL); and 6. evaluation. No plan has been made to develop the ontology further beyond the lifetime of Dig That Lick; however, its publication and documentation allow interested parties to take on further development, while further projects on related topics are planned by consortium partners (F.1).

Minor changes such as qualifying a property or adding an attribute would amount to a new subversion; more substantial additions (e. g., a new class) or deletions require a new version (F.2, F.3). Versioning and discussion can take place on GitHub.

For the aims of the Dig That Lick project we were broadly interested in questions such as:
• When, where and by whom was a given lick/pattern played?

• Were licks based on a given pattern popular at a particular time and place?
• Does a given lick/pattern appear more often during certain stages of a musician's career?
Beyond these research-specific goals, we also aimed to be able to answer general questions about discography and sessionography, as well as relationships between the artists involved. These included:
• Where and when were the tracks on this CD recorded?
• What is the band line-up for the given performance? Who played which instruments?
• Which bands have played/recorded a given tune?
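A competency question such as the last one can be operationalised as a SPARQL query over the populated datasets. In this sketch, dc:title is Dublin Core and event:sub_event comes from the Event Ontology; the dtl: namespace and the property names dtl:has_band and dtl:performance_of are illustrative placeholders, not necessarily the actual Jazz Ontology terms.

```sparql
PREFIX mo:    <http://purl.org/ontology/mo/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX dtl:   <http://example.org/dtl#>

# Which bands have played/recorded a given tune?
SELECT DISTINCT ?band ?bandName WHERE {
  ?session     a dtl:Session ;
               dtl:has_band    ?band ;
               event:sub_event ?performance .
  ?performance dtl:performance_of ?tune .
  ?tune        dc:title "Body and Soul" .
  ?band        foaf:name ?bandName .
}
```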

The full list of competency questions is given in Section S1 of the Supplement. The model should also account for the heterogeneity of the resources, varied data quality, and possible uncertainties in the data. Alongside conventional provenance, the origin of the annotations is also important in musicological research. The above use cases were expressed in the following formal requirements (FR):

FR1 Represent discographic concepts such as Track, Release, Album, Label.

FR2 Represent sessionographic information: Session with attributes for Date, Place, Band, Band Lineup, Musical Instruments; Tunes played in the Session.
FR3 Represent musician relationships, such as pairs of musicians who played together, toured together, or where one was band leader of the other.
FR4 Represent jazz performance segmentation, in particular, the Solo, as well as the concept of the Lick.
FR5 Represent provenance of the data, in particular the origin of annotations, and the workflow for automatically created data.
The entities of our semantic model are defined in Section S2 of the Supplement. For a detailed semantic model description see Section 4.

Knowledge acquisition was conducted by means of literature reviews and focus groups with jazz experts. Literature reviews were compiled by the project partners to scope the overall domain of jazz with a focus on Dig That Lick's objectives [63,64,65,66,67,68], in particular, to understand the perceptual process and the practicalities of improvisation in jazz [69,70,71,72], lick creation and borrowing, stylistic influence [73,1,17,74], and the discographic and the sessionographic documentation of jazz (a dedicated bibliography can be found in the Dig That Lick project deliverable "Licks in the Literature of Jazz Research", http://dig-that-lick.eecs.qmul.ac.uk/Docs/DTL--Lit%20Review.docx). Additionally, focus groups with jazz experts were conducted at the initial stage to formulate ontology specifications (Section 3.1) [75,76]. The main themes that emerged from the initial stage of knowledge acquisition were the importance of sessions and sessionography, the prominence of band leaders, the fluidity of band lineups, and the prevalence of multi-instrumentalists among jazz musicians. At more advanced stages of the research, focus groups with jazz experts were used to confirm the broader concepts (Section S2 of the Supplement); to discuss, test and improve the models illustrated by the entity-relationship diagrams presented in this paper (Section 4), ensuring the correct modelling of the datasets' metadata (Sections 5.1 and 5.2); and to validate data inference, merging (Section 5.4) and enrichment (Section 5.6).

Ontology content
The integration of relevant models listed in Section 2 is achieved through re-use or subclassing. We directly reuse entities from relevant external ontologies, which makes the Jazz Ontology dependent on them and might require remodelling if any of the incorporated ontologies change [77].

For musical entities, Music Ontology classes are re-used or, where further constraints are present, expressed as subclasses of Music Ontology classes (E.8). This approach ensures compliance with the FRBR general model, since it is the basis for the Music Ontology. A shortcut was introduced between Performance and Signal, omitting Sound and Recording, optionally permitting appropriately rich data to be captured using fewer triples. The details of the recording process are less important in jazz than they possibly are in other genres. The events and agents layers differ significantly from the Music Ontology; see Section 4.1.
The Event and Timeline ontologies were re-used where possible. One notable exception is the subclassing of timeline:Instant and timeline:Interval to express approximate date spans; see Section 4.3.

The Ordered List Ontology was used to retain the order of tunes within a medley; see Section 4.2.
Relationships were added to express specifics of jazz sessionography, e. g., a band having a leader. In particular, a variety of relationships between a performance and a tune were reflected in properties like has intro, variations of, theme of and changes of (E.4); see Section 4.2. For music artist relationships, properties from the LinkedJazz model are re-used (see Section 5.6). Entity naming conventions are applied consistently to classes and properties, with the exception of imported ontologies (e.g., LinkedJazz uses camelCase for properties) (E.5). Entity identification is only based on existing identifiers where these are unambiguously unique and are expected to remain unique in future; otherwise new unique identifiers are created (E.6). For example, identifiers for Sessions and Musicians from the Jazz Discography (see Section 5.2) are re-used, while musician names are not unambiguous and are not used for identification: a name can have different spellings or refer to more than one person. Musicians and Bands must have a name; Tunes and Tracks must have a title; Instruments must have a label (E.7).
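As an illustration of this identification policy, the sketch below mints URIs in a hypothetical namespace: Jazz Discography session identifiers are re-used verbatim, while musicians get fresh identifiers derived from the source record rather than from the ambiguous name. The namespace and function names are our assumptions, not the project's actual scheme.

```python
import hashlib

BASE = "http://example.org/dtl/"  # hypothetical namespace for illustration

def session_uri(tjd_session_id: str) -> str:
    """Sessions re-use the Jazz Discography identifier, which is
    assumed to be unambiguously unique (E.6)."""
    return f"{BASE}session/{tjd_session_id}"

def musician_uri(source_record_id: str) -> str:
    """Names are NOT unique, so a fresh identifier is minted from the
    source record; later merging steps may link co-referent musicians."""
    digest = hashlib.sha1(source_record_id.encode("utf-8")).hexdigest()[:12]
    return f"{BASE}musician/{digest}"
```

The derived musician identifier is deterministic per source record, so repeated population runs produce stable URIs without relying on name spellings.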

Evaluation (A.6)
Our approach to evaluating the Jazz Ontology is based on testing the ontology in practice through populating (Section 5) and merging (Section 5.4) heterogeneous datasets based on the ontology (G.3), and then releasing online tools that query the datasets for active use by the jazz community [20] (G.5; see Fig. 1 in the Supplement). Additionally, SPARQL queries were implemented (Section S3 of the Supplement) based on the competency questions defined in Section S1 of the Supplement; these can be run when the ontology is modified to ensure consistency (G.1). See Section 6 for more details.

Semantic model
This section describes the semantic model developed primarily for the representation of live jazz performances and the linking of jazz-related metadata. We begin with the general model that fulfils formal requirements one to three (Section 3.1). Model extensions for requirement four (segmentation and licks) and for requirement five (provenance and workflow) are also presented below. Figure 4 shows the basic model that describes and relates discographic and sessionographic information in jazz.

It is based primarily on the FRBR model and a large subset of the Music Ontology. Alignments with the MusicBrainz ontology are also provided for expressing information about discographic entities. These are shown in parentheses in Fig. 4. This aims to achieve maximum interoperability, retaining the generality of the Music Ontology while allowing a direct mapping for the jazz-related data crowd-sourced and curated by MusicBrainz. We keep the same colour and shape coding for FRBR concept groups throughout the paper.

(Caption of Fig. 4: The shapes represent classes with their semantic concept in jazz as well as the Music Ontology (mo) and Dig That Lick (dtl) classes they were implemented as. MusicBrainz (mb) semantic concepts are given in brackets where they apply. For the complete diagram see Fig. 1 in the Supplement.)
We introduced a shortcut between the Music Ontology classes Performance and Signal, bypassing the abstract Sound concept (mo:Sound) and the recording event (mo:Recording). This is useful for simplifying the description of historical recordings, where information about the recording process can no longer be obtained. The shortcut is justified conceptually for contexts in which the recording process does not require any additional description. In fact, the recording process is more straightforward in jazz, where the focus is on live performance, than in other genres of Western music: whereas for popular music the producer plays an important role and the recording equipment and setup can be captured, for jazz recordings this information is usually not available. If a future ontology user decides to add information about recordings using the Music Ontology, rules can be defined to infer the relation between mo:Signal and mo:Performance, ensuring compatibility with software using our ontology.
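Such an inference rule could be written, for instance, as a SPARQL CONSTRUCT. In this sketch, mo:produced_sound and mo:produced_signal are Music Ontology properties; the property linking the recording event to the sound, and the shortcut property dtl:captured_as, are named here only for illustration.

```sparql
PREFIX mo:  <http://purl.org/ontology/mo/>
PREFIX dtl: <http://example.org/dtl#>

# Infer the Performance–Signal shortcut from a fully described
# Music Ontology chain (Performance -> Sound -> Recording -> Signal).
CONSTRUCT { ?performance dtl:captured_as ?signal . }
WHERE {
  ?performance mo:produced_sound  ?sound .
  ?recording   mo:recording_of    ?sound ;    # assumed linking property
               mo:produced_signal ?signal .
}
```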
In Fig. 4, Session is defined as an event consisting of sub-events which are Performances of a single song/piece/composition. Tune is the basis of a jazz performance, providing a melodic theme and/or a chord progression for improvisation. SoundSignal is the sound of a song/composition captured in the audio signal. It is realised as mo:Signal, but it also relates to mo:Sound from a conceptual perspective. See Section S2 in the Supplement for more detailed descriptions of entities in the Jazz Ontology.

Fig. 5 illustrates some significant differences between the Music Ontology and our model. In jazz, musicians often play more than one instrument, sometimes during one performance, sometimes between performances. They may also pick up new instruments during their career. To reflect this, we introduced a new Performer class which denotes the relationship between a Performance and a Musician. This fills a gap in the Music Ontology, which does not include a generic Performer class reflecting this prominent concept in jazz music. A Performer is best conceptualised as a triple of the form (Performance, Musician, Instrument); there can be more than one Performer relating a Performance to a given Musician, if that Musician plays more than one Instrument in that Performance.
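In Turtle, a multi-instrumentalist in a single performance then yields two Performer nodes; the dtl: class and property names and the instance URIs below are illustrative placeholders.

```turtle
@prefix mo:  <http://purl.org/ontology/mo/> .
@prefix dtl: <http://example.org/dtl#> .
@prefix ex:  <http://example.org/jazz/> .

ex:performance_42 a mo:Performance .
ex:musician_a     a mo:MusicArtist .

# One Performer node per (Performance, Musician, Instrument) triple:
ex:performer_1 a dtl:Performer ;
    dtl:performance ex:performance_42 ;
    dtl:musician    ex:musician_a ;
    dtl:instrument  ex:tenor_saxophone .

ex:performer_2 a dtl:Performer ;   # same musician, second instrument
    dtl:performance ex:performance_42 ;
    dtl:musician    ex:musician_a ;
    dtl:instrument  ex:clarinet .
```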
Bands in jazz are a much looser concept than, for example, in Western pop music. Their lineups can change from Session to Session; the constant of the band is usually its leader. Often the band is named after the leader, and jazz databases often exploit this notion in their models too (see Section 5.2). In other cases, bands do have names which are distinct from the leader's name. We therefore decided to model bands explicitly as mo:MusicGroup, connected to the leader (a MusicArtist) using the has leader property. In some circumstances, however, particularly in jam sessions, a band may have no leader or may have more than one leader. This is permitted by the ontology as no strong commitments are made at the logical level (e.g., no cardinality constraints or complex class descriptions).
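A band and its leader might then be described as follows; has leader is rendered here as dtl:has_leader, and the names and URIs are invented for illustration.

```turtle
@prefix mo:   <http://purl.org/ontology/mo/> .
@prefix dtl:  <http://example.org/dtl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/jazz/> .

ex:band_quintet a mo:MusicGroup ;
    foaf:name      "Musician A Quintet" ;
    dtl:has_leader ex:musician_a .   # optional: a jam-session band may have
                                     # no leader, or more than one

ex:musician_a a mo:MusicArtist ;
    foaf:name "Musician A" .
```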
Another important notion in jazz is the relationship between the Band and its lineup. The lineup changes frequently from session to session. This may be observed even between tracks; in free-text annotations it is often represented through notes like "musician A on tracks 2 and 5" or "for this session musician B departed and musicians C and D joined". The way to disentangle this maze of often inconsistent annotations unambiguously is to relate Performers to single Performances and not to Bands.

It must be noted that jazz performances are not always released or even recorded (therefore the properties captures and published as are optional). Also, many jazz performances are jam sessions, where musicians play together in ad-hoc combinations without a designated band, so a band is not necessarily present for each session. Yet, in recorded and released sessions, a band is usually named, sometimes just as a list of performing musicians, following music industry conventions which require a single name for the performing artist.

Medleys
Since jazz is based strongly on improvisation, modifying and recombining existing tunes is a widespread practice, and as a result, medleys are frequent in our datasets. Even though performances of medleys can be represented implicitly using the event decomposition model, they are not directly modelled in the Music Ontology. A Medley is typically a set of tunes strung into one performance, without breaks between them. Often various musical devices are used to justify the transitions from one tune to the next. In the overwhelming majority of cases, the combination of the tunes is linear, i. e., they follow one after another. We therefore turned to the Ordered List Ontology to model the order of the played tunes. The complete medley is modelled as an OrderedList and the tunes within it are the objects in its slots. The Performance is then related to both the list (the medley tune) and to each of the tunes (Fig. 6).
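A two-tune medley might be expressed with the Ordered List Ontology like this; the olo: terms are from that ontology, while the dtl: properties relating the Performance to the medley list and to the tunes are placeholders for illustration.

```turtle
@prefix olo: <http://purl.org/ontology/olo/core#> .
@prefix dtl: <http://example.org/dtl#> .
@prefix ex:  <http://example.org/jazz/> .

ex:medley_1 a olo:OrderedList ;
    olo:length 2 ;
    olo:slot [ a olo:Slot ; olo:index 1 ; olo:item ex:tune_a ] ,
             [ a olo:Slot ; olo:index 2 ; olo:item ex:tune_b ] .

# The performance is related both to the medley as a whole
# and to each constituent tune:
ex:performance_7 dtl:medley_performance_of ex:medley_1 ;
                 dtl:performance_of ex:tune_a , ex:tune_b .
```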
Alongside medleys, there are many other ways in which existing tunes can be recombined, transformed, or built upon in jazz. We therefore introduced further properties relating Performance and Tune, such as has intro, changes of, theme of, variations on, citation of, all of which we encountered in our datasets.

Dates
For a systematic study of jazz, session dates are crucial, yet they are not always known. In many cases only approximate dates are given. Dates are often annotated as text, in a variety of formats which are not always consistent (see Fig. 7). For example, a date string ca. mid to late summer 56 implies that the event took place around the second and last thirds of summer 1956. The century in the case of jazz is unambiguous, since jazz is around 100 years old. The approximation qualifier ca. can be interpreted as around, meaning that the event might have happened slightly before or after the given time constraints. Yet quantitatively ca. is much less clear-cut than, e. g., late or beginning. Ca. is also different from probably, which indicates general uncertainty without affecting the given time constraints. To account for these ambiguities we introduce two new classes building upon Instant and Interval from the Timeline Ontology: QualifiedDateInstant and QualifiedDateInterval. They inherit the original attributes from Instant (at) and Interval (at and duration); additionally, the date is classified as approximate or exact, and the original string is supplied for later processing (Fig. 8).

Segmentation

Fig. 9 lays out the model for music segmentation in audio and symbolic representations. The left column corresponds to the complete performance, the middle column to the solo and the right column to the lick segment. Other types of segments can be represented analogously. SoloPerformance and LickPerformance are placed on the Performance Timeline, with timestamps for the beginning and the end of the segment or its duration (see Section S2 in the Supplement for class definitions).
The upper row represents the events, extending the basic model diagram to the right (see Fig. 4; Performance is a sub-event of Session). The second row represents our concept of SoundSignal. The third row is the transcription of the SoundSignal, i.e., a symbolic representation of the signal such as a musical score. Transcription can be either manual or automatic. There may be more than one transcription; in fact, in the case of automatic processes, there will probably be a large number of transcriptions. The transcription will be in whatever format is produced by the transcriber. Finally, the lower row represents a symbolic transformation of the transcription into the desired format: it can be a score; a pitch or a scale-degree representation; an interval representation like "2,1,2,-3,-4,2", indicating the number of semitones between played pitches; an onset time list; or any other form of symbolic representation.
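As a concrete illustration of the interval representation mentioned above, the following sketch derives the semitone-interval string from a sequence of MIDI pitches; the function name is ours, not part of the DTL toolchain.

```python
def interval_representation(midi_pitches):
    """Return the semitone intervals between successive pitches,
    e.g. the pattern notated "2,1,2,-3,-4,2" in the text."""
    return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]

# A seven-note melody starting on middle C (MIDI 60):
melody = [60, 62, 63, 65, 62, 58, 60]
print(",".join(str(i) for i in interval_representation(melody)))  # → 2,1,2,-3,-4,2
```

Because the representation is transposition-invariant, two solos playing the same lick in different keys yield the same interval string, which is what makes it useful for pattern search.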

These representations are often generated with a view to further processing, such as pattern matching. There can be symbolic transformations of various types for a given transcription. It should be noted that the lowest layer corresponds to technical tasks in the musicological analysis of jazz, rather than representing entities that are important in the domain of jazz specifically.

Provenance and workflow in jazz analysis
This section describes the modelling of musicological analysis of jazz, in particular using automated metadata generation. This part of the model is less specific to jazz and can be applied in a similar way to modelling analysis tasks in other musical genres and traditions.
Automatic metadata generation leads to a multiplicity of metadata versions of varying quality. For instance, when an algorithm for automatic melody extraction is tested, evaluated, and improved, followed by further testing and evaluation, a melodic transcription is generated at every iteration. To keep track of the versions and their origins, provenance and workflow capturing are essential.

The Provenance Ontology (PROV-O) 31 lays out a general framework for describing provenance. Its three main classes are Entity, Activity, and Agent. In the case of metadata generation, a workflow is an Activity which generates entities such as a Transcription for a Sound or a Match for a pair of Licks. In the case of manual transcription, the transcriber is the Agent associated with the TranscriptionWorkflow; for automatically generated transcriptions, the workflow is associated with the algorithm, code, or application which was executed.

Our datasets in the context of automatic metadata creation can be understood as a layered digital library [43] that comprises an audio collections layer, a metadata layer, and a computational analysis layer which is described by workflows. The outcomes produced by workflows constitute the feature metadata layer, the input for the exploratory analysis, which can be carried out by other workflows or by human researchers. An example of the application of this concept to a digital library of music is described by [78].

Further automatic metadata generation agents that may be represented by the ontology include SoloFinder for automatic solo detection, and InstrumentRecogniser for recognising the instruments of solo and lick segments. Any automatic content analysis process can be represented similarly.
Finally, the whole process of symbolic representation generation, from transcriptions to n-grams, together with the generated data and the pattern matching, could be summed up as a LickSimilarityWorkflow and managed as a single object. This is of particular use for jazz data researchers who are less interested in the technical details of lick matching and prefer to focus on further analysis tasks for which lick matching is just one of the steps. Such an object may provide an API enabling the researcher to set important parameters such as the similarity threshold, while using default values for all other parameters, without having to set up each workflow separately. At the same time, when inconsistencies arise, e.g., when a SPARQL query does not find that one important result it found yesterday or a year ago, the documentation of the workflow allows full accountability of the changes undertaken on the automatic algorithms' side, and, if necessary, reproducibility of previous results. For a more detailed modelling of automatic audio analysis workflows see [43].
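Such a facade could be sketched as follows; the class, parameter names, and default values here are invented for illustration and do not reflect the project's actual API. The point is that the researcher overrides only the parameters of interest, while the resolved configuration is retained as a record for provenance and reproducibility.

```python
# Invented defaults; a real workflow would have many more parameters.
DEFAULTS = {
    "similarity_threshold": 0.8,
    "ngram_length": 5,
    "representation": "interval",
}

class LickSimilarityWorkflow:
    """Facade over the lick-matching pipeline: set a few parameters,
    inherit defaults for the rest, keep the full configuration on record."""
    def __init__(self, **overrides):
        unknown = set(overrides) - set(DEFAULTS)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        self.params = {**DEFAULTS, **overrides}  # provenance record of the run

wf = LickSimilarityWorkflow(similarity_threshold=0.9)
print(wf.params["similarity_threshold"], wf.params["ngram_length"])  # -> 0.9 5
```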

Ontology population: RDF dataset content, integration and enrichment
In this section we describe how we collected audio datasets (Jazz Encyclopedia, Illinois, Porkpie); how they were associated with sessionographic information; how they were recombined for the purposes of coverage and representation (100 Years of Jazz, DTL1000); and how our RDF datasets (JE, ILL, 100 Years of Jazz, DTL1000) were constructed.
In order to test the ontology and its support for semantic integration and interoperability, we assembled a large corpus of jazz recordings. The data comprises over 50K tracks, aiming to ensure that the collection was broadly representative, without obvious gaps or omissions. The collections that contributed to our audio corpus were:
• the Jazz Encyclopedia – a collection of 500 CDs documenting jazz from its beginnings to the 1950s;
• the Illinois collection – jazz CDs from the library of the University of Illinois Urbana-Champaign;
• the Porkpie collection – a set of audio CDs used in the JDISC project (but not directly related to the metadata collected in the JDISC project).

The Jazz Encyclopedia

The Jazz Encyclopedia metadata provides the following information for each track: title(s), composer(s), band name, date string, area string, and a list of musicians with their corresponding instruments. Although the data shows evidence of considerable effort and careful curation, within the fields of the top-level structure there is often unstructured text which is non-trivial to parse. In addition, the file required cleaning: in some cases data appeared in wrong fields, and a significant number of musician and instrument lists were truncated. We cleaned the data and reconstructed the truncated strings with the help of the printed booklets. 35

32 https://osf.io/buxvr/
33 https://osf.io/rqk7z/
34 MusicBrainz entry: https://musicbrainz.org/series/de056225-6766-4e7d-95e9-7b032b8b2b79
35 The cleaned version of the file can be downloaded from the Open Science Framework project: https://osf.io/6jes7/

Parsing track titles presented an interesting problem. Normally, a track title contains the name(s) of the tune(s) played in a performance. In some cases, e.g., when the track is a medley, more than one tune title is provided. These cases were mostly structured consistently in the CSV fields. Yet there were more complex cases, such as where a performance had an introduction, or was based on a theme from a different tune, and that title was given in brackets. Other information was also given in brackets: an alternative title of the same tune, the composer of a classical piece, or additional information such as a take or part number. Finally, in some cases brackets were part of the original tune title. Our parser attempts to disambiguate these cases when sufficient information is provided.
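A minimal sketch of one branch of this disambiguation might look as follows; the regular expression and the example titles are illustrative only, and the actual parser handles many more bracket cases than take and part numbers.

```python
import re

# Split off a trailing take/part number in brackets; anything else stays
# attached to the title. Pattern and labels are invented for illustration.
TAKE_RE = re.compile(r"\s*\((take|tk\.?|part|pt\.?)\s*\d+\)\s*$", re.IGNORECASE)

def parse_title(raw):
    """Return (tune_title, take_or_part_info_or_None)."""
    m = TAKE_RE.search(raw)
    if m:
        return raw[:m.start()].strip(), m.group(0).strip(" ()")
    return raw.strip(), None

print(parse_title("Body And Soul (take 2)"))  # -> ('Body And Soul', 'take 2')
print(parse_title("Salt Peanuts"))            # -> ('Salt Peanuts', None)
```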
Instrument abbreviations provided in the CSV file, although largely consistent, did not match those commonly used in jazz documentation. We provide a mapping to standard instrument names where possible; these are also included as an attribute of the instruments in the RDF repository alongside the original description. There are many exotic instruments in this dataset, from washboard to piccolo flute. Additionally, descriptions of other sound events are included (e.g., 'not audible' or 'occasional shouting'), which makes it difficult to parse the instrument data. Performer or instrument metadata may include a question mark. To capture this type of uncertainty, we implemented a confidence attribute for the performer class. We stopped short of devising a thesaurus of musical instruments, given the time constraints of the project. See [79,37] for knowledge representation issues in musical instrument ontology design and automatic instrument classification methods.

Of all metadata, dates required the most effort to clean and process (see Section 4.3 for examples of provided date strings). We developed a parser for approximate datespans capable of processing the dates in our datasets. Altogether we have been able to process about 20,000 approximate dates with our parser. The code is available on GitHub. 36 Approximate dates are a frequent feature in a number of domains outside music, e.g., history, archaeology and cultural heritage more generally. Also, in many cases, user-generated metadata results in approximate or ambiguous dates. We anticipate that our tool will be useful in those domains.
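The flavour of such approximate-date parsing can be conveyed by the following minimal sketch; it is not the published parser (which handles many more formats) and the function name and return structure are invented. Two-digit years are expanded into the twentieth century, which, as noted in Section 4.3, is unambiguous for jazz.

```python
import re
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split())}

def parse_datespan(text):
    """Parse strings like 'Sep 56' or 'ca. October 1956' into a begin date,
    an approximation flag, and the original string kept for provenance."""
    s = text.lower().strip()
    approx = s.startswith("ca.")
    if approx:
        s = s[3:].strip()
    m = re.fullmatch(r"([a-z]{3})\w*\s+(\d{2,4})", s)
    if not m:
        return None  # format not covered by this sketch
    month, year = MONTHS[m.group(1)], int(m.group(2))
    if year < 100:
        year += 1900  # unambiguous for jazz
    return {"begin": date(year, month, 1), "approx": approx, "source": text}

print(parse_datespan("ca. Sep 56"))
```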
Overall, our Jazz Encyclopedia RDF repository 37 contains high quality, detailed data and is one of the very few machine-readable jazz metadata repositories of this size and quality.

The Jazz Discography
The Jazz Discography 38 is organised by Leaders; the concept of Band is otherwise absent (Fig. 12). Where a band name does not contain the leader's name, it is stored in the leader's 'first name' field. Otherwise band names are omitted. We introduced both concepts: bands (as a class) and leaders (as a property).

Information about sessions for each leader is presented as human-readable semi-structured text. For example, a list of musicians is given for the first session, and for the following sessions remarks indicate which musicians departed and which new musicians joined. There are further remarks for single tune performances within sessions, indicating, e.g., that a given musician only plays on track one. The necessity of such remarks clearly shows that the model adopted by the Jazz Discography is not ideal and needs refinement. We therefore introduced a concept of Performance, which captures what would become a track on a CD. Each Performance has its own list of performers, thus avoiding the need for comments and adjustments of lineup lists (see Section 4.1). Performance also allowed us to link sessionographies to discographies, connecting Performances and Tracks via SoundSignal.

Dates, places and venues of sessions in the Jazz Discography are given in one string. It was not always possible to parse these strings properly due to inconsistencies of syntax. Where separation of place and date was possible, we used our dateParser to process dates, which successfully parsed a large majority of dates.
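The resolution of incremental lineup remarks into explicit per-Performance performer lists can be sketched as follows; musician names and the remark encoding are invented for illustration.

```python
def resolve_lineups(initial, changes):
    """Turn an initial lineup plus per-session (departed, joined) remarks
    into an explicit lineup set for every session, as the Performance
    concept requires."""
    lineups = [set(initial)]
    for departed, joined in changes:
        lineups.append((lineups[-1] - set(departed)) | set(joined))
    return lineups

sessions = resolve_lineups(
    {"Miles Davis", "John Coltrane", "Red Garland"},
    [((), ("Cannonball Adderley",)),           # session 2: Adderley joins
     (("Red Garland",), ("Bill Evans",))],     # session 3: Garland out, Evans in
)
```

Materialising the lineup for every session (and, analogously, for every Performance) removes the need for the running remarks altogether.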

The 100 Years of Jazz dataset
The Jazz Encyclopedia (Sec. 5.1) offered excellent and reliable metadata and good coverage of jazz styles up to the 1950s. To complement this audio dataset to cover the whole history of jazz, we chose audio tracks from our Illinois audio collection which we could match with sessionographic metadata from the Jazz Discography (Sec. 5.2). The coverage of matching metadata between the Jazz Discography metadata and our audio CD corpus was surprisingly low (matching via label names resulted in 34% coverage). 3,368 tracks from the Illinois audio were chosen, covering later decades of jazz starting from 1960; these tracks comprised the ILL dataset. The union of the two collections, the JE and the ILL datasets, consists of 12,433 tracks with audio recordings and corresponding bibliographic and sessionographic metadata, representing the 100 years of jazz history; we called this corpus the "100 Years of Jazz" dataset.
For this audio dataset, we built two RDF repositories, one for Jazz Encyclopedia tracks and one for Illinois tracks, resulting in the JE and ILL RDF datasets respectively. While the two sources provided different sets of metadata (compare Figs. 11 and 12), each with its own inconsistencies, we were able to express the metadata and their relationships in terms of our semantic model in both cases, thus demonstrating the generality of our model. The two RDF datasets were then merged based on name equivalence (see Section 5.4), resulting in the 100 Years of Jazz RDF dataset.

Merging and reconciliation
We performed the merging of two of our repositories: the Jazz Encyclopedia (JE) and the Illinois subset with the Jazz Discography metadata (ILL). This process is a proof of concept for further merges of jazz metadata repositories using our ontology.
Because the two datasets did not overlap in time, no disambiguation of event objects (Sessions, Performances) was required. The jazz researchers in the consortium decided that, for Musicians and Bands, equivalence by name is a good approximation at this stage (this approach to the disambiguation of musicians and bands had been adopted previously within the datasets). Additionally, for Tunes and Instruments, equivalence by title has been assumed. A straightforward implementation produced a merged dataset out of the box, which successfully drives the pattern and similarity search interfaces.
In a more general case, where event objects have to be disambiguated and where equivalence by name and by label cannot be assumed, the process will be more complex, involving several iterative stages. If audio is available, audio fingerprints provide a good starting point for finding duplicate Signals, which in turn point to equivalent Performances (a one-to-one relationship). Another starting point for finding equivalent objects could be attribute comparison for Sessions: if two Session objects share the date, the place and the Band, they refer to the same session. Further, objects related to merged entities could be examined: if two Sessions contain the same Performance, they should represent the same session; if two Bands play the same Session, they are in fact the same band.
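The attribute comparison for Sessions can be sketched in a few lines; the records here are plain dictionaries with invented values rather than RDF resources, and a real implementation would first have to normalise dates, places, and band names.

```python
def same_session(a, b):
    """Two Session records sharing date, place and band are taken to
    refer to the same session."""
    return all(a[k] == b[k] for k in ("date", "place", "band"))

s1 = {"date": "1959-03-02", "place": "New York", "band": "Miles Davis Sextet"}
s2 = {"date": "1959-03-02", "place": "New York", "band": "Miles Davis Sextet"}
s3 = {"date": "1959-04-22", "place": "New York", "band": "Miles Davis Sextet"}
print(same_session(s1, s2), same_session(s1, s3))  # -> True False
```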

DTL1000 dataset
The DTL1000 audio dataset is a 1,060-track subset of the 100 Years of Jazz dataset, which is balanced with regard to the number of tracks per decade of jazz history and covers a range of different jazz styles. Like the 100 Years of Jazz dataset, it includes audio tracks connected to the respective discographic and sessionographic metadata. Additionally, all tracks were segmented and their style was annotated.

The 1,060 tracks were manually segmented into parts such as 'theme', 'solo', 'intro' and 'fours'. Solo parts were further annotated with the solo instrument. The segmentation follows [34,35,36,37]. For efficiency reasons, each annotator annotated only one half of the tracks, while reaching agreement for occasional borderline cases.
Further, melodic lines were automatically extracted for all 1,705 monophonic solos using the novel deep learning approach developed by a Dig That Lick project partner [18]. These included instruments such as trumpet, trombone, saxophones, clarinet, violin, cornet, and flute. Melody extraction and pattern matching for polyphonic solos, such as those played on piano or guitar, are much harder tasks and were not addressed. For the transcribed solos, note patterns reflecting licks were collected, and lick matches documented.

We updated our RDF datasets with these manual annotations (Fig. 13). Style labels were attached as an attribute to the SoundSignal class. SoloPerformances and their Instruments were added in accordance with our semantic model (see Section 4.4). Solo Performers were automatically inferred. Transcriptions and licks were processed in a separate PostgreSQL database which provided a very efficient implementation of pattern matching; therefore, there was no need to represent them in RDF, particularly given that the number of n-grams was three orders of magnitude larger than the number of solos.
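The scale difference between solos and n-grams follows directly from how note patterns are collected; a minimal sketch (with an invented interval sequence) illustrates why every solo yields many overlapping n-grams.

```python
def ngrams(seq, n):
    """All overlapping n-grams of a symbolic sequence, here used over
    an interval representation to collect candidate lick patterns."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

intervals = [2, 1, 2, -3, -4, 2]  # invented example sequence
print(ngrams(intervals, 4))
# -> [(2, 1, 2, -3), (1, 2, -3, -4), (2, -3, -4, 2)]
```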
An Instrument could not always be assigned unambiguously to the SoloPerformance. In one type of problem case, the annotated solo instrument was not found in the lineup. This was either due to errors of manual annotation (e.g., where instruments are very similar, such as a cornet annotated as a trumpet) or due to errors in metadata parsing and processing, particularly for the Illinois/JD part. Most of these cases have been resolved manually based on the consortium's jazz expertise.
The resulting repository is used for the "Pattern Search" and "Similarity Search" applications 40 created in our project, which allow users to specify a lick and to search the DTL1000 dataset for the same or similar licks. Metadata is used as a prefilter and is also displayed with the search results. Fig. 13 shows the subset of the ontology used for our interface.

Enrichment
While most of the metadata in our RDF datasets has been collected from other sources, we have also enriched these data with new information during our project. DTL1000 was enriched with manual annotations, in particular with segmentation information, solo instruments, and style annotations (see Section 5.5). Based on the manual annotations, solo performers were inferred. For larger datasets, manual annotation is not sustainable. We have therefore introduced two additional ways to enrich data automatically.

First, we linked our Musicians to existing Linked Open Data entries for them. We relied on the collection created by the LinkedJazz project 41 (see Section 2), which scraped the most informative parts of the Web in search of entries for jazz musicians, in particular DBpedia, 42 the Library of Congress authority files 43 and the Virtual International Authority File. 44 Authority files provide variants of a person's name spellings across languages, alongside further curated information; therefore, linking to authority files facilitates disambiguation as well as further identification of entities referring to the same person. A link to DBpedia adds a human usability aspect, allowing free-text information such as a Wikipedia article, as well as photos of the musician and further links, to be displayed each time a given musician is retrieved.

Second, we used inference to add relationships between jazz musicians which did not exist in the original data. For each Session we collected all the musicians who played in it, alongside the band name and the leader. Based on the experience of the LinkedJazz project, we introduced the following relationships inferred from the existing Session and Performance information:
• lj:bandMember for each musician
• lj:bandLeaderOf between the leader and each musician
and further, between each two musicians:
• rel:knowsOf
These properties facilitate advanced reasoning about musicians' careers, relationships, and influence.
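The inference step can be sketched as follows; triples are represented as plain tuples rather than RDF terms, and the session data is invented for illustration.

```python
from itertools import combinations

def infer_relationships(session):
    """Infer lj:bandMember, lj:bandLeaderOf and symmetric rel:knowsOf
    triples from a Session's band, leader and musician list."""
    triples = set()
    for m in session["musicians"]:
        triples.add((m, "lj:bandMember", session["band"]))
        if m != session["leader"]:
            triples.add((session["leader"], "lj:bandLeaderOf", m))
    for a, b in combinations(sorted(session["musicians"]), 2):
        triples.add((a, "rel:knowsOf", b))
        triples.add((b, "rel:knowsOf", a))
    return triples

t = infer_relationships({
    "band": "Jazz Messengers",
    "leader": "Art Blakey",
    "musicians": {"Art Blakey", "Lee Morgan", "Wayne Shorter"},
})
```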

RDF dataset statistics
This section presents some descriptive statistics about our repositories. Table 1 outlines the overall number of entities in each dataset. Table 2 shows entities which possess important attributes, allowing for more informative queries. Finally, Table 3

Model evaluation
In the previous section, we have shown that the Jazz Ontology can be used to model different, heterogeneous metadata resources, making them interoperable and enabling dataset merging and enrichment. Next we turn to the formal requirements (Section 3.1) to verify that they are fulfilled, and to the competency questions (see Section S1 of the Supplement) to ensure that these can be answered successfully based on our datasets.

Formal requirements
Formal requirements describe what entities and relationships have to be represented by the ontology. The Jazz Ontology incorporates discographic concepts such as Track, Release, Album, and Label (FR1); see Fig. 4 for details. All Jazz Ontology class definitions are listed in Section S2 of the Supplement. It also includes Sessions (FR2), which have the attributes Date, including approximate dates (see Section 4.3), Area and Venue, and are connected to the Band that played in the given Session (Fig. 5). A Session is related to the Band lineup and musical instrument information via two further concepts, a Performance and a Performer, allowing for unambiguous representation of varying band lineups during one Session (FR2). A Performance is also linked to the Tune or series of Tunes (Section 4.2) played in it (FR2). The relationships playedTogether and bandLeaderOf between Musicians have been precalculated for our public datasets 45 (FR3). Our ontology represents segmentation entities such as Solos and Licks (Fig. 9, FR4) as well as the workflow for automatically created metadata (Fig. 10, FR5), though licks and workflows were not implemented in our datasets.
For the competency questions formulated in Section S1 of the Supplement, we list SPARQL queries and their results on our datasets in Section S3 of the Supplement.

Ontology metrics and formal validation
To formally assess the quality of the Jazz Ontology, we follow the approach of [80], presenting domain coverage and popularity measures (Tab. 4). We also report further metrics calculated by the Protégé ontology editor [81]. The ontology was additionally validated with reasoners, including [83] and FaCT++ (version 1.6.5) [84]. No inconsistencies were found.

In-use validation
During the Dig That Lick project, two online applications for investigating licks in jazz were created and released to the jazz community [85]. The Pattern Search 46 and the Similarity Search 47 interfaces (see Fig. 1 in the Supplement) utilise the SPARQL queries created to address the competency questions, to allow prefiltering of searches (e.g. limiting them to a particular period) and to display the metadata about the lick query results (e.g. who played the matching licks and when). The Timeline and Network views of the retrieval results give further insights into the presence and distribution of licks [20]. The interfaces were first evaluated by the jazz expert members of the consortium and then released to the public. In the first month after their release in November 2019, hundreds of searches were performed by users from around the world. Moreover, novel research has been conducted by means of these applications [76,86,87,88]. This active and productive use of the interfaces by the jazz community demonstrates the need for and the value of the Jazz Ontology.

46 https://dig-that-lick.hfm-weimar.de/pattern_search/
47 https://dig-that-lick.hfm-weimar.de/similarity_search/

Conclusion
In this paper, we have presented an ontology that provides a semantic model of jazz, and assessed the ontology in its capacity to facilitate metadata integration in the Dig That Lick project. The main contributions include a semantic data model describing the jazz domain in a way applicable to large, automatically collected and processed datasets. The model is based on detailed domain knowledge, such as band lineup fluidity and the importance of band leaders. The integration of discographic and sessionographic information was crucial to relate audio recordings to informative metadata. This experience might be valuable in other musical genres and traditions where digital discographies do not deliver essential information needed by users, such as perhaps world music or ethnomusicological recordings.
Also, with a view to large datasets and automatic metadata creation, we introduced provenance and workflow modelling for documenting these situations, as the basis of reproducibility and explainability, as well as algorithmic optimisation. We created four RDF repositories of different sizes with different aims, populating our ontology and thereby demonstrating its practical applicability. Further, we have demonstrated that repositories based on our ontology can be merged if mechanisms for entity resolution are available. These repositories are being used in online applications for investigating patterns in jazz performances. Moreover, they open many avenues for further systematic studies in jazz history and musician relationships.

Future work
Given the complexity of jazz and the richness of information collected by researchers, journalists, and music fans over the years, our work has been the start of what we hope will be an ongoing effort. Below we list some directions in which this work can be taken further.
While we processed the dates from structured but not always consistent strings into machine-readable data, we have not done the same for places. Geographic information is important for jazz studies as well as for automatic processing of, e.g., relationships. Ideally, the place strings should be parsed to extract the area information (e.g., New York) as well as the venue where it is given.
A musical instrument thesaurus would be a valuable addition. This could extend the Hornbostel-Sachs classification [89], adding modern instruments and providing an OWL or SKOS representation. A thesaurus would not only improve search and retrieval, it could also help to improve the matching mechanism for both instruments and musicians. A comprehensive thesaurus would allow for instrument label parsing even where comments have been added to that string. Moreover, cases where instruments have been mislabelled due to the difficulty of recognising them (e.g., tenor saxophone vs. alto saxophone) would be uncovered and attributed more easily. Also, musicians often play related instruments and more rarely play unrelated ones: if two players with the same name play various saxophones, there is a higher chance that they are the same person than if one of them played trumpet and the other drums.
Following on from that, better mechanisms for disambiguating musicians' names and band names would improve the quality of the repositories. New strategies for matching musicians and bands could be developed, e.g., based on existing Linked Open Data. We envision that a better resolution of musicians and bands would result in particular from more overlapping data, since each new match (e.g., a fingerprint match between signals or a date match between sessions) would lead to further matches between related entities in the iterative process. Integrating further datasets would therefore allow for musician and band disambiguation that is less reliant on string matching of the names.

Another problem we encountered was the identification of the soloist, particularly in situations where several musicians in the performance lineup play the annotated instrument. Firstly, information about the order of solos and soloists is sometimes available, though in a less digitally accessible form: the Jazz Encyclopedia's printed booklets sometimes have it as a comment, as do the Jazz Discography web pages. Additionally, automatic performer recognition could be used as a tool to disambiguate the soloist. It could be based on lick preferences, using the outcomes from the Dig That Lick project. Alternatively, acoustic properties of the instruments could give further clues about the performer's identity.

In general, automatic audio content analysis is a wide playing field for further enrichment of the metadata. In our project, we automatically extracted note events and licks. Music information retrieval tools exist for identifying chords and harmonic content, key, onsets, tempo, rhythm, etc. Jazz has been of particular interest to the field of Music Information Retrieval, since it presents the community with a number of challenges, some of which have been explored in this article. With regard to audio, jazz recordings are usually polyphonic, with rich harmonic content, little repetition, and differing segmentation rules compared to pop music. We hope that researchers in Music Information Retrieval 48 will take on some of these challenges and produce additional metadata for our corpus.
Such metadata can then be investigated by jazz researchers, or used to evaluate hypotheses based on knowledge and intuition about jazz with the help of data.