Medieval Manuscripts and Their Migrations: Using SPARQL to Investigate the Research Potential of an Aggregated Knowledge Graph

Although the RDF query language SPARQL has a reputation for being opaque and difficult for traditional humanists to learn, it holds great potential for opening up vast amounts of Linked Open Data to researchers willing to take on its challenges. This is especially true in the field of premodern manuscript studies as more and more datasets relating to the study of manuscript culture are made available online. This paper explores the results of a two-year-long process of collaborative learning and knowledge transfer between the computer scientists and humanities researchers from the Mapping Manuscript Migrations (MMM) project to learn and apply SPARQL to the MMM dataset. The process developed into a wider investigation of the use of SPARQL to analyse the data, refine research questions, and assess the research potential of the MMM aggregated dataset and its Knowledge Graph. Through an examination of a series of six SPARQL query case studies, this paper will demonstrate how the process of learning and applying SPARQL to query the MMM dataset returned three important and unexpected results: 1) a better understanding of a complex and imperfect dataset in a Linked Open Data environment, 2) a better understanding of how manuscript description and associated data involving the people and institutions involved in the production, reception, and trade of premodern manuscripts needs to be presented to better facilitate computational research, and 3) an awareness of the need to further develop data literacy skills among researchers in order to take full advantage of the wealth of unexplored data now available to them in the Semantic Web.
§1 The aims of the Mapping Manuscript Migrations (MMM) project (technical descriptions and publications of the project are available at: https://mappingmanuscriptmigrations.org/en/), funded by the Digging into Data Challenge of the Trans-Atlantic Platform between 2017 and 2020, were to bring together data relating to the history and provenance of medieval and Renaissance manuscripts and to explore the research potential of the aggregated dataset. Based on the Linked Data publishing model (Heath and Bizer 2011) and the W3C Semantic Web standards and technologies (https://www.w3.org/standards/semanticweb), including Uniform Resource Identifiers (URIs), the RDF data model, ontologies (Staab and Studer 2009), and the SPARQL query language (SPARQL recommendation of the W3C: https://www.w3.org/TR/sparql11-query/) for querying RDF data, the project resulted in establishing a Linked Open Data (LOD) service and a public MMM portal (available at: https://mappingmanuscriptmigrations.org). The data and the portal allow users to access and query across three distinct datasets, each focusing on premodern manuscript data but built to serve three different purposes: the University of Pennsylvania's Schoenberg Database of Manuscripts, the Institut de recherche et d'histoire des textes' Bibale database, and the Bodleian Library's online catalogue Medieval Manuscripts in Oxford Libraries (respectively, https://sdbm.library.upenn.edu; https://bibale.irht.cnrs.fr; and https://medieval.bodleian.ox.ac.uk). The MMM project also made the transformed datasets (for a full report on MMM data modelling and transformation from legacy databases, see Koho et al. 2021) available for direct searching and downloading on the Zenodo repository (https://zenodo.org/record/4019643).
§2 The work of modelling, combining, and presenting the MMM data was carried out by project team members from the e-Research Centre at Oxford University and the Semantic Computing Research Group at Aalto University. It was based on a series of twenty-four research questions determined at the outset by the project's manuscript researchers at the IRHT and the Schoenberg Institute for Manuscript Studies, as well as by members of a focus group gathered in the early stages of the project. The questions were designed to serve as examples of the kinds of inquiries that researchers would want to make, in order to identify for the data modelling team the key data points they would want to access and query. They were also used to analyze and test the data model and the viability of the aggregated data, and were then used in the evaluation of the public MMM portal (Burrows et al. 2020). To these ends, the original research questions were fundamental to the shaping and successful implementation of the project.

§3 While the launch of the MMM LOD service and portal marked the formal end of the project, for the MMM project team it represented a path to a new frontier for research. The portal, based on the Sampo model (the Sampo model and series of semantic portals are described at: https://seco.cs.aalto.fi/applications/sampo/) and the Sampo-UI framework, with its search, data exploration, and data analysis functionalities, is an interface that lies between the users and the underlying RDF data. The portal can be used without programming skills or knowledge of the SPARQL language. The user can choose from five perspectives (Manuscripts, Works, People, Places, and Events) that provide easy entry points into the dataset and facilitate searching and analyzing the data for users new to Linked Data.
The perspectives are implemented using SPARQL queries to the underlying LOD service; they mediate but also ultimately limit users' ability to query the data flexibly, extensively, and expansively. The perspectives are grounded in traditional research questions that were created outside of a computational context and are therefore not suited to take full advantage of the data model they helped to create. The really interesting data digging happens when the user confronts the RDF data directly via the SPARQL endpoint, using custom-made SPARQL queries to solve particular research questions.
For this purpose, SPARQL editors such as YASGUI (Rietveld and Hoekstra 2017) can be used, or alternatively programming environments such as Google Colab (https://colab.research.google.com/notebooks/intro.ipynb) and Jupyter notebooks (https://jupyter.org), which support Python scripting for visualizations and data analyses based on SPARQL queries.

§4 This paper explores this process as it was undertaken by members of the project team, the primary authors of the present article, who participated in a two-year-long process of collaborative learning and knowledge transfer between computer scientists and humanities researchers. The process developed into a wider investigation of the use of SPARQL to analyze the data, explore broader types of research questions, and assess the research potential of the MMM aggregated dataset and its Knowledge Graph.
Through an examination of a series of six SPARQL query case studies, we will show that the more adept we became at querying, the better we understood that the scope of the original research questions had fallen short of both the abilities and the potential of the MMM data to create new knowledge about the production and transmission of manuscripts across time, and that a new approach to research questions would produce better and more transparent results. In addition to analyzing the queries themselves, we will also show what the case studies reveal about the structure and contents of the MMM data, and how lacunae in the data (especially around biographical details of persons) can be compensated for by drawing in information from other Linked Open Data resources like Wikidata.
2 The research questions

§5 Before turning to the SPARQL case studies, it is useful to give further background on the development of the original research questions and to highlight some of the key problems they presented when applied to the aggregated dataset. A research question is typically understood to be a question that a research project seeks to answer. Identifying a research question or set of questions is generally one of the first steps in developing the methods and techniques for scholarship, whether that scholarship is traditional or digital, because it provides a basis and a goal for starting work. The MMM research questions were based on the team's pre-existing knowledge of each dataset, but they also represented a set of expectations for what manuscript researchers might want to know about manuscripts in general (Table 1).

1. How many manuscripts produced before 1600 in European countries survive?
2. How many manuscripts were produced in Northern Italy and/or Lombardy?
3. How many manuscripts were produced in the Low Countries?
4. How many manuscripts were produced in London in the fifteenth century?

possible or that were based on flawed assumptions made by users about the abilities of the data to return the expected results. Gaining an awareness of these problems also helped the team refine the questions as their understanding of the available evidence and nature of the data increased.

4 The MMM data model and knowledge graph

§12 The MMM data model, which draws on the CIDOC-CRM (Doerr 2003; for the CRM standard online, see: http://www.cidoc-crm.org/) and FRBRoo (Riva, Doerr, and Žumer 2009) ontologies for its entity classes and properties but also adds some properties specific to MMM, has been discussed in detail elsewhere (Koho et al. 2021, 4-10). The vocabularies for actors and places were automatically harmonized across the source data using these identifiers. Manuscripts were harmonized using shelf-marks or Phillipps numbers (assigned by the 19th-century collector Thomas Phillipps). The names of works were harmonized by manual review of string matching on titles; this only covered titles in the same language, not translated titles in other languages.

§13 A temporal distribution of the events in the MMM data by decades is shown in Figure 1, with separate categories for (1) manuscript production events.

between the data elements in order to perform functional queries. In his explanation of RDF, Joshua Tauberer notes: "What is meant by 'semantic' in Semantic Web is not that computers are going to understand the meaning of anything, but that the logical pieces of meaning can be mechanically manipulated by a machine to useful human ends" (Tauberer 2006). The humans using the machine, we learned, must therefore understand the logical structure in order to manipulate it for useful computational ends.

§15 When considering the MMM data model, it is important to keep in mind its relationship to the research questions.
The data model is expressed in RDF, a method for describing data by defining relationships between data objects. The "subject-predicate-object" pattern produces triples that express the relationships. A triple is the basic unit of an RDF knowledge graph. For many, the concept of triples is difficult to digest. Unlike most other data models that present data as lists of elements, such as a spreadsheet with well-defined columns or the tables in a relational database, the elements in RDF exist in something more comparable to a cloud of data, seemingly loosely connected by semantic statements. It is much harder to visualize and internalize the structure in one's mind, which may explain why understanding RDF and ways to query it is difficult for non-semantic web specialists.

§16 As the syntactical naming of the units comprising a triple suggests, triples work much like sentences. In a sentence, which can also be a question, subjects and objects are related by the action or state of being that links them. If one considers triples as a list of answers to questions (who did what, what is something, when was something done), then a query in RDF is simply a triple or a series of triple statements expressed in the context of a search to identify desired data elements possessing certain relationships. A simple SPARQL query can be expressed as "Show me all things associated with this thing." Then, a further relationship can be added to refine results:

"Then show me all the things associated with those things that share this value." Further triple statements can be added to the query indefinitely to execute a variety of search functions. The query, then, is limited by only three things: the researcher's ability to think of new questions to ask or new associations to make; how well the associations have been expressed in the data model in relation to the data; and how well the data has been structured so that the required data elements are accessible to the computer performing the search.

§17 As we noted above, the MMM RDF data model was derived in large part from the data elements identified in the research questions described in the previous section (manuscripts, texts, owners, places of production, dates, etc.) (Figure 2).
These elements are the nodes represented in the model. The nodes are connected to each other by the properties derived from the MMM ontologies, which express all the possible relationships between the nodes, for example, "is composed of," "has former or current owner," "took place at," "has timespan," etc. In the RDF schema, the nodes are the subjects and objects connected to each other by the properties or predicates; the connections form the triples that can then be queried with a language like SPARQL. To construct a query, one starts with a node, then follows the associations in any direction where there is a link. In such a flexible structure, the possibilities for what one can query and how are greatly expanded. For the MMM project team, achieving a high degree of familiarity with the data model enhanced the ability to query it and opened up new ways to approach the data well beyond the scope that the original research questions set out to achieve.
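The node-following pattern just described can be sketched as a minimal SPARQL query. This is an illustration rather than one of the project's actual queries: the ecrm: namespace URI and the P108 production property are assumptions based on the CIDOC-CRM orientation of the MMM model, not copied from the MMM codebase.

```sparql
# Hypothetical sketch: start from a production-event node and follow two links.
PREFIX ecrm: <http://erlangen-crm.org/current/>

SELECT ?manuscript ?place
WHERE {
  # "Show me all things associated with this thing":
  ?production ecrm:P108_has_produced ?manuscript .
  # "Then show me all the things associated with those things":
  ?production ecrm:P7_took_place_at ?place .
}
LIMIT 100
```

Each pattern in the WHERE clause is itself a triple, with variables standing in for nodes; adding further patterns narrows the set of matching subgraphs, exactly as the sentence-like refinement above describes.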

5 SPARQL queries as case studies

§18
Most of the original MMM research questions could, with some exceptions, be answered to a greater or lesser extent through the Semantic Web portal interface to the MMM data, using a combination of filtering and searching. But the project team wanted to go beyond this interface-partly to tackle those questions that the interface could not answer, partly to explore new questions, and partly to explore the relationships and structures in the data more fully. SPARQL queries were used for this purpose, and the remainder of this paper discusses a selection of these queries as case studies for investigating the research potential of the aggregated dataset and the MMM Knowledge Graph. They include three of the original MMM research questions, as well as further questions arising from another large European manuscript provenance project, Cultivate MSS (https://www.ies.sas.ac.uk/research-projects-archives/cultivate-mss-project), an ERC-funded project led by Laura Cleaver at the Institute for English Studies at the University of London, and other questions intended to add data from sources outside the MMM Knowledge Graph. The queries based on these questions did not always return the expected results, but the lessons learned from them led to better questions and better results.

Query 1: How many manuscripts were produced in Lombardy or Northern Italy?

§19
The first query, based on Question 2 of the original research questions, is a simple one requiring only two data elements: show me all of the manuscripts associated with a certain production place. In this case, a researcher may want to find all illuminated manuscripts produced in or around Milan. Milan, a specific city located within the region of Lombardy, was a major and influential centre of illumination in northern Italy especially in the late Middle Ages, but only a fraction of manuscripts produced in this area during this time have been securely localized to the city in available sources.
Shared stylistic features in script or decoration, or textual references (e.g., calendars), in otherwise unlocalizable manuscripts can, however, point to affinities with this particularly "northern" style, leading cataloguers to tentatively assign "Northern Italy" as the place of production when the case for a secure tie to Milan or Lombardy is too tenuous to justify a more precise attribution.

§20 The researcher will therefore want to cast a wide net to find all manuscripts with a possible connection to Milan. The query "show me all manuscripts produced in Lombardy and Northern Italy" will return a reasonable set of results to allow narrowing down the search for manuscripts produced in Milan. A search for manuscripts produced in Lombardy will helpfully limit results, but a wider search for manuscripts produced in Northern Italy could return more expansive results and more chances of finding manuscripts that have not yet been more accurately localized.
5.1.1 Query explanation

https://api.triplydb.com/s/l6M4n5Eff

§21 The query (Figure 3) begins with a SELECT statement, which identifies the variable values to be returned by the query. The SELECT statement here (lines 9 to 10) includes variables that will return only distinct, or different, manuscript values and production place values. Also included are production timespans, though the timespan is not essential to the original query. Multiple production place and production timespan values associated with the same manuscript value are concatenated to avoid showing duplicated values within the same manuscript record. The body of the query then matches the URIs for Northern Italy and Lombardy, as well as any places that are nested within them.
Lines 18 to 21 make optional the association between a value in production place and a manuscript, and lines 22 to 24 do the same for the production timespan.
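The shape of the query can be paraphrased as follows. This sketch does not reproduce the exact prefixes, property paths, or line numbering of the query in Figure 3; the ecrm: production properties are assumptions, while the TGN identifiers and the gvp:broaderPreferred* path are taken from the discussion in the text.

```sparql
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX gvp:  <http://vocab.getty.edu/ontology#>
PREFIX tgn:  <http://vocab.getty.edu/tgn/>

SELECT DISTINCT ?manuscript
  (GROUP_CONCAT(DISTINCT ?place; separator = "; ") AS ?production_places)
WHERE {
  # Northern Italy (tgn:4005363) and Lombardy (tgn:7003272)
  VALUES ?region { tgn:4005363 tgn:7003272 }
  # broaderPreferred* walks the TGN hierarchy, so nested places match too
  ?place gvp:broaderPreferred* ?region .
  # the manuscript-place association is optional, as in lines 18 to 21
  OPTIONAL {
    ?production ecrm:P7_took_place_at ?place ;
                ecrm:P108_has_produced ?manuscript .
  }
}
GROUP BY ?manuscript
```

GROUP_CONCAT performs the concatenation of multiple place values per manuscript described above, so each manuscript appears on a single result row.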

Results

§23
The query returns 1,702 instances of manuscripts, or manifestation singletons, in the combined dataset that contain the TGN IDs for Northern Italy (tgn_4005363) and for Lombardy (tgn_7003272) as a production place value. The predicate gvp:broaderPreferred* in line 15 of the query also enables the capture of cities and sites within Lombardy without having to identify and enter all TGN IDs associated with Lombardy. For example, Results 3 to 6 show "manifestation singletons," which is how the FRBRoo ontology defines a manuscript object, with Milan as the production place because Milan is contained within Lombardy in the TGN hierarchy (http://vocab.getty.edu/tgn/7003150).

§24 The results also show manifestation singletons with multiple production places. These results indicate that more than one place attribution has been assigned to a particular manifestation singleton. There are two reasons for this result. The first has to do with the way that manuscripts are often described: a source description identifies two or more possible places of production either because a cataloguer is hedging their bets, for example, a manuscript could be described as from "Austria or Northern Italy" (http://ldf.fi/mmm/manifestation_singleton/sdbm_24767), or because a manuscript contains two component parts that were produced in different places and later bound together. The second reason is due to the data modelling: two or more of the sources could give two different places, as in this example in which the Bodleian record gives one place of production and the SDBM gives another: http://ldf.fi/mmm/manifestation_singleton/bodley_manuscript_2010. In the case of the SDBM, a manuscript record may contain two or more entries that give different location data.

Lessons learned

§25 Some general conclusions can be drawn about the interpretation of the dataset
based on these results. The results highlight inconsistencies inherent in manuscript description dependent upon human observation: differing opinions (Austria or Northern Italy?), knowledge changes across time (it was considered to be made in Northern Italy, but recent studies now indicate that it may have been produced in Siena), and inconsistencies in data entry (production place was not provided in the source data). (The SDBM draws its data from catalogue sources that can vary widely in the amount of detail provided in manuscript description, from simple identification of author, title, and date to full codicological descriptions; it is common therefore for many details relating to the physical description of a manuscript not to be provided.) The query results therefore cannot be taken at face value, and researchers must navigate through the manuscript links in the MMM record for further exploration and discovery.

§26 A review of the results for this query raises the question: are the SPARQL effectively. Thus, this particular SPARQL query offers only limited advantage over more direct searching in the source datasets and no advantage over the portal.

§27 While the query did not improve on the results provided by the MMM portal, the process of building the query gave shape to the data and insight into the limitations and character of the source data. This exercise, along with other early, relatively simple queries the group created, introduced the building blocks for SPARQL queries, like place and timespan techniques, that were returned to time and again.

Query 2: How many manuscripts survive that contain Spanish texts written in gothic rotunda script that were produced in Castile for an abbey or convent? How many of these were owned during the nineteenth century by English private collectors and are now owned by an institution in North America?
§28 The first case study shows a simple query that produces no further information beyond what can be gained from a filtered browse in the MMM portal. The second case study demonstrates how adding complexity to the question expands the potential of using SPARQL to query an RDF dataset. The question attempts to determine how many manuscripts exist today that were written in a certain script type and that were produced in a certain institution type existing in a specific production place. The question then proposes that those results be further limited to those manuscripts owned by a certain collector type from a specific location during a specific timespan.
A final additional query limits the search again to those manuscripts with a specific current location.

§29 As one of the original research questions, this question was designed to be complex for complexity's sake, in order to demonstrate for the data modellers how a researcher might want to drill down with increasing specificity using a wide range of qualifiers. It is an intentionally challenging question that tests the limits of the source datasets. The question contains an element, "script type," that was ultimately not included in the final data model because it was not adequately represented in the original data sources. The question also requires that the query be able to distinguish between types of institutions (religious, monastic) and types of collectors (private versus public), as well as distinguishing current locations among all locations identified in the data. Unlike the first case study, this question produced, not surprisingly, a fundamentally more complex query that tests not only the data model but also the user's ability to interpret the results of the query. The query was developed in two steps: first, to identify Castilian manuscripts with Spanish texts; then, to determine who produced them.

Step 2

§32 In the next step, we modified the previous query to include "produced for" information. Commissioning data is part of the production event class, so we add the ?commissioner variable at line 21 by linking it to the ?production event variable via the predicate mmms:carried_out_by_as_commissioner. Line 23 collects the labels for the ?commissioner. This modified query produces zero results, however, which tells us that no commissioning data is available for this group of manuscripts in the MMM dataset. We were thus not able to continue the query to find out more about later ownership.
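The Step 2 modification can be sketched like this. Only mmms:carried_out_by_as_commissioner is quoted from the text; the other property names, the label property, and the mmms: namespace URI are assumptions introduced for illustration.

```sparql
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX mmms: <http://ldf.fi/mmm/schema/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?manuscript ?commissioner ?commissioner_label
WHERE {
  # the production event of each manuscript found in Step 1
  ?production ecrm:P108_has_produced ?manuscript .
  # the added pattern: who commissioned the production?
  ?production mmms:carried_out_by_as_commissioner ?commissioner .
  OPTIONAL { ?commissioner skos:prefLabel ?commissioner_label . }
}
```

Because the commissioner pattern is a required (non-OPTIONAL) triple, any manuscript without commissioning data is eliminated from the solution set, which is why the modified query returns zero results when no such data exists.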

Results

§33
The initial query produced a list of 286 texts, or "expressions," in manuscripts which were produced in Castile and written in the Spanish language. Amending the query to look for the person or organization who commissioned the production of these manuscripts produced no results. Further exploration of the data showed that the

In the MMM data, he is encoded as an owner of this manuscript, but is not linked to its production event. This suggests that some re-thinking of the transformation and mapping of personal role statements from the Bodleian data in particular might be worth considering.

§36 Both "Castile" and "Spanish" are also problematic in this query. Historical regions like the Kingdom of Castile are not reflected in the TGN hierarchy of places, which is based on current administrative and jurisdictional boundaries, so the property gvp:broaderPreferred cannot be used. For this query, the ?place variable had to be bound to a list of specific place URIs from the TGN that was roughly comprehensive. The lack of geographical hierarchy information, and the fact that historical boundaries change over time, mean that there is no simple method for capturing places within historical regions. Records that represent Castilian manuscripts may simply list Spain as the place of production, but there is no way to determine more specific locations within Spain in the query.

§37 The term "Spanish" for language is also ambiguous. Spanish in its modern sense is a post-medieval phenomenon (Penny 2002); the MMM data sources are inconsistent in their encoding of medieval languages from the Iberian peninsula. The fullest and most accurate way of constructing this query would involve inspecting all these varieties of languages and places in the data sources, seeing the extent to which they are reflected in the MMM data, and ascertaining how best to specify them in the SPARQL query. Even a cursory look suggests a significant level of inconsistency in the source data.
These considerations would still apply if the question were made much more specific along these lines: Which manuscripts containing texts in a vernacular language were produced in the Kingdom of Castile as it existed in 1217?

and place and date associated with a specific event, in this case the acquisition of a manuscript with a certain text by a French collector in the 17th century, which is defined in the MMM data model as a "provenance event." All of these elements are included in the data model, but the challenge is to identify what data, or combination of data, determines popularity. What, in the context of the MMM dataset, does popularity mean?

Query 3: What was the most popular text by a medieval author in 17th-century France?
The following query explanation attempts to extract results based on this assumption.
Because of the complexity of the query, the team broke the investigation down into a series of four query steps in which each query builds upon the results of the previous one.

Query explanation
Step 1 (Figure 6). Provenance events occurring in France: https://api.triplydb.com/s/ZWE5m487i

§39 The first step of this query aims to identify all provenance events (dates optional) that occurred in France. Following the SELECT statement, Line 9 assigns a specific value to the ?event_type_uri variable, ecrm:E10_Transfer_of_Custody, by using the VALUES clause. Thus, every event type returned in the results will be a provenance event involving the transfer of a manuscript from one owner to another, as opposed to more generic provenance events where a direct transfer of ownership is not necessarily known or confirmed by the data. Line 11 states that every location returned in the results (represented by the ?place_uri variable) must be within the boundaries of France, using the same predicate gvp:broaderPreferred* that was used in the first case study.
Line 14 introduces the ?event_uri variable, stating that every ?event_uri must have occurred at the places assigned to (ecrm:P7_took_place_at) the ?place_uri variable in Line 11. Lines 16-17 further define the types of information we return about events.
In line 16, the symbol a is shorthand for the rdf:type predicate, indicating that the ?event_uri variable is an instance of the ?event_type_uri class, which we defined in line 9 as a transfer of custody event. Line 18 is an optional clause that includes the date that an event took place, if that information is present in the data.

Step 2

§40 Building on the above query, the second query's results include all the manuscripts associated with provenance events that occurred in France, with the dates on which they occurred if known. Line 19 states that the ?event_uri variable is linked to the ?manuscript variable via two potential provenance event predicates: either transfer of custody events or observed manuscript events, which are provenance events where a direct transfer of custody is not confirmed in the data.
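The Step 1 patterns just described can be condensed into a sketch like the following. The VALUES clause, the E10 class, P7_took_place_at, and gvp:broaderPreferred* come from the text; the TGN identifier assumed here for France and the timespan property are illustrative assumptions, and the Step 2 provenance-event predicates of Line 19 are not reproduced.

```sparql
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX gvp:  <http://vocab.getty.edu/ontology#>
PREFIX tgn:  <http://vocab.getty.edu/tgn/>

SELECT ?event_uri ?place_uri ?date
WHERE {
  # restrict event types to direct transfers of ownership
  VALUES ?event_type_uri { ecrm:E10_Transfer_of_Custody }
  # places within France (tgn:1000070 assumed to be France's TGN id)
  ?place_uri gvp:broaderPreferred* tgn:1000070 .
  # events that took place at those places
  ?event_uri ecrm:P7_took_place_at ?place_uri .
  # 'a' is shorthand for rdf:type
  ?event_uri a ?event_type_uri .
  # the event's date, if recorded
  OPTIONAL { ?event_uri ecrm:P4_has_time-span ?date . }
}
```

Putting the class restriction in a VALUES clause rather than hard-coding it makes it easy to widen the query later, for example to include the more generic observed-manuscript events mentioned in Step 2.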

5.3.1.3 Step 3 (Figure 8). Manuscripts with their titles (optional) that had a provenance event that occurred in France in the 17th century: https://api.triplydb.com/s/WVeDNDp7V

We use BIND to extract each event's year, and then FILTER the results to include only those years that are less than 1700 but greater than 1599.
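The date filter at the heart of Step 3 might be sketched as follows; this is a hedged illustration, and the timespan properties shown are assumptions rather than the exact path used in Figure 8.

```sparql
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>

SELECT ?event_uri ?year
WHERE {
  # event -> timespan -> start of the timespan (assumed property path)
  ?event_uri ecrm:P4_has_time-span ?timespan .
  ?timespan crm:P82a_begin_of_the_begin ?begin .
  # extract the year from the timespan's start, then keep only 1600-1699
  BIND (YEAR(?begin) AS ?year)
  FILTER (?year < 1700 && ?year > 1599)
}
```

BIND creates a new variable from an expression, so the FILTER can then compare plain integer years rather than full date literals.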

Step 4 (Figure 9). Manuscripts with texts by authors who lived between 450-1500, with provenance events that occurred in France in the 17th century: https://api.triplydb.com/s/_9cC7UFM-

The strategy for limiting authors by their life dates is similar to the route taken in the previous query to find 17th-century provenance events. Since the research question is interested in works composed by medieval authors, our results need to be limited to authors who lived during the medieval period, which we defined as between 450-1550 CE. Works are linked to their authors via the mmm:carried_out_by_as_possible_author predicate, as seen in lines 33-34. Authors with known life dates will have their birth and/or death dates (which could each vary widely in specificity, from a precise date to a range of time) stored separately in the database, so we need to filter on birth and death dates separately. The parameters for the authors' births are stated in lines 35-39. We link from the author to their birth event (line 35), from that birth event to the timespan for that event, and then to the beginning of the timespan (?author_birth_begin). Just as in the previous query, we use the BIND and FILTER functions on lines 38-39 to extract the year from the timespan and then filter to include only years that are less than 1500 but greater than 450. We use the same process to limit authors' death dates to the medieval period in lines 41-45, by using predicates specific to author death events and limiting the dates to between 500 and 1550.
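The birth-date half of this filter might be sketched as follows. Only mmm:carried_out_by_as_possible_author and the line references come from the text; the birth-event link, timespan properties, and namespace URIs are assumptions for illustration.

```sparql
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX mmm:  <http://ldf.fi/mmm/schema/>

SELECT ?work ?author
WHERE {
  # works linked to their (possible) authors, as in lines 33-34
  ?work mmm:carried_out_by_as_possible_author ?author .
  # author -> birth event -> timespan -> beginning of the timespan
  ?birth ecrm:P98_brought_into_life ?author .
  ?birth ecrm:P4_has_time-span ?birth_timespan .
  ?birth_timespan crm:P82a_begin_of_the_begin ?author_birth_begin .
  # keep only authors born between 450 and 1500 (lines 38-39)
  BIND (YEAR(?author_birth_begin) AS ?birth_year)
  FILTER (?birth_year < 1500 && ?birth_year > 450)
  # the death-date filter (lines 41-45) follows the same pattern via
  # death-event predicates, limiting death years to between 500 and 1550
}
```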
About 20% of these records had no associated date. The second query returned 47,805 results. More than 96% of these were generic "manuscript-related events" rather than transfers of custody. About 50% of them had no associated date. The third query reduced the number of records drastically. Only 1,757 records were identified as "transfer of custody" or "observed ownership" events that could be localized to 17th-century France. §45 When authors' birth and death dates were added to find "medieval" authors in the fourth query, the number of results was further reduced to 1,262 records. This

Lessons learned §46
The complexity of this research question meant that we had to break it down into parts, write queries for each part, and then assemble these into a single SPARQL query.
It also revealed that questions can be expressed in a form that is difficult to map to the terms and relationships used in the data model and the aggregated dataset. To approach a set of results that could be used to answer the question "what was the most popular text?" meant tackling a series of definitional problems and making choices about how best to define them in the context of the MMM data.

§47 The phrase "most popular text" is ambiguous, for a start. It could mean the most-read text, the most-quoted text, the most-owned text, or the most-circulated text. Only the latter two have any relevance to the MMM data, since they can be expressed respectively in terms of "the text in those manuscripts with the most recorded owners" in 17th-century France, or "the text in those manuscripts with the most ownership events" in the same period. Does the question refer to manuscript owners associated with France, or manuscript provenance events which occurred in France? Does "medieval author" cover anonymous or pseudonymous works and expressions as well as those with known authors? If so, how do we identify anonymous "medieval" texts, since works and expressions do not have dates directly associated with them?

§48 Whatever choices were made in relation to these definitional difficulties, the important point was to ensure that those choices were documented and explained. It might also have been possible to consider reframing the question in a less prescriptive way: "Which manuscripts with medieval texts were owned by French collectors in the 17th century?" This could have been addressed by identifying owners living in France in the 17th century and looking at the manuscripts they owned and the associated works.

§49 As mentioned earlier, one factor affecting these results significantly is that titles of works have not been harmonized across translations in different languages.
There is little in the way of authoritative Linked Open Data vocabularies and identifiers for medieval and Renaissance works, and the absence of consistent conventional titles for works in this period makes the process of reconciling them between their occurrences in different manuscripts extremely difficult (Sharpe 2003). Without this kind of reconciliation, we cannot easily construct a query that takes a work and looks for all manuscripts containing that work. We would either have to identify all the variant titles of a work and include them in the query, or focus on manuscripts and authors instead. The way in which the SDBM treats multiple works in a single manuscript (as described above) also has a significant effect on queries of this kind.
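One partial workaround, where the variants of a title are already known, is to enumerate them in the query itself. The sketch below uses Augustine's De civitate Dei as an example of a work with title variants across languages; the mmm:contains_work predicate and label property are assumptions, and prefix declarations are omitted:

```sparql
SELECT DISTINCT ?manuscript WHERE {
  ?manuscript mmm:contains_work ?work .     # assumed predicate
  ?work skos:prefLabel ?work_label .
  # enumerate known variant titles across languages
  VALUES ?variant { "de civitate dei" "city of god" "cité de dieu" }
  FILTER(CONTAINS(LCASE(?work_label), ?variant))
}
```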

New research questions and wider explorations §50
The three case studies considered so far attempted to apply research questions devised before the data model was designed and implemented. The results of each query were mixed. While the simplest query produced the expected results primarily because of its simplicity, it did not really test the ability of the dataset to return results. The more complex questions with a significantly greater degree of ambiguity were much harder to translate into the elements and relationships expressed in the data model. They revealed, amongst other things, that some queries could be too specific or too complex in their combination of criteria to produce meaningful results. They also revealed that some relationships in the MMM data (e.g., between authors and works) were too ambiguous to produce reliable results. And they showed that questions involving premodern languages or premodern political and administrative jurisdictions needed careful mapping to modern authoritative vocabularies for places and languages.

But we can filter the manuscripts based on their text titles, which are modeled as work labels in MMM (Figure 11). Line 26 includes a FILTER clause that refines the results to include only manuscripts containing works whose titles include the characters "missal," "gradual," "breviar," or "antiphon." The character strings will be matched exactly, so the search strings are abbreviated in order to accommodate the various spellings found in MMM's data sources. Likewise, each CONTAINS function also includes the LCASE function to convert the values in the ?work_label field to all lowercase letters so as to include both upper- and lowercase title variations in our FILTER clause. This concludes the subquery. As a last step, we group our results according to the ?manuscript variable on line 33 to ensure that our results table displays information for only one manuscript per row.
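The filtering and grouping steps described here can be sketched as follows (predicate names are assumptions, prefix declarations omitted):

```sparql
SELECT ?manuscript (SAMPLE(?work_label) AS ?example_title) WHERE {
  ?manuscript mmm:contains_work ?work .   # assumed predicate
  ?work skos:prefLabel ?work_label .
  # lowercase the label so that e.g. "Missale" and "MISSAL" both match,
  # and use abbreviated strings to catch Latin/English/French spellings
  FILTER(CONTAINS(LCASE(?work_label), "missal")  ||
         CONTAINS(LCASE(?work_label), "gradual") ||
         CONTAINS(LCASE(?work_label), "breviar") ||
         CONTAINS(LCASE(?work_label), "antiphon"))
}
GROUP BY ?manuscript   # one row per manuscript
```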

triplydb.com/s/-9C8qoZtb
Figure 12: SPARQL query for Query 4: Step 2 (lines 21-37).

§58 This query is nearly identical to the previous query, except that it includes three extra FILTER functions to refine the results further. On lines 30-31, two FILTER functions state that all height and width measurements included in the calculations must be greater than 39 millimeters and less than 500 millimeters. Filtering the results in this way helps ensure that our results do not include typos or other data entry mistakes that sometimes appear in the measurement data. Line 37 filters the production year results to include only manuscripts produced on or after 700 CE. The choice to filter by this production year stems from a cosmetic need to produce a chart of the results that is easier to read. Since few manuscripts in the MMM dataset were produced before 700 CE, removing those manuscripts from the results creates a more compact x-axis and greater legibility of the individual data points in the chart.

Alternative

The SELECT statement includes two different sets of ratios, one for missals (line 12) and one for breviaries (line 13). §60 To calculate these two different ratios, we use the same subquery strategy as employed previously, but a UNION clause (line 40) allows the results to be displayed together. The first subquery (beginning on line 17) calculates the data for missals, using the FILTER function to isolate those manuscripts that have the characters "missal" in their work label (line 29). §61 This exact structure is copied in the second subquery (beginning on line 41), except in this case the FILTER function finds works containing the characters "breviar" (line 53). To distinguish the two results, the averages related to breviaries are called ?b_height_mm_average and ?b_width_mm_average.
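The overall shape of the alternative query, with its two parallel subqueries joined by UNION, can be sketched as follows; the predicate names and the exact projection are illustrative assumptions:

```sparql
SELECT ?manuscript ?type ?ratio WHERE {
  {
    SELECT ?manuscript ("missal" AS ?type)
           ((AVG(?height) / AVG(?width)) AS ?ratio) WHERE {
      ?manuscript mmm:height_mm ?height ;    # assumed predicates
                  mmm:width_mm  ?width ;
                  mmm:contains_work ?work .
      ?work skos:prefLabel ?work_label .
      FILTER(CONTAINS(LCASE(?work_label), "missal"))
      # exclude likely data-entry errors
      FILTER(?height > 39 && ?height < 500)
      FILTER(?width  > 39 && ?width  < 500)
    } GROUP BY ?manuscript
  }
  UNION
  {
    # identical structure, filtering on "breviar" instead
    SELECT ?manuscript ("breviary" AS ?type)
           ((AVG(?height) / AVG(?width)) AS ?ratio) WHERE {
      ?manuscript mmm:height_mm ?height ;
                  mmm:width_mm  ?width ;
                  mmm:contains_work ?work .
      ?work skos:prefLabel ?work_label .
      FILTER(CONTAINS(LCASE(?work_label), "breviar"))
      FILTER(?height > 39 && ?height < 500)
      FILTER(?width  > 39 && ?width  < 500)
    } GROUP BY ?manuscript
  }
}
```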

Results §62
Step 1 of the original query visualizes the height-to-width ratios for 4,513 liturgical manuscripts (Figure 14). It includes missals, graduals, breviaries, and antiphonaries, but the ratios are not distinguished by type of manuscript. There are no limits on the date of production, or on the size of the ratios. Because there are four outlying ratios between 8.636 and 30.831, as well as a small number of early production dates, the remaining results are heavily compressed and their details cannot easily be seen.

Figure 14: Height-to-width ratios of liturgical manuscripts.

§63 Step 2 of the original query visualizes the height-to-width ratio of 4,030 liturgical manuscripts by limiting the production period to 700 to 1800 CE (Figure 15).
The variation in ratios is also limited by the exclusion of manuscripts with dimensions larger than 500 mm or smaller than 39 mm. It includes missals, graduals, breviaries, and antiphonaries, but the ratios are not distinguished by type of manuscript. The results are clustered around 1.25 to 1.6; most manuscripts were produced in the 14th or 15th centuries. The clusters of results for the years 900, 1000, 1100, 1200, 1300, and 1400 reflect the use of start dates for estimated production year ranges. Another version of this query makes use of end dates as well, to smooth out this kind of clustering: https://api.triplydb.com/s/uG86O-AIC.

Figure 15: Height-to-width ratios of liturgical manuscripts produced between 700-1800 CE.

§64 The alternative query compares the height-to-width ratio of two different types of liturgical manuscripts produced during the period 700 to 1700 CE (Figure 16).
The total number of manuscripts involved is 12,169. Missals are shown as blue dots and breviaries appear in red. Most of the manuscripts fall within the range 1.0 to 2.0, with the majority between 1.25 and 1.6. There is considerable similarity between the two different types. Relatively few manuscripts have ratios less than 1.0 (i.e., with their width greater than their height).

Lessons learned §65
Neither MMM nor the source datasets provide information about the categories or subjects of works, so liturgical manuscripts had to be identified by keyword searches on uniform titles. Fortunately, these are generally common to Latin, English, and French, such as missal/missale, antiphonal/antiphonarium, breviary/breviarium, and so on. The initial query produced a single set of ratios regardless of the specific type of liturgical manuscript; later refinement visualized the ratios for the specific types separately, enabling comparisons between them.

§66 Dimensions are likely to have multiple values in the SDBM, reflecting different descriptions from different observations of the same manuscript. The same kind of variation can also be found for the same manuscript in two or three of the data sources. We dealt with this by averaging the height and the width across the different values.

§67 Some problems were identified with the source data, including records that had height but not width, and some cases where mm and cm measurements were mixed together. These could produce incorrect ratios, since the query works by adding up the raw figures and then dividing by the number of values.

§68 Production date ranges are often approximate, for example, "1300-1400" or "1225-1250." We dealt with this initially by taking the earliest date in the date range, that is, "1300" and "1225" in these two cases. Further refinement of this query involved calculating an average for production date ranges (e.g., 1400-1450 as 1425), to avoid results bunching together at 1400 for 15th-century manuscripts.

§69 Several outliers were noticeable in Figure 16, including one with a ratio of 30 (not shown). These were checked to see if they reflected an error in the source data, but the extreme outlier was found to be a roll rather than a codex, an unexpected result that could challenge assumptions about the use and readership of liturgical manuscripts in the Middle Ages.
Our choice to remove outliers from the results meant that a more granular display of results in Yasgui became possible, but at the expense of a fuller and more accurate representation of variations in the data, as the case of the roll breviary indicates. Further, excluding outlying values for height and width actually affected the ratio calculations for some manuscripts and produced incorrect values. Excluding outlying ratios might be a better way of achieving this goal.

§70 As originally formulated, the query obscured whether height or width was the larger dimension, since the ratio was constructed by dividing the larger dimension by the smaller one, regardless of which was the height or width. The resulting ratios were always 1.0 or greater. A different formulation of the query was required to show the ratio of height to width consistently; the results then included ratios lower than 1.0, in cases where a codex was wider than it was long. Choosing between these queries depends on the ultimate goal of the research: is it simply to find the average relative proportions of a manuscript, or is it examining the orientation and layout of the pages as well?

§71 The resulting scatter plot showing ratios for 12,169 individual manuscripts, coloured according to their type, provided a very effective visual representation of a relatively large body of data. But these queries also made clear the importance of consistent approaches to recording this kind of data and documenting the assumptions made in analyses of the data.
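The date-range refinement mentioned in §68, taking the midpoint of a production date range rather than its start year, can be sketched as follows (predicate names assumed, prefix declarations omitted):

```sparql
# production event -> timespan with a start and an end date
?production crm:P4_has_time-span ?span .
?span crm:P82a_begin_of_the_begin ?range_start ;
      crm:P82b_end_of_the_end     ?range_end .
# e.g. a range of 1400-1450 yields a production year of 1425
BIND((YEAR(?range_start) + YEAR(?range_end)) / 2 AS ?production_year)
```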

Query 5: How long did the bookseller James Tregaskis keep manuscripts in his stock?
§72 The next query, derived from the work of the Cultivate MSS project, considers the length of time books remained in the stock of a particular dealer. In this case, we looked at the London dealer James Tregaskis, who was a prolific producer of catalogues, many of which have been entered into the SDBM as part of the project and are now searchable as LOD within the MMM portal (Worms 2016). The query filters the results for Tregaskis in lines 12-13, and the transfer count is limited to those manuscripts with 2 or more transfers (line 25).

Figure 19b: SPARQL query for Query 5: Step 3 (lines 24-43, relating to Quaritch).

§77 Step 3 of the query brings together the results of the previous two queries for easier comparison. This involves creating two similar sub-queries, one for Tregaskis (lines 11 to 27) and one for Quaritch (lines 29 to 43), combining them with a UNION command (line 28), and displaying the duration and the transfer counts from the two sub-queries in an overarching SELECT statement (line 9). The transfer count in each sub-query is limited to those manuscripts with 2 or more transfers (lines 26 and 42).
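One of the two sub-queries might be sketched as follows; the dealer label and most predicate names are assumptions for illustration, and an identical sub-query for Quaritch would be attached with UNION:

```sparql
SELECT ?manuscript (COUNT(?transfer) AS ?transfer_count)
       ((MAX(?year) - MIN(?year)) AS ?duration_years) WHERE {
  ?transfer crm:P30_transferred_custody_of ?manuscript ;
            crm:P28_custody_surrendered_by ?dealer ;     # the seller
            crm:P4_has_time-span/crm:P82a_begin_of_the_begin ?date .
  ?dealer skos:prefLabel "Tregaskis, James" .            # assumed label form
  BIND(YEAR(?date) AS ?year)
}
GROUP BY ?manuscript
# keep only manuscripts with 2 or more transfers
HAVING(COUNT(?transfer) >= 2)
```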

5.5.1.4
Step 4 (Figures 20a-b): Comparison of Tregaskis and Quaritch stock between 1901-1920: https://api.triplydb.com/s/syyzeyQ_q

as are the number of listings, but does this reflect anything more than the much longer time period over which this firm has operated? In some cases, Quaritch had bought back (and re-sold) a manuscript originally sold by the firm some decades earlier, so the manuscript was not actually kept in stock for the whole period in question.

§86 Step 4 of the query produced 203 results for manuscript transfers between 1901 and 1920. The maximum duration was 6,605 days (18 years), and the maximum number of transfers during these 20 years was eight. This visualization makes it clear that Tregaskis was likely to list the same manuscript many more times than Quaritch during this period, and usually within a significantly shorter period of time (Figure 24). The most common combination is visible in the largest blue bubble in the lower left of the chart: a duration of 792 days and a transfer count of 2, with Quaritch as the seller. A total of 30 manuscripts have this combination. The configuration of the bubble chart has been used to limit the maximum duration shown to 5,000 days, for the sake of visibility (Figure 25).

There is no way of distinguishing manuscripts that were sold, bought back, and sold again by Quaritch from those that were kept in stock for a number of consecutive years and advertised in multiple catalogues during that period. The duration in stock is simply calculated from the earliest listing to the last recorded listing. Manuscripts with a duration of zero days between two listings may have been advertised twice in the same year, without a specific day or month being recorded, but these entries may also reflect two versions of the same catalogue or stock list entered separately in the Schoenberg Database.

collector, and archaeologist to politician, statistician, and lawyer, among others.
§97 As these results show, this query strategy requires both the MMM person and the Wikidata person to have a shared VIAF identifier to return results. Our results point to the broader problem of the lack of representation of a large number of actors in available authorities and LOD resources. A systematic import of Wikidata identifiers into MMM (or into the source datasets) would increase results, but the problem will not be fully addressed until actors in underrepresented social groups and minorities are given better data representation in these resources.
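The identifier-matching strategy described in §97 can be sketched as a federated query. The MMM-side linking pattern (an owl:sameAs link to a VIAF URI) is an assumption for illustration, while wdt:P214 (VIAF ID) and wdt:P106 (occupation) are Wikidata's actual properties; prefix declarations are omitted:

```sparql
SELECT ?actor ?wd_person ?occupationLabel WHERE {
  # MMM side: an actor linked to a VIAF URI (linking pattern assumed)
  ?actor a ecrm:E21_Person ;
         owl:sameAs ?viaf_uri .
  FILTER(STRSTARTS(STR(?viaf_uri), "http://viaf.org/viaf/"))
  BIND(STRAFTER(STR(?viaf_uri), "viaf/") AS ?viaf_id)
  # Wikidata side: match on the shared VIAF identifier
  SERVICE <https://query.wikidata.org/sparql> {
    ?wd_person wdt:P214 ?viaf_id ;      # VIAF ID
               wdt:P106 ?occupation .   # occupation
    ?occupation rdfs:label ?occupationLabel .
    FILTER(LANG(?occupationLabel) = "en")
  }
}
```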

Query 6: What is known about the social backgrounds of 19th- and 20th-century British collectors?
6 Conclusion

§98 The weekly SPARQL workshop held by the MMM project began as a knowledge transfer activity designed to teach the practical skill of writing SPARQL queries, but gradually developed into a wider investigation of the use of SPARQL to analyze the data, explore broader types of research questions, and assess the research potential of the MMM aggregated dataset and its Knowledge Graph. The benefits of investing over 500 hours of staff time in learning and practicing SPARQL queries can be seen in various ways, beginning with a diagnostic approach to identifying limitations in the data aggregated by the MMM project. This includes areas (like the different types of events) where the data sources do not enable an optimum level of granularity in the MMM data model. The source datasets do not collect the same information or, sometimes, when they do collect the same information, it is not computationally accessible via the same methods. This is more than a matter of improved mapping and transformation. Information that is explicit in one dataset may be only inferred from another. Discrete pieces of information in one source may be stored in aggregated form in another.

§99 Like most collection-based humanities datasets and their interfaces, the MMM data sources are designed to produce lists of items (manuscripts) meeting certain criteria, rather than supporting statistical analyses. The price data in the SDBM, for example, are purely descriptive and do not provide an adequate basis for quantitative analysis, even within a SPARQL query. On the other hand, some contextual information that is outside the scope of the source datasets can be added on the fly in SPARQL queries, as our work with person data from Wikidata shows. This also reinforced the importance of Linked Open Data identifiers in enabling this kind of approach and raised some significant questions about future strategies for including identifiers in datasets like those used by MMM.
§100 There are signs that being able to write SPARQL queries is becoming a useful practical skill for humanities researchers. The popular humanities data management, network analysis, and visualisation environment nodegoat, for example, recently added functionality for using SPARQL queries to import contextual data from Linked Open Data sources (nodegoat 2021). SPARQL remains challenging to learn, even when using a detailed and well-documented data model like MMM's, and requires a certain amount of trial and error. The Yasgui interface used in the MMM workshop offers some diagnostic help with formulating queries correctly, but its main advantages are the built-in visualizations. Its new "Geo events" display, which can produce timelines and map-based event sequences, has also been tested against MMM data. (See this query: https://api.triplydb.com/s/u_-KEd-US.) But it would help to have a more visual approach to constructing the SPARQL queries themselves, in which data models and namespaces can be visualized for selecting entities and properties. One recent project has designed a visual interface for constructing SPARQL queries in the humanities, known as Gravsearch, but this has to be used within the Knora software package (Schweizer and Geer 2021).

§101 More generally, the workshop resulted in a better understanding of how querying data in a computational context works. For the humanists on the team, learning the technical language and structures of SPARQL also showed them how to develop more ambitious approaches to the MMM data, transforming the traditional research questions that had shaped the initial data modelling work into more sophisticated and expansive queries that took full advantage of the MMM data model. As a result, the returned data from these queries better reflected the true value of the combined dataset for humanistic research.
For the computer scientists, the more evolved approach to querying led to a better understanding of the complex research questions that are of interest to manuscript researchers, and to better analysis for determining the success of the project.

§102 As these case studies show, querying the MMM dataset via its SPARQL endpoint does not produce perfect results, or results that provide a definitive answer in the traditional sense to the research questions. The methodology presented in these case studies follows the principles of distant reading, whereby computational aggregation and analysis of the data presented in returned results brings new insights into, and raises new questions about, the nature of the data and the subject it represents, in this case premodern manuscripts (Moretti 2013). While one would not want to draw hard conclusions from the results achieved in these queries, we hope to have shown that the process of learning and experimenting in a SPARQL environment brings three important benefits: 1) a better understanding of a complex and imperfect dataset, 2) a better understanding of how manuscript description and associated data involving the people and institutions involved in the production, reception, and trade of premodern manuscripts needs to be presented to better facilitate computational research, and 3) an awareness of the need to further develop data literacy skills among researchers in order to take full advantage of the wealth of unexplored data now available to them in the Semantic Web (Koltay 2015).