Nanopublication beyond the sciences: the PeriodO period gazetteer

The information expressed in humanities datasets is inextricably tied to a wider discursive environment that is irreducible to complete formal representation. Humanities scholars must wrestle with this fact when they attempt to publish or consume structured data. The practice of “nanopublication,” which originated in the e-science domain, offers a way to maintain the connection between formal representations of humanities data and its discursive basis. In this paper we describe nanopublication, its potential applicability to the humanities, and our experience curating humanities nanopublications in the PeriodO period gazetteer.

Subjects: Digital Libraries, World Wide Web and Web Science

Humanities scholars who wish to make their research materials usable with networked digital tools face a common dilemma: How can one publish research materials as "data" without severing them from the ideas and texts that originally gave them meaning? The kinds of information produced in the humanities (biographical details, political and temporal boundaries, and relationships between people, places, and events) are inextricably tied to arguments made by humanities scholars. Converting all, or even much, of the information expressed in scholarly discourse into algorithmically processable chunks of formal, structured data has so far proven to be extraordinarily difficult.

But rather than attempt to exhaustively represent her research, a scholar can promote small pieces of information within her work using the practice of nanopublication (Mons and Velterop, 2009). Nanopublications include useful and usable representations of the provenance of structured assertions. These representations of provenance are useful because they allow consumers of the published data to make connections to other sources of information about the context of the production of that data. In this way, they strike a balance between the needs of computers for uniformity in data modeling and the needs of humans to judge information based on the wider context of its production. An emphasis on connecting assertions with their authors is particularly well-suited for the needs of humanities scholars. By adopting nanopublication, creators of datasets in the humanities can focus on publishing small units of practically useful, curated assertions while keeping a persistent pointer to the basis of those claims (the discourse of scholarly publishing itself) rather than its isolated representation in formal logic.

We offer as an example of this approach the PeriodO period gazetteer, which collects definitions of time periods made by archaeologists and other historical scholars. A major goal of the gazetteer was to make period definitions parsable and comparable by computers, while also retaining links to the broader scholarly context in which they were conceived. We found that a nanopublication-centric approach allowed us to achieve this goal. In this paper, we describe the concept of nanopublication, its origin in the hard sciences, and its applicability to the humanities. We then describe the PeriodO period gazetteer in detail, discuss our experience mapping nonscientific data into nanopublications, and offer advice to other humanities-oriented projects attempting to do the same.

Nanopublication is an approach to publishing research in which individual research findings are modeled as structured data in such a way that they retain information about their provenance. This is in contrast to both traditional narrative publishing, where research findings are not typically published in a structured, computer-readable format, and "data dumps" of research findings, which are typically published without any embedded information about their origin or production. The nanopublication approach is motivated by a desire to publish structured data without losing the wider research context and the benefits of traditional scholarly communication (Groth et al., 2010).

Nanopublication emerged from work in data-intensive sciences like genomics and bioinformatics, where recent advances in computational measurement techniques have vastly lowered the barrier to collecting genetic sequencing data. As a result, millions of papers have been published with findings based on these new methods. However, the reported results are almost always published in the form of traditional narrative scholarly publications (Mons et al., 2011). While narrative results can be read and understood by humans, they are not so easily digested by computers. In fields where computation has been the key to the ability to ask new and broader questions, it should surely be the case that research results are published in such a way that they are able to be easily parsed, collected, and compared by computer programs and the researchers who use them.

On the occasions when research data are released and shared, they are often distributed on their own, stripped of the context necessary to locate them within a broad research environment (the identity of the researchers, where and how this research was conducted, etc.). In this case, publishing practice has swung too far to the opposite extreme. In the service of creating and sharing discrete datasets, the published results have been stripped of their provenance and their position within the wider scholarly endeavor that culminated in their publication. This contextual information is crucial for researchers to determine the trustworthiness of a dataset and to learn about the broader project of research from which it resulted.

Nanopublication offers a supplementary form of publishing alongside traditional narrative publications.

A nanopublication consists of three parts, all representable by RDF graphs:

1. An assertion (a small, unambiguous unit of information)
2. The provenance of that assertion (who made that assertion, where, when, etc.)
3. Publication information about the nanopublication itself (who created it, and when)
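As a rough illustration of this three-part structure, the sketch below models a nanopublication as three named groups of triples in plain Python. It is not the RDF serialization used by any particular project: the class name, the `ex:` identifiers, and the predicate names are all hypothetical, and the third group follows the Nanopublication Guidelines' notion of publication information about the nanopublication itself.

```python
from dataclasses import dataclass, field

# A triple is (subject, predicate, object); plain strings stand in for IRIs.
Triple = tuple[str, str, str]


@dataclass
class Nanopublication:
    """Three named groups of triples, one per part of a nanopublication."""
    assertion: list[Triple] = field(default_factory=list)
    provenance: list[Triple] = field(default_factory=list)
    publication_info: list[Triple] = field(default_factory=list)

    def is_complete(self) -> bool:
        # All three parts must contain at least one triple.
        return all([self.assertion, self.provenance, self.publication_info])


# Every identifier below is a hypothetical placeholder, not a real vocabulary.
np1 = Nanopublication(
    assertion=[("ex:gene1", "ex:isAssociatedWith", "ex:disease1")],
    provenance=[("ex:assertion1", "ex:wasDerivedFrom", "ex:article1")],
    publication_info=[("ex:nanopub1", "ex:createdBy", "ex:curator1")],
)

print(np1.is_complete())
```

Keeping the three groups separate is the point of the design: the assertion can be compared or aggregated by machines, while the other two groups keep the pointer back to who said it and who packaged it.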

Authors are encouraged to include the smallest possible unambiguous pieces of information as the assertions at the center of a nanopublication. In the bioscience context, these assertions could range from statements of causality, to measurements of gene expressions or gene-disease associations, to statistics about drug interactions. The scope and nature of appropriate units of nanopublication inevitably vary by [...]

[...] and medical disciplines to eventually embrace nanopublication, he is less sure that nanopublication will work for the humanities. Historians, for example, use relatively little specialized terminology and pride themselves on their ability to use "ordinary language" to represent the past. Even when humanities scholars use specialized theoretical language, their use of this language is often unstable, ambiguous, and highly contested. Perhaps, then, a publishing technique that seeks to eliminate such ambiguity is ill-suited for these fields.

A related obstacle to the adoption of nanopublication beyond the hard sciences has to do with differences in the role played by "facts". Researchers trained in the hard sciences understand their work to be cumulative: scientists "stand on the shoulders of giants" and build upon the work of earlier researchers. While scientists can in principle go back and recreate the experiments of their predecessors, in practice they do this only when the results of those experiments have not been sufficiently established as facts. Efficient cumulative research requires that, most of the time, they simply trust that the facts they inherit work as advertised. Something like this process seems to be assumed by many proponents of nanopublications. For example, Mons and Velterop (2009) claim that a major goal of nanopublication is to "elevate" factual observations made by scientists into standardized packages that can be accumulated in databases, at least until they are proved wrong. These standardized packages can then be automatically or semi-automatically analyzed to produce new factual observations (or hypotheses about potential observations), and the cycle continues.

Yet as Mink (1966) observed, not all forms of research and scholarship are aimed at producing "detachable conclusions" that can serve as the basis for a cumulative process of knowledge production.

[...] indexes of bibliographical data that make humanities scholarship possible (Buckland, 2006). Some of these facts may be vague or uncertain, but as Kuhn et al. (2013) observe, even knowledge that cannot be completely formally represented, including vague or uncertain scientific findings, can benefit from the nanopublication approach. We agree but would go further to say that nanopublication is useful even for information that is neither testable nor falsifiable, exemplified by Mink's synoptic judgments. We have demonstrated the utility of nanopublications for describing synoptic judgments of historical periodization in the PeriodO period gazetteer, which we describe below.

In their work, archaeologists and historians frequently refer to time periods, such as the "Classical

This leads to difficulty and repeated effort when scholars want to visualize their data in space and over time, which requires mapping these discursive period labels to discrete spatiotemporal ranges (Rabinowitz, [...]

This sentence contains two assertions defining period extents, so it is modeled in PeriodO as two period definitions. The first definition has the label "Classical Iberian Period" and its start and end points are labeled as "400 BC" and "200 BC" respectively. The second definition has the label "Early Iberian Period" and its start and end points are labeled as "525 BC" and "400 BC" respectively. The spatial extent of both definitions is labeled as "Catalan area". All of these labels are taken verbatim from the source text and should never change.

Because they come from the same source, these two period definitions are grouped into a period collection. The bibliographic metadata for the source article is associated with this period collection.
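The grouping described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class and field names are hypothetical and do not reproduce the actual PeriodO data model, and the `source` string is a placeholder for the bibliographic metadata. The label values are the ones quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen: verbatim labels should never change
class PeriodDefinition:
    label: str          # period name as given in the source
    start_label: str    # start point, verbatim
    end_label: str      # end point, verbatim
    spatial_label: str  # spatial extent, verbatim


@dataclass
class PeriodCollection:
    source: str  # placeholder for the source's bibliographic metadata
    definitions: list[PeriodDefinition] = field(default_factory=list)


# One collection per source; both definitions come from the same text.
belarte = PeriodCollection(
    source="Belarte (bibliographic details omitted in this sketch)",
    definitions=[
        PeriodDefinition("Classical Iberian Period", "400 BC", "200 BC", "Catalan area"),
        PeriodDefinition("Early Iberian Period", "525 BC", "400 BC", "Catalan area"),
    ],
)

for definition in belarte.definitions:
    print(definition.label, definition.start_label, definition.end_label)
```

Attaching the source to the collection rather than to each definition mirrors the text: provenance is recorded once, and every definition in the collection inherits it.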

(In the event that a source defines only a single period, then the period collection will be a singleton.) Belonging to the same period collection does not imply that period definitions compose a periodization.

Belarte's collection of period definitions is given in Figure 1.[6]

PERIODO AS LINKED DATA

We have taken pains to make it easy to work with the PeriodO dataset, particularly keeping in mind [...]

The start, latest start, earliest end, end approach enables us to represent the most common patterns [...]

[5] Proleptic refers to dates represented in some calendar system that refer to a time prior to that calendar's creation. The Gregorian calendar was adopted in 1582, but most of our dates fall in years prior to that one.
[6] Turtle is a human-readable syntax for serializing RDF graphs (Carothers and Prud'hommeaux, 2014).

Manuscript to be reviewed
Computer Science

[...] uncertainty in order to maximize precision and recall with respect to temporal relevance judgments made by experts. We have chosen not to support such more complex representations at this time because we are focused primarily on representing periods as defined in textual sources. Natural language is already a compact and easily indexable way to represent imprecision or uncertainty. Rather than imposing an arbitrary mapping from natural language to parameterized curves, we prefer to maintain the original natural language terms used. However, if scholars begin defining periods with parameterized curves (which is certainly possible) then we will revisit this decision.

[...] statements that refer to the version that they were descended from.

We publish a changelog at http://n2t.net/ark:/99152/p0h#changelog that represents [...]

The current version of the Nanopublication Guidelines includes a note suggesting that the guidelines be amended to state that an assertion published as a nanopublication should be "a proposition that is falsifiable, that is to say we can test whether the proposition is true or false" (Groth et al., 2013). Were this amendment to be made, PeriodO nanopublications would be in violation of the guidelines, as period definitions in PeriodO, like most of the information produced in the humanities, are neither testable nor falsifiable. Consider the assertion "there is a period called the Late Bronze Age in Northern Europe, and it lasted from about 1100 B.C. to 500 B.C." The "Late Bronze Age" is a purely discursive construct. There was no discrete entity called the "Late Bronze Age" before it was named by those studying that time and place. Consequently, one cannot disprove the idea that there was a time period called the "Late Bronze Age" from around 1100 B.C. to 500 B.C.; one can only argue that another definition has more credence based on non-experimental, discursive arguments.

The proposed falsifiability requirement makes sense in certain contexts. Computational biologists, for example, wish to connect, consolidate, and assess trillions of measurements scattered throughout a rapidly growing body of research findings. Their goal is to create a global, connected knowledge graph that can be used as a tool for scientists to guide new discoveries and verify experimental results. In the PeriodO context, however, we are not concerned with making an exhaustive taxonomy of "correct" periods or facilitating the "discovery" of new periods (a non sequitur: there are no periods that exist in the world that are awaiting discovery by some inquiring historian or archaeologist). Instead we are interested in enabling the study and citation of how and by whom time has been segmented into different periods. It is not necessary that these segmentations be falsifiable to achieve this goal; they only need to be comparable.

Kuhn et al. (2013) expressed concern that requiring formal representation for all scientific data published as nanopublications "seems to be unrealistic in many cases and might restrict the range of [...]

Provenance is particularly important for non-scientific datasets, since the assertions made are so dependent on their wider discursive context. When assertions cannot be tested experimentally, understanding context is critical for judging quality, trustworthiness, and usefulness.

[...] (4) updates to show the distribution of temporal extents defined by these various sources. Users can query for period definitions with temporal extents within a specific range of years using the time range facet (5), period definitions with spatial extents within a named geographic area using the spatial coverage facet (6), or period definitions in specific languages using the language facet (7). Queries may combine values from any of these facets.
[...] uncertain and partially arbitrary precision suggested by "around the beginning of the 12th century BC", we cannot assume the same of computers. Therefore, in order for our dataset to be readily algorithmically comparable, we had to map discursive concepts to discrete values. Our curatorial decisions in this regard reflect a compromise between uniformity, potential semantic expressiveness, and practical usefulness.

As humanities scholars publish their own nanopublications (or linked data in general), they will go through similar curatorial processes due to the interpretive, unstandardized nature of humanities datasets discussed above. There is a temptation in this process to imagine perfect structured descriptions that could express all possible nuances of all possible assertions. However, chasing that goal can lead to overcomplexity and, in the end, be practically useless. In describing period assertions as linked data, we adopted a schema that was only as semantically complicated as was a) expressed in our collected data and b) necessitated by the practical needs of our intended users. As we started to collect data, we considered the basic characteristics of a dataset that would be necessary to accomplish the retrieval and comparison tasks that our intended users told us were most important. These tasks included: [...]

To encourage a consistent representation of temporal extent for all period definitions, we built a simple grammar and parser for date expressions that covered the vast majority of our sample data. The parser takes in a string like "c. mid-12th century" and outputs a JSON string consistent with our data model.

It can also produce naïve interpretations of descriptions like "mid-fifth century", assigning them to the third of the epoch described according to the conventional segmentation of "early," "mid," and "late."

[...] the Portable Antiquities Scheme database of archaeological finds in the UK.[9]
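The century-based parsing described above (a string such as "c. mid-12th century" turned into JSON, with "early," "mid," and "late" mapped to thirds of the century) can be sketched as follows. This is a minimal illustration, not the project's grammar or parser: the function name, the output keys, the restriction to numeric ordinals, and the exact 33/33/34-year split are all assumptions made for the example. BC years use astronomical numbering, in which 1 BC is year 0.

```python
import json
import re

# Optional "c." prefix, optional early/mid/late, a numeric ordinal, optional era.
PATTERN = re.compile(
    r"^(?P<circa>c\.\s*)?(?:(?P<part>early|mid|late)[-\s])?"
    r"(?P<n>\d+)(?:st|nd|rd|th)\s+century(?:\s+(?P<era>BC|AD))?$",
    re.IGNORECASE,
)

# Year offsets from the first year of the century (assumed split into thirds).
THIRDS = {"early": (0, 32), "mid": (33, 65), "late": (66, 99)}


def parse_century(text: str) -> dict:
    """Map a century expression to an inclusive range of years."""
    m = PATTERN.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported date expression: {text!r}")
    n = int(m["n"])
    lo, hi = THIRDS.get((m["part"] or "").lower(), (0, 99))
    if (m["era"] or "AD").upper() == "BC":
        # The nth century BC starts in year n*100 BC, i.e. 1 - n*100.
        first = 1 - n * 100
    else:
        # The nth century AD starts in year (n-1)*100 + 1.
        first = (n - 1) * 100 + 1
    return {
        "label": text,                          # keep the verbatim source label
        "approximate": m["circa"] is not None,  # True when "c." was present
        "earliest_year": first + lo,
        "latest_year": first + hi,
    }


print(json.dumps(parse_century("c. mid-12th century")))
```

Under these assumptions the example string yields the years 1134 to 1166, flagged as approximate. The verbatim label is carried through so that the discrete values never replace the original wording.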

[7] http://opencontext.org
[8] http://ariadne-infrastructure.eu
[9] https://finds.org.uk

Figure 4. Part of the interface for editing period definitions. Labels for temporal extent boundaries are taken verbatim from the source, entered as free text, and automatically parsed into ISO 8601 year representations. Labels for spatial coverage are entered as free text, and using an autocompletion interface the user can specify the modern-day administrative units (e.g. nation-states) that approximate this spatial coverage.

As more projects begin to integrate PeriodO identifiers for time periods, we hope to gather information on their citation and use. This would include both studying the historical use of attributed period definitions as well as tracking the citation of PeriodO period identifiers going forward. Such a study would allow us to observe how periods come into circulation and fall out of favor. Tracing the connections fostered by use of our gazetteer would demonstrate the potential benefits of a linked data approach in the humanities.

We are also in the process of reaching out to period-defining communities beyond classical archaeology and ancient history. We expect that this will require some extensions of and revisions to the current PeriodO data model. First, as we begin to collect definitions of periods closer to the present, we expect to extend our model of temporal extent to allow for more fine-grained interval boundaries than years. This will require a unit of representation that allows comparisons between intervals defined at different levels of granularity. (The approach based on Julian Days, described in Table 1, may be useful for this.) Second, as we begin to include more non-Western period definitions, we will need to ensure that we can still map years to ISO 8601 representations. At the very least, this will require extending the temporal expression parser, and it may require changes to the data model as well, for example to state explicitly the calendar system used by the original authors. Finally, as more historians begin publishing their work as datasets or software, we may begin to encounter periods defined not in natural language but using some formalism, such as the curves proposed by Kauppinen et al. (2010). These will require us to find a way of including these formalisms directly in our definitions.
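To make the idea of a common unit for mixed granularities concrete, the sketch below converts proleptic Gregorian dates to Julian Day Numbers, so that a year-level interval and a day-level interval can be compared as integer ranges. It uses a standard arithmetic conversion and is not taken from the approach the paper refers to; the function names are hypothetical, and years use astronomical numbering (0 is 1 BC).

```python
def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a proleptic Gregorian date.

    Intended for years later than -4800, which keeps the shifted
    year below non-negative so the floor divisions behave as expected.
    """
    a = (14 - month) // 12   # 1 for January and February, otherwise 0
    y = year + 4800 - a      # shift so the computational year starts in March
    m = month + 12 * a - 3   # March = 0, ..., February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045


def year_interval(year: int) -> tuple[int, int]:
    """A whole year as an inclusive range of Julian Day Numbers."""
    return gregorian_to_jdn(year, 1, 1), gregorian_to_jdn(year, 12, 31)


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True when two inclusive day ranges share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


# A year-granularity interval and a day-granularity interval, now comparable.
whole_year = year_interval(1582)
single_day = (gregorian_to_jdn(1582, 10, 15),) * 2
print(overlaps(whole_year, single_day))
```

Because every boundary becomes a single integer, intervals defined by year, month, or day can share one comparison routine, at the cost of committing to a specific first and last day for coarser labels.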

As scholars of all disciplines continue to integrate computational methods into their work, the need to preserve provenance will only become more important. This is as true in the humanities and social sciences as it is in the natural sciences. Nanopublication is a useful way to locate the production of "data" within a wider scholarly context. In this way, it echoes old ideas about hypertext which were concerned with relations of provenance, authorship, and attribution (Nelson, 1999). The PeriodO period gazetteer shows that this approach is relevant and feasible even to fields outside of the experimental, observable