Implementing Metadata that Guide Digital Preservation Services

Effective digital preservation depends on a set of preservation services that work together to ensure that digital objects can be preserved for the long term. These services need digital preservation metadata, in particular descriptions of the properties that digital objects may have and descriptions of the requirements that guide digital preservation services. This paper analyzes how these services interact and use these metadata, and develops a data dictionary to support them.

Angela Dappert and Adam Farquhar

This paper is based on the paper given by the authors at iPRES 2009; received January 2010, published March 2011. The International Journal of Digital Curation, Issue 1, Volume 6, 2011. ISSN: 1746-8256. The IJDC is published by UKOLN at the University of Bath and is a publication of the Digital Curation Centre.


Introduction
Effective digital preservation requires a set of preservation services that work together to ensure that digital objects can be kept alive for the long term. In order to work together, these services need digital preservation metadata, such as descriptions of the properties that digital objects may have and descriptions of the requirements that guide digital preservation services. This paper analyzes how these services interact and use this metadata. From this analysis, it develops a data dictionary to support them.

Related Work
Digital preservation metadata is the information that is essential to ensure long-term accessibility of digital resources. Analyses of the goals of long-term digital preservation have led to a solid understanding of the types of metadata that are needed. Good overviews are provided in Caplan [3] and Lavoie [13]. In 2002, the Reference Model for an Open Archival Information System (OAIS) [4] provided a framework to unify the concepts and terminology in the community. Its information model [19] defines categories for preservation metadata. In 2005, the Preservation Metadata Implementation Strategies (PREMIS) data dictionary consolidated several earlier efforts [e.g. 5,14,17,18] to produce conceptual models and concrete metadata dictionaries for implementers of digital preservation services. Now in its second version [20], it has been widely accepted and plays a key role in creating coherence in the digital preservation metadata community. PREMIS provides a foundation to support interoperability across systems and organizations. Many of the entries in today's data dictionaries are, however, still vague. They await increased practical experience to establish the proper level of granularity. They also tend to focus on statically recording characteristics and events rather than on dynamically supporting preservation processes.

Contributions
This paper draws on the practical experience gained in Preservation and Long-term Access through Networked Services (Planets) [10], a four-year project co-funded by the European Union to address core digital preservation challenges. It analyzes how preservation services interact and use preservation metadata. From this, it derives the information needed to capture key preservation metadata elements, such as property, characteristic, and requirement. Finally, it develops a data dictionary to support the analysis. The approach handles conflicting values from multiple sources. It also supports dynamic preservation processes, in addition to static recording of characteristics and events. It is based on a conceptual model of digital preservation that is theoretically and empirically founded [7,8]. The model has consequences for implementations of preservation metadata dictionaries, property registries, and preservation services.

Properties, Values, Characteristics and Requirements
In order to write with a reasonable level of precision, we need to introduce a basic vocabulary [6]:
• Entity: anything whatsoever.
• Class: a set of entities. Each of the entities in a class is said to be an instance of the class.
• Individual: entities that are not classes are referred to as individuals.
• Property: an individual that names a relationship.
• Characteristic: a property / value pair associated with an entity. The value is an entity.
• Facet: a property / value pair associated with a characteristic. The value is an entity.
• Constraint: a Boolean condition involving expressions on entities.
• Requirement: a constraint in a specific context.
Unless otherwise specified, a characteristic is directly associated with entities. Furthermore, we say that a property applies to classes if it can be meaningfully associated with some instances of these classes. We can use this language in the domain of digital objects and preservation. For example, file is a class; f1.txt is an instance of the class file; fileSize is a property; the property fileSize applies to file; the file f1.txt has the characteristic fileSize = 131342.
The constraint language can be used to express richer relationships. For example, suppose a is a bitPreservationAction, fIn is the initial file, and fOut is the result of applying action a to fIn; then the constraint fileSize(fIn) = fileSize(fOut) should hold. Important additional information about a characteristic, such as how a value is encoded, the unit of measure, or the algorithm or tool used to compute it, can be specified using facets.
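The vocabulary above can be illustrated with a minimal Python sketch. The class and function names (Property, Characteristic, Entity, bit_preservation_constraint), the second file name, and the "unit" facet with the value "byte" are assumptions made for this illustration; they are not part of the data dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Property:
    """An individual that names a relationship, e.g. fileSize."""
    name: str


@dataclass
class Characteristic:
    """A property/value pair; facets carry extra detail such as the unit."""
    prop: Property
    value: Any
    facets: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Entity:
    """An instance of a class (here: a file) with its characteristics."""
    class_name: str
    identifier: str
    characteristics: Dict[str, Characteristic] = field(default_factory=dict)

    def add(self, characteristic: Characteristic) -> None:
        self.characteristics[characteristic.prop.name] = characteristic

    def value_of(self, prop: Property) -> Any:
        return self.characteristics[prop.name].value


FILE_SIZE = Property("fileSize")


def bit_preservation_constraint(f_in: Entity, f_out: Entity) -> bool:
    """Constraint from the text: fileSize(fIn) = fileSize(fOut)."""
    return f_in.value_of(FILE_SIZE) == f_out.value_of(FILE_SIZE)


# f1.txt is an instance of the class file with fileSize = 131342.
f1 = Entity("file", "f1.txt")
f1.add(Characteristic(FILE_SIZE, 131342, facets={"unit": "byte"}))  # unit is assumed

# Hypothetical output of a bit preservation action applied to f1.txt.
f1_copy = Entity("file", "f1-copy.txt")
f1_copy.add(Characteristic(FILE_SIZE, 131342, facets={"unit": "byte"}))

print(bit_preservation_constraint(f1, f1_copy))  # True
```

A requirement would then be this constraint evaluated in a specific context, for instance for every output of a given bitPreservationAction.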
The core classes in the digital preservation domain are preservationObject, preservationAction and environment. The preservationObject concept corresponds to those objects in need of preservation. In our conceptual model [7,8,9] it has the subclasses bitstreams (including bytestreams and files), representations of logical objects consisting of representation bitstreams that are needed to create a single rendition of a logical object, and logical objects such as intellectualEntities and components. An intellectualEntity is a distinct intellectual or artistic creation, a set of content that is considered a single intellectual unit for purposes of management and description. Finer-grained components of an intellectualEntity are needed to characterize its parts. The preservationAction concept corresponds to actions taken by custodians of digital content to mitigate the risks that they identify.
The environment concept corresponds to hardware and software environments, the community, budgetary factors, the legal system, and other internal and external factors. An environment or sub-environment can be associated with a preservationObject or preservationAction. Figure 1 illustrates the roles that properties, values, characteristics and requirements (represented by ovals) play in preservation services (represented by boxes). By analyzing the specific roles that they play in these services, we can derive additional requirements for our data dictionary. This will be discussed in the following sections.

Uses of Properties and Controlled Vocabulary
Properties and controlled vocabulary can be captured in registries so that they can be referred to in other services. Alternatively, they can be defined locally for local use in a system. File format registries, such as PRONOM [15] or UDFR [22], can associate file formats with their applicable properties. Characteristics extraction languages, such as XCEL [21], additionally describe how values for these properties can be extracted from files in a given format. Preservation metadata dictionaries, such as PREMIS [20], define common preservation metadata elements to describe properties of preservation objects or environments. Controlled vocabulary registries, such as the planned Authorities and Vocabularies service of the Library of Congress, capture these properties' permissible values (Figure 2). We can use this information to
• link a format to characterization services that can determine values for its applicable properties, for example, a service to determine the fonts used in a .doc file (footnote 1) (Figures 1A, 3).
• create a testbed service that measures the degree to which applicable properties are preserved by preservation services, for example, measure the degree to which a service preserves imageWidth by evaluating it on many objects. In addition to the service characteristics (e.g. preservesImageWidth = "no") it could capture the degree to which or under what condition this characteristic holds (Figures 1A, 3).
• enable metadata storage services to refer to properties unambiguously and ensure interoperability and exchange across institutions and systems (Figure 1B).
• identify properties that are shared across file formats and can therefore be preserved by a migration between them (Figure 2).
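The last use in the list above, identifying properties shared across file formats, amounts to a set intersection over registry entries. The following Python sketch assumes a hypothetical in-memory registry; the format keys and most property names are illustrative and are not taken from PRONOM, UDFR or any other real registry.

```python
from typing import Dict, Set

# Hypothetical registry content: format identifier -> applicable properties.
APPLICABLE_PROPERTIES: Dict[str, Set[str]] = {
    "tiff": {"imageWidth", "imageHeight", "compressionScheme"},
    "png": {"imageWidth", "imageHeight", "interlaceMethod"},
    "doc": {"fontSize", "fonts"},
}


def shared_properties(source_format: str, target_format: str) -> Set[str]:
    """Return the properties applicable to both formats.

    These are the candidates that a migration between the two formats
    could preserve; properties outside the intersection have no
    counterpart in the other format.
    """
    return APPLICABLE_PROPERTIES[source_format] & APPLICABLE_PROPERTIES[target_format]


print(sorted(shared_properties("tiff", "png")))  # ['imageHeight', 'imageWidth']
```

A real registry would key formats by unique identifiers for exact format versions rather than by file extensions.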

Characterization
Characterization services determine the characteristics of preservation objects. Characteristics are property/value pairs. They are used to describe preservation objects, environments, and preservation actions. In particular,
• characteristics can be extracted automatically by characterization tools [11,16,21] or assigned manually (Figure 3).
• characteristics of preservation services can be determined experimentally in preservation testbeds [e.g. 1] (Figure 3).
• characteristics may be stored in metadata storage services or produced on demand (Figure 1C).

Footnote 1: We refer to file formats via common file extensions as a shorthand for improved readability. A precise statement requires a unique identifier corresponding to an exact version of the format.

Business Modelling
Business modelling results in the formulation of requirements from properties and controlled vocabulary ( Figure 1D). Requirements reflect the stakeholders' values, goals and constraints with regard to objects and guide preservation services.

Figure 4: Requirements
Requirements may be captured in preservation guiding documents, such as policy, strategy or business documents. They may also be part of the preservation metadata captured in metadata storage services, which documents the constraints that have been or should be applied to specific preservation objects (see Figure 1C). The PREMIS data dictionary for preservation metadata [20] accommodates recording "significant properties", which are a form of preservation guiding requirement. Requirements may also be captured in reusable, customizable user profiles which describe the requirements of a default designated community (Figure 4).

Uses of Characteristics and Requirements
Optional pre-selection services (Figure 1E) may provide an optimization step which rules out implausible preservation actions. They analyze requirements to eliminate actions that can be determined from the outset to violate them, given the characteristics in a given context. Knowledge about the characteristics of preservation services, which has been obtained in testbed services, is particularly helpful in this step.
Primarily, requirements guide actions, such as preservation monitoring, preservation planning and preservation execution services (see Figure 5). Preservation monitoring services determine whether risk-specifying requirements are violated and, therefore, preservation risks exist. A preservation monitoring process should trigger the preservation planning process once this happens. Using a sample data set, preservation planning services (e.g. Plato [2]) determine the best choice of preservation service to mitigate this preservation risk, with respect to preservation guiding requirements. The preservation execution service itself uses the requirements to evaluate and validate each preservation action's output.
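The pre-selection step can be sketched as a filter over candidate actions. In this Python illustration, the action names and the table of service characteristics are hypothetical; only the characteristic preservesImageWidth and its value "no" come from the text. The sketch also assumes that an action with no recorded characteristic cannot be ruled out and is therefore kept.

```python
from typing import Dict, List

# Hypothetical service characteristics, as a testbed service might record them.
SERVICE_CHARACTERISTICS: Dict[str, Dict[str, str]] = {
    "migrateToFormatA": {"preservesImageWidth": "yes"},
    "migrateToFormatB": {"preservesImageWidth": "no"},
}


def preselect(candidates: List[str], required: List[str]) -> List[str]:
    """Drop actions that are known from the outset to violate a requirement.

    `required` lists service characteristics that must not be "no".
    Actions without a recorded value are kept, because their violation
    cannot be determined in advance.
    """
    return [
        action
        for action in candidates
        if all(
            SERVICE_CHARACTERISTICS.get(action, {}).get(name) != "no"
            for name in required
        )
    ]


remaining = preselect(
    ["migrateToFormatA", "migrateToFormatB", "untestedAction"],
    ["preservesImageWidth"],
)
print(remaining)  # ['migrateToFormatA', 'untestedAction']
```

The remaining actions would then be passed to preservation planning for a full evaluation against the requirements.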
Once an action, such as preservation monitoring, preservation planning or preservation execution, has been chosen and executed, it is validated in a requirements evaluation step. Requirement evaluators (e.g. the XCDL comparator [21]) determine the degree to which characteristics of the preservation objects, preservation actions and environments before, during and after actions comply with requirements. The output is either an assessment of the presence and severity of a preservation risk, or a measure of the degree of compliance of an action with the set of requirements (Figures 1F and 5). Requirements can also serve as explicit provenance information. A metadata storage service may document the provenance of a repository's objects. For each object, it may record the preservation actions that created it and the set of requirements that applied at the time. It can also store the object's degree of compliance with respect to each requirement in the requirements set, especially its significant characteristics. Characteristics that are not referenced by any requirement may, however, be lost during a preservation action; it is not, in general, possible to record their loss, as they cannot be listed exhaustively (Figure 1G).
Actions can create new preservation objects and environments. Their characteristics may differ from those of the input preservation objects and environments (Figure 1H). Some requirements may articulate constraints on the relationship between preservation action input and output.

Observations for Properties

Observation 1:
Many properties are applicable to only a subset of objects. For example, the property fontSize is applicable to formats which may contain text; it would not be applicable to an audio format (footnote 2). In order to achieve a normalized representation, we link each property to the classes to which it applies (see appliesTo in the data dictionary), rather than directly to file formats. Examples include bytestream, representation, intellectualEntity (e.g. eBook, soundRecording), component (e.g. textComponent, tableOfContents), preservationAction or environment (e.g. legalEnvironment, operatingSystem). This approach makes it easy to express that the fontSize property applies to textComponent objects. Figure 6 illustrates how properties can then be mapped to subclasses of component and, in turn, to file formats.
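Linking properties to classes rather than to file formats can be sketched as follows. The class names preservationObject, component, textComponent and tableOfContents and the property fontSize come from the text; the classes footnoteText and soundComponent, the parent links, and the mapping from formats to components are hypothetical and serve only to make the Python example runnable.

```python
from typing import Dict, Optional, Set

# Fragment of a class hierarchy: class -> parent class (None for a root).
PARENT: Dict[str, Optional[str]] = {
    "preservationObject": None,
    "component": "preservationObject",
    "textComponent": "component",
    "tableOfContents": "component",
    "footnoteText": "textComponent",  # hypothetical subclass
    "soundComponent": "component",    # hypothetical subclass
}

# appliesTo: property -> classes it can be meaningfully associated with.
APPLIES_TO: Dict[str, Set[str]] = {"fontSize": {"textComponent"}}

# Hypothetical mapping: file format -> component classes it may contain.
FORMAT_COMPONENTS: Dict[str, Set[str]] = {
    "doc": {"textComponent", "tableOfContents"},
    "wav": {"soundComponent"},
}


def is_subclass(cls: str, ancestor: str) -> bool:
    """True if cls equals ancestor or inherits from it."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def applies(prop: str, cls: str) -> bool:
    """True if the property applies to the class or to one of its ancestors."""
    return any(is_subclass(cls, target) for target in APPLIES_TO.get(prop, set()))


def applies_to_format(prop: str, fmt: str) -> bool:
    """Derive format applicability from the components a format may contain."""
    return any(applies(prop, cls) for cls in FORMAT_COMPONENTS.get(fmt, set()))


print(applies_to_format("fontSize", "doc"))  # True
print(applies_to_format("fontSize", "wav"))  # False
```

With this design, adding a new text-bearing format only requires listing its component classes; the property definitions themselves stay unchanged.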

Observation 2:
Properties sometimes refer to a combination of preservation objects, environments, or actions. Consider the relative size of two images, the absolute distance of a line from the text, and the metrics describing column layout. These all refer to several objects. The language that we use to define properties must be expressive enough to capture this.

Observation 3:
Properties are related to each other, and their relationships have to be modelled explicitly. For example, duration can be calculated from dateTimeRange. Furthermore, many file formats have similar, but not identical, properties. Therefore, the language that we use to define properties must be able to capture the relationships between them and specify how to compare or convert them. Figure 7 illustrates this.
Footnote 2: The association of properties with digital object types of files is discussed in the Planets testbed [12]. We are refining this to the type of a component of the digital object, since a logical object might well contain, for example, text, sound, and image components together.

Observation 4:
In many cases, it is useful to define one property in terms of others. For example, the aspectRatio of an image might be defined as imageWidth / imageHeight. As a result, it is essential to record how such properties are defined and derived in order to ensure consistency.
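One way to record how a derived property is defined is to store, next to its name, the properties it depends on and the function that combines them. The Python sketch below does this for aspectRatio; the DERIVATIONS table and the derive function are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Derived property -> (names of input properties, function over their values).
Derivation = Tuple[Tuple[str, ...], Callable[..., float]]

DERIVATIONS: Dict[str, Derivation] = {
    # Definition from the text: aspectRatio = imageWidth / imageHeight.
    "aspectRatio": (("imageWidth", "imageHeight"), lambda width, height: width / height),
}


def derive(prop: str, values: Dict[str, float]) -> float:
    """Compute a derived property from the recorded values of its inputs."""
    inputs, func = DERIVATIONS[prop]
    return func(*(values[name] for name in inputs))


print(derive("aspectRatio", {"imageWidth": 1600, "imageHeight": 800}))  # 2.0
```

Because the definition is stored explicitly, two services that report aspectRatio can be checked against the same formula.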

Observation 5:
For each property, it is essential to specify the tool or algorithm that can be used to determine a value and the types of sources from which they can be obtained. We refer to this as the value origin. Values originate when they are
• Assigned manually (stored or on demand). When values are assigned manually they often need to comply with conventions, such as cataloguing rules, standards, controlled vocabularies, etc. This should be specified as part of the value origin.
• Assigned automatically as a side-effect of a service (stored). Regular internal operations, such as ingest of digital objects, purchase of hardware and software, decommissioning of equipment, hiring, training and laying-off of staff, getting and spending money, or executing preservation actions, all change characteristics of preservation objects or their environments. Equally, external operations, such as introducing a new file format or a new preservation service, change characteristics. These value changes need to be captured if they serve as a basis for making preservation decisions. E.g. the contentType of objects in an eJournal ingest system is always set to "eJournal" upon ingest. E.g. the budget of an institution may be set during the execution of a preservation action: preservationBudgetSize := preservationBudgetSize - preservationActionCost.
• Extracted (stored or on demand). The original source of derived values may be a bitstream or the set of representation bitstreams of a representation of a logical object. Values are extracted using a tool which implements an algorithm. The value origin should specify the algorithms and tools used. Examples: bytestreamSize may be extracted from the bytestream object. colorFidelity can be measured by averageColor or by histogramShape. wordCount can count hyphenated words as one or as multiple words. MIME type can be extracted using the JHOVE format characterization tool.
• Inferred (stored or on demand). Values may be inherited in the preservation object hierarchy, derived through a function from values of other properties, or logically inferred. The value origin should specify the algorithm that can be used to infer it. E.g. the aspectRatio of an image may be imageWidth / imageHeight.

Observations for Characteristics

Observation 6:
Values for characteristics may be stored or derived on demand. On-demand derivation can take place through characterization services or through retrieval from registries or inventories. Whether they are stored or derived needs to be recorded, since different preservation services will be chosen based on this property.

Observation 7:
There may be multiple values for a property of an object, since there may be several representations (sources) which form the basis of measurement for the value and several different measurement techniques (technique) and tools (creation agent). Characteristics and requirements need to specify which value origin is meant.
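The wordCount example mentioned earlier shows why a value needs its origin. In the following Python sketch, the ValueOrigin fields mirror the three aspects named in this observation (source, technique, creation agent); the technique labels, the tool name exampleCounter and the sample text are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ValueOrigin:
    source: str          # representation that was measured
    technique: str       # measurement technique
    creation_agent: str  # tool that produced the value


def word_count(text: str, split_hyphens: bool) -> int:
    """Count words, treating hyphenated words as one word or as several."""
    if split_hyphens:
        text = text.replace("-", " ")
    return len(text.split())


text = "long-term digital preservation"

# Two values for the same property of the same object, keyed by origin.
word_counts: Dict[ValueOrigin, int] = {
    ValueOrigin("f1.txt", "hyphenatedAsOne", "exampleCounter"): word_count(text, False),
    ValueOrigin("f1.txt", "hyphenatedAsMultiple", "exampleCounter"): word_count(text, True),
}

for origin, value in word_counts.items():
    print(origin.technique, value)
```

Neither value is wrong; a requirement that refers to wordCount has to name the origin it means, otherwise a comparison before and after a preservation action may mix the two techniques.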

Observations for Requirements

Observation 8:
In many cases, a stakeholder may express requirements that depend on additional conditions, for example: if environmentType = "preservation", then image resolution must be preserved. As a result, the language that we use to define requirements must be expressive enough to include conditionals. Requirements can be expressed as constraints, for instance in OCL [23] or in other informal or formal languages.
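A conditional requirement of this kind can be sketched as a pair of predicates: a condition that decides whether the requirement is relevant and a constraint that must then hold. The Python example below is an illustration only and is not OCL; the context keys imageResolutionIn and imageResolutionOut and the environment type "access" are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Context = Dict[str, Any]


@dataclass
class Requirement:
    condition: Callable[[Context], bool]   # when does the requirement apply?
    constraint: Callable[[Context], bool]  # what must hold if it applies?

    def is_satisfied(self, context: Context) -> bool:
        # If the condition does not hold, the requirement is vacuously satisfied.
        return (not self.condition(context)) or self.constraint(context)


# "If environmentType = "preservation" then image resolution must be preserved."
preserve_resolution = Requirement(
    condition=lambda ctx: ctx["environmentType"] == "preservation",
    constraint=lambda ctx: ctx["imageResolutionIn"] == ctx["imageResolutionOut"],
)

degraded = {"environmentType": "preservation", "imageResolutionIn": 600, "imageResolutionOut": 300}
access_copy = {"environmentType": "access", "imageResolutionIn": 600, "imageResolutionOut": 300}

print(preserve_resolution.is_satisfied(degraded))     # False
print(preserve_resolution.is_satisfied(access_copy))  # True
```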

Observation 9:
Not all requirements are equally important and not all have to be precisely satisfied. To accommodate this, it is useful for a stakeholder to add an importance factor, as a measure of relative importance, and potentially a tolerance factor, as a measure of the tolerable degree of deviation from the specified value, with each requirement. For example, preserving the number of lines on a page might be less important than preserving the number of pages. During requirements evaluation of a preservation action the importance and tolerance factors can be combined into a weighted measure.
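The text does not prescribe how importance and tolerance factors are combined. The following Python sketch shows one possible weighted measure under stated assumptions: a requirement counts as met when the observed value deviates from the expected value by no more than its tolerance, and the overall score is the importance-weighted share of met requirements. The class name, field names and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluatedRequirement:
    name: str
    expected: float
    observed: float
    importance: float  # relative weight, greater than zero
    tolerance: float   # tolerable absolute deviation from the expected value

    def complies(self) -> bool:
        return abs(self.observed - self.expected) <= self.tolerance


def weighted_compliance(requirements: List[EvaluatedRequirement]) -> float:
    """Importance-weighted share of requirements met, between 0.0 and 1.0."""
    total = sum(r.importance for r in requirements)
    if total <= 0:
        raise ValueError("at least one requirement with positive importance is needed")
    met = sum(r.importance for r in requirements if r.complies())
    return met / total


evaluated = [
    # Preserving the number of pages is treated as more important ...
    EvaluatedRequirement("pageCount", expected=10, observed=10, importance=3, tolerance=0),
    # ... than preserving the number of lines on a page.
    EvaluatedRequirement("linesPerPage", expected=40, observed=42, importance=1, tolerance=1),
]

print(weighted_compliance(evaluated))  # 0.75
```

A graded score per requirement, for example one that decreases with the size of the deviation, would fit the same structure; the pass or fail rule is used here only to keep the arithmetic easy to follow.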

CONCEPTUAL DETAILS
In this section, we build on the preceding analysis to specify the data model more completely. For each concept, we describe its key attributes and basic information such as its data type and whether it is mandatory or repeatable. We also introduce supplementary concepts such as ValueOrigin and Unit that are needed to represent properties. This data dictionary is informed by analysis undertaken in the Planets project. It will only be partially implemented during the project, but it serves as a basis for further development and implementation.

Property
Definition: An abstract attribute, trait or peculiarity suitable for describing a preservation object, action or environment.
• propertyIdentifier (1...1): a unique identifier of the Property (data constraint: Property ID).
• propertyName (0...n): a meaningful human readable name (data constraint: string). It is repeatable in order to allow for synonyms. Different Properties may have the same names, but must have unique identifiers.
• propertyDescription (0...n): a meaningful human readable description of the Property (data constraint: Description).
• appliesTo (1...1): a list of Classes. This property can be meaningfully associated with Instances of these Classes (data constraint: vector of PreservationObject, Environment or PreservationAction subclasses). The vocabulary of subclasses is extensible and includes many subclasses not shown in this paper. See Dappert et al. [7] for a sample vocabulary.
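As a sketch of how such a dictionary entry might be represented, the Property concept can be written as a Python dataclass whose fields follow the cardinalities listed above. The class name PropertyEntry, the identifier value and the validation rules are assumptions; in particular, the text does not state whether the appliesTo list may be empty, and this sketch assumes it may not.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PropertyEntry:
    property_identifier: str  # (1...1) unique identifier
    applies_to: List[str]     # (1...1) one list of class names
    property_names: List[str] = field(default_factory=list)         # (0...n) synonyms allowed
    property_descriptions: List[str] = field(default_factory=list)  # (0...n)

    def __post_init__(self) -> None:
        if not self.property_identifier:
            raise ValueError("propertyIdentifier is mandatory")
        if not self.applies_to:
            # Assumption of this sketch: the list must name at least one class.
            raise ValueError("appliesTo must name at least one class")


font_size = PropertyEntry(
    property_identifier="prop:fontSize",  # hypothetical identifier scheme
    applies_to=["textComponent"],
    property_names=["fontSize"],
    property_descriptions=["Size of the font used in a text component."],
)

print(font_size.property_identifier, font_size.applies_to)
```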

Value Origin
The ValueOrigin concept provides a way to specify where a specific Value comes from or how it can be obtained. There can be multiple ways of obtaining the Value of a Property that do not produce conflicting results. For example, they might be measured from different sources, measured by

Characteristic
Definition: A Characteristic of an Entity is the concrete Value which this Entity has for an abstract Property in a defined context.
The requirementSpecification can be expressed informally or implemented using a constraint language such as OCL [23]. In the latter case, each pre- and postcondition is an expression that can be evaluated against the Characteristic Values specified in the Requirement's context. In some implementations, these will evaluate to simple Boolean values (true or false). Other implementations will allow for a tolerance. In this case, the requirementImportanceFactor and tolerance can be used to compute a weighted measure of compliance with the Requirement.

CONCLUSION
This article has presented a data dictionary for key digital preservation metadata concepts. The underlying conceptual model supports dynamic preservation processes, rather than only the static recording of characteristics and events. The data dictionary has been motivated by observations about its intended uses and the interactions between preservation services. The model has consequences for implementations of preservation metadata dictionaries, property registries, and preservation services. This work has been conducted within the larger context of defining a conceptual model and specific vocabulary for supporting preservation services within the Planets project and is theoretically and empirically founded [7,8].