Mapping ISO 19115-1 geographic metadata standards to CodeMeta

The CodeMeta Project recently proposed a vocabulary for software metadata. ISO Technical Committee 211 has published a set of metadata standards for geographic data and many kinds of related resources, including software. In order for ISO metadata creators and users to take advantage of the CodeMeta recommendations, a mapping from ISO elements to the CodeMeta vocabulary must exist. This mapping is complicated by differences in the approaches used by ISO and CodeMeta, primarily a difference between hard and soft typing of metadata elements. These differences are described in detail and a mapping is proposed that includes sixty-four of the sixty-eight CodeMeta V2 terms. The CodeMeta terms have also been mapped to dialects used by twenty-one software repositories, registries and archives. The average number of terms mapped in these cases is 11.2. The disparity between these numbers reflects the fact that many of the dialects that have been mapped to CodeMeta are focused on citation or dependency identification and management while ISO and CodeMeta share additional targets that include access, use, and understanding. Addressing this broader set of use cases requires more metadata elements.


INTRODUCTION
The CodeMeta Project (CodeMeta Project, 2018) recently proposed (1) a vocabulary for documenting software, (2) mappings between metadata fields used by a broad range of software repositories, registries and archives (CodeMeta Crosswalk, 2018), and (3) developed software with the purpose of facilitating automatic translation between different representations of software metadata. The vocabulary was designed to support several different software use cases, including citation, discovery, use and, to some degree, understanding.
The ISO Technical Committee 211 has developed generic metadata standards that are widely used for geographic data of many kinds. These standards were also designed as a foundation that can be built on to document many kinds of things and support many use cases (Habermann, 2018a;Habermann, 2018b;Habermann, 2018c).
This paper describes a mapping between the conceptual model that underlies ISO metadata (ISO 19115-1, 2014) and CodeMeta with the goal of facilitating the creation of CodeMeta-compliant descriptions of software that is documented using the ISO standards. The communities that developed these two metadata dialects share the important goal of comprehensive standards that address multiple use cases for many disciplines. Both groups pursue this goal by developing consensus, but the details of the processes used to develop their standards differ. ISO TC211 represents a traditional International standards body with well-defined processes, publication methods and a business model that includes costs to users for standards documents. CodeMeta represents a community of volunteer practitioners with an initial set of proposed conventions open on the Web and an invitation for adoption, experimentation and evolution.
In addition to these process differences, there are also differences between the structures and implementations of these two models. These are described below with the mappings following.

Dialect coverage and scope
Mapping metadata for software between different schemas and dialects is an important technical goal of CodeMeta. This goal is supported using a crosswalk file that is maintained and contributed to in the CodeMeta Git repository (CodeMeta Crosswalk, 2018). This file lists the CodeMeta terms along with equivalents in twenty-one dialects. This crosswalk is the basis for translating content between these dialects.
A similar situation occurs in many in science communities that are trying to support multiple use cases, i.e., document, share, and trust, for datasets using multiple metadata dialects (see Gordon & Habermann, 2018). The concept of ''Dialect Coverage'' has come up in those studies as the amount (%) of the concepts in a particular recommendation that a dialect includes. In the CodeMeta case, this is the number of CodeMeta concepts that can be represented in the dialects listed in the crosswalk file (Crosswalk Data, 2018). Figure 1 shows this count for each of the twenty-one dialects. These counts were determined from the crosswalk file by counting the number of cells with content in each column using a spreadsheet count() function. Both versions of CodeMeta and ISO 19115-1 (2016) are included in this figure as well.
The data show that the ISO dialect covers very close to all (64/68) of the CodeMeta concepts and the other twenty-one dialects cover an average of 11.23/68, suggesting that CodeMeta is more similar to ISO than it is to other dialects that have done crosswalks. The difference between the ISO mapping and others is striking. It likely reflects the difference between the small number of metadata elements used for discovering and citing software (or data) and the larger number needed to be able to use it and trust it. In the current software citation landscape, this is the difference between the complete CodeMeta vocabulary, i.e., all metadata for code (over sixty items), and the FORCE11 Software Citation Guidelines, i.e., metadata for code citation (Smith, Katz & Niemeyer, 2016) which includes only ten items.
In some cases, multiple CodeMeta terms are mapped to the same ISO elements. Most of this ambiguity is related to differences in the two models and approaches that are discussed in detail below. ISO elements that are mapped to more than one CodeMeta term are identified with * in the crosswalk tables below.

Model characteristics
The ISO metadata standards are based on a UML model that is harmonized across all standards developed and managed by the committee. The model is built around classes Figure 1 Coverage of CodeMeta concepts in multiple dialects. The average number of CodeMeta concepts covered by twenty-one dialects is 11.2. The ISO dialect covers sixty-four of sixty-eight CodeMeta concepts.
Full-size DOI: 10.7717/peerjcs.174/ fig-1 and attributes that describe the structure of the standards and the relationships among objects. ISO 19115-1 (2014) includes thirteen top-level classes that provide details on identification, content, constraints, distribution, quality, usage, reference systems, spatial representation and several other areas. The ISO standard includes a scope element at the root of each record that gives the type of resource described by the metadata. The default scope is dataset, but other options include : aggregate, application, attribute, attributeType, collection, collectionHardware, collectionSession, coverage, dimensionGroup, document, feature, featureType, fieldSession, initiative, metadata, model, nonGeographicDataset, otherAggregate, platformSeries, product, productionSeries, propertyType, repository, sample, sensor, sensorSeries, series, service, software, tile, transferAggregate (see Habermann, 2018a;Habermann, 2018b;Habermann, 2018c). Mapping the CodeMeta vocabulary to the ISO standard is an initial step toward defining the content that could be included in ISO metadata records that describe software and applications, i.e., those where the scope is software.
The most commonly used representation of the ISO standards is XML (ISO 19115-1, 2016). ISO XPaths uniquely identify metadata content and follow the structure of the UML model, with levels in the XML alternating between objects (with types) and properties. This results in XML that is ''striped'' like the XML representation of the Resource Description Framework (RDF, 2018) (W3C, 2014), i.e., role/type/role/type/content. Types generally start with two uppercase letters (MD, CI, . . . ) that indicate the UML package that they are defined in (metadata, citation, . . . ) followed by an underscore (MD_, CI_, . . . ). Properties (termed roles in this discussion) are in lower camel case.
A significant benefit of the striped XML is that properties can be defined with abstract objects that can share properties while being instantiated with different types. For example, the ISO CI_Party object is abstract and includes name and contactInfo properties. It is extended and specialized by CI_Individual and CI_ Organisation objects which inherit name and contactInfo properties and add properties that are relevant for people and organizations, e.g., organizations can include individuals, logos, and position names. This approach also facilitates reuse by allowing standard objects (e.g., people, organizations, or citations) to be referenced using links rather than repetitive content (https://geo-ide.noaa.gov/wiki/index.php?title=ISO_Components).
Another benefit of this approach in ISO is the same as that in the schema.org casecommunities can extend object definitions when necessary and, in the ISO case, the resulting extended objects fit naturally into the ISO XML representation. This approach is similar to the schema extension model used in CodeMeta to add properties deemed important by the CodeMeta community to the more general SoftwareSourceCode schema that is also a specialization of the schema.CreativeWork schema.
The namespace for each element in the XML is identified using a standard namespace prefix (mdb, cit, . . . ). Asterisks are used in the XPaths to indicate locations where several objects can be used. For example, mdb:identificationInfo/*/ indicates that either mdb:MD_ DataIdentification or srv:SV_ServiceIdentification objects can occur in that location.
A simplified notation is introduced for paths through the UML conceptual model in this document that includes only the role names and no information that is specific to the XML representation. For example, the XPath /mdb:MD_Metadata/mdb:identificationInfo/ mri:MD_ DataIdentification/mri:resourceSpecificUsage/mri:MD_ Usage/mri:identifiedIssues/ cit:CI_ Citation/cit:onlineResource/cit:CI_OnlineResource/cit:linkage is replaced by the concept path: identificationInfo.resourceSpecificUsage.identifiedIssues.onlineResource.linkage. These simplified ''concept paths'' improve readability and emphasize equivalences between CodeMeta and ISO in the conceptual space. Specific XPaths can be constructed from these concept paths when necessary to implement translation of existing ISO content to CodeMeta representations. The reverse translation is not unique.
CodeMeta specifies a vocabulary rather than a structural model. It includes properties from several schema.org schemas listed in Table 1 along with the number of items from each schema (Crosswalk Data, 2018). These schemas exist in a schema.org hierarchy which is similar in many ways to the ISO structure. SoftwareApplication and SoftwareSourceCode schemas are both specializations of the Thing >CreativeWork schema. CodeMeta extends these schemas (in CodeMeta.SoftwareSourceCode) with several properties that lack clear equivalents in schema.org.

Hard types and soft types
All standards and vocabularies need to make choices between hard or soft typing of objects they are describing. Hard typing requires specific names for items and is the only choice available in implementations where names alone can be used to distinguish between items, e.g., CodeMeta. For example, if publication and revision dates are required for complete descriptions of a resource, hard typed representations would include two items: e.g., publicationDate and revisionDate. Soft Typing can be used in dialects which support item attributes as well as values, e.g., XML. In that case, these two dates would be represented with the same name (XML element) and distinguished by a type attribute: <datetype=''publication''>and <datetype=''revision''>.
The difference between these two approaches emerges as the dialects evolve. Hard types evolve by adding new elements to the underlying model, e.g., adding creationDate (or some other type of date) when it becomes apparent that it is needed, and unambiguous definitions of those elements. Soft types evolve by adding items to the shared vocabulary of date types, typically a codelist or thesaurus.
The critical difference between hard and soft types boils down to differences in governance models and change tolerance. In communities that use hard types, members must be tolerant to changes in the models and, typically, changes in tooling built on them. Communities that use soft typing must have mechanisms for sharing and evolving vocabularies, typically control bodies or rules.
The ISO model is soft-typed and the CodeMeta model is hard-typed. Table 2 lists some of the documentation concepts that illustrate the contrast between these approaches. The first row shows the differences in how dates are treated. The CodeMeta vocabulary includes four types of dates listed in the second column of Table 2. If other date types are required to describe software, maybe dateDeprecated for example, new terms would be found in schema.org or added to the vocabulary to address those needs. The ISO approach involves a single date concept and a codelist that includes sixteen options, shown in the third column of Table 2. That codelist is designed to be extended by communities with other needs without impacting the structure of the standard. In this example, the date type codelist already includes the term ''deprecated''.

Citations
Connecting users to resources is one of the most important roles of metadata. It is also one of the most ubiquitous. Several classes of citations are important. Citation to the resource being described in the metadata (Resource Citation). The role of these citations is to provide guidance on how the resource being described should be cited and there is only one of these in each metadata record. Citations to related resources (Related Resource Citation). These are generic references to some other resource and generally include information about the relationship between the resource being described and the related resource. See, for example, the RelatedIdentifier element in the DataCite metadata schema (DataCite Metadata Working Group, 2017) which includes relatedIdentifierType and relationType attributes as additional information. Citations to other, typically specific, resources (Specific Resource Citations). For example, the ISO object that describes data processing includes a citation in the role of softwareReference that specifically provides a reference to software used in the processing. Other examples of these citation types are included in the following discussion.

ISO citations
ISO 19115-1 includes all three types of citations: • The Resource Citation is unique and occurs at a specific location in the conceptual model: identificationInfo.citation (XPath = /mdb:MD_Metadata/mdb:identification Info/*/mri:citation/cit:CI_Citation).
• Specific Resource Citations occur in a number of locations in the ISO model as part of specific classes. For example, citations to additional documentation occur at identificationInfo.additionalDocumentation and citations to quality reports occur at dataQualityInfo.standaloneQualityReport.
All ISO Citations include elements of traditional citations to books or papers e.g., title, authors (people or organizations in many roles), dates (many types), series information, page numbers, etc., as well as identifiers (ISSN, ISBN, and other types) and URLs with titles, descriptions and types. The XPaths to these items from each ISO citation root are shown in Table 3.

CodeMeta citations
CodeMeta includes twenty-six terms that represent resources that are related to or support the use of the software being described. These terms have several different types (Text, URL, Text or URL, CreativeWork, CreativeWork or URL, Computer Language or text, . . . ). In the mappings below, these terms are mapped to the ISO citations. The specific types can be described by adding the paths in Table 3 to the concept or XPaths.

Distribution
Many of the distribution systems for geographic data described by ISO metadata include repositories (generally called archives or data centers) that manage and preserve data while providing on-going support for users. ISO metadata standards accommodate approaches to resource distribution with or without descriptions of repositories (termed distributors) and each repository can provide several URLs (transferOptions) for each resource. These onlineResources can have any of the functions included in the CI_OnLineFunctionCode codelist in Table 2. The most common online functions are download and information and these are used in the mappings to indicate direct access to the resource (function =download) or information about the resource (function =information).

Additional documentation
The CodeMeta vocabulary includes many items that are intended to help users use and understand the software described in the metadata. In the ISO standards, these items can be described in two ways: as associated resources (identificationInfo.associatedResource) or as additional documentation (identificationInfo.additionalDocumentation). I have chosen the later in these cases. In dialects without specific citations, e.g., Datacite, these would be referred to as relatedIdentifiers with appropriate relationTypes as the DataCite dialect is soft typed (DataCite Metadata Working Group, 2017).
One important goal of CodeMeta is to enable authors to cite software that is used to store, process, analyze, and visualize the data and model results that they use in their work. Increasing citations from the scientific literature is large part of this goal, but there are also significant opportunities to improve the completeness of dataset metadata by citing software. This is generally done as part of the provenance or lineage section of the metadata. The ISO standards provide several specific resource citations for citing software, including: • resourceLineage.processStep.processingInformation.algorithm.citation, • resourceLineage.processStep.processingInformation.softwareReference, and • resourceLineage.processStep.processingInformation.documentation.

Mappings
The mappings between CodeMeta and ISO are presented here in a series of tables that correspond to the source schema.org schemas used in order to provide some structure that may help clarify the relationships and improve understanding. The process of creating the mappings involved three steps: (1) obvious connections, i.e., description ->identificationInfo.abstract, or name ->identificationInfo.citation.title, (2) more complicated connections like those discussed above, and (3) matching intended types as closely as possible, i.e., CodeMeta terms that were intended to be identifiers were mapped to ISO identifier codes (identifier ->identificationInfo.citation.identifier.code) and those that were intended to be URLs were mapped to ISO linkages (downloadUrl ->distributionInfo.transferOptions.onLine[function ='download'].linkage). This process is, of course, more subjective than objective, and the resulting mappings reflect experience authoring the ISO standards and working with them in multiple contexts over the last decade.
The mappings include the property names, types, and descriptions from the CodeMeta vocabulary, conceptual paths for the ISO items (ISO 19115-1, 2014), and XPaths from the standard XML representation (ISO 19115-1, 2016). The conceptual paths are provided here in lieu of the ISO definitions for simplicity. The complete ISO conceptual model with definitions is available in an HTML view (ISO Conceptual Model, 2018). In some cases, multiple CodeMeta terms are mapped to single ISO elements, as described in the Additional Documentation section above. These cases are marked with * in the tables.

Schema:Person
The schema.Person schema provides a vocabulary for properties of people. In the ISO standards, people and organizations are both referred to as parties and names can be given as any combination of individual names, organization names, or positions. This mapping includes seven items listed in Table 4.

Schema:Thing
The schema.Thing schema provides a vocabulary for properties of the most generic type of item. In the context of CodeMeta, this item is the resource described by the metadata which is software. In ISO 19115-1 (2014), properties related to the identification of the resource being described are in the identificationInfo section and many of the properties are included in the citation to that resource. As described above, these properties (title, identifier, and link) are included in all citations in the ISO model. This mapping includes six items listed in Table 5.

Schema:Thing.CreativeWork
The Thing.CreativeWork schema provides a vocabulary for the most generic kind of creative work, including books, movies, photographs, software programs, etc. This mapping includes twenty-four items listed in Table 6.

Schema:Thing.CreativeWork.SoftwareSourceCode
The Thing.CreativeWork.SoftwareSourceCode schema provides a vocabulary for describing computer programming source code. This mapping includes four items listed in Table 7.

CodeMeta:SoftwareSourceCode
The CodeMeta:SoftwareSourceCode schema extends Thing.CreativeWork.SoftwareSourceCode with terms created by the CodeMeta Project. This mapping includes ten items listed in Table 8.   The Thing.CreativeWork.SoftwareApplication schema provides a vocabulary for describing a software application. This mapping includes fifteen items listed in Table 9.

CONCLUSIONS
The ISO metadata standards were originally developed by ISO Technical Committee 211 to serve as the standard and structured part of the documentation needed to discover, access, use, and understand datasets. The standards acknowledge that they are generic, and they include several mechanisms for extension to address specific needs of communities that build on the standards. The generic nature of these standards is reflected in the breadth of the codelist that can be used to describe the scope of a particular metadata record (see list in Model Characteristics Section above and Habermann, 2018a;Habermann, 2018b;Habermann, 2018c).
The CodeMeta project recently proposed over sixty terms that can be used in metadata for software. This recommendation provides a framework that provides insight into what ISO metadata for software might contain. The ISO metadata standards include several elements that directly cite software, e.g., the processingInformation.softwareReference or algorithm.citation elements, and the primary purpose of the mappings proposed here is to support ISO users that want to (1) express software metadata in a dialect that they are familiar with and (2) to facilitate translation of software metadata written using ISO standards to CodeMeta-compliant JSON-LD.
Note that the purpose is different than the primary purpose of the crosswalks proposed on the CodeMeta project site which is to facilitate automated translations between different JSON-LD vocabularies and RDF. Adding an ISO crosswalk to the CodeMeta framework was the original goal of this work, but the differences identified and described here are significant enough to make that impossible. This translation may be facilitated with an XSLT built specifically for that purpose, like the one created by Peroni, Lapeyre & Shotton, 2012, but that is beyond the scope of this initial comparison.
The process of creating a mapping between these two representations surfaced some differences that complicate the mapping. Some of these differences are related to hard and soft typing used in the two models and others are related to increased flexibility that is required in a generic standard like ISO for documenting citations, distribution channels, and related resources.
The differences in approach described here probably apply to many mappings from XML to RDF representations. Peroni, Lapeyre & Shotton, 2012 identified and discussed some similar challenges when mapping between JATS XML and SPAR Ontologies. They attributed these differences to ''differing philosophical viewpoints for XML and RDF'', then described two reasons that XML element names are ambiguous when used in isolation: attributes and hierarchical structure. They suggest that the JATS standard (and by implication all XML representations) is ''deliberately vague'' and that the hierarchical structure of XML is ''not formalized and implicitly lives outside the XML schema of the language''. It is certainly true that XML element names can be ambiguous without associated attributes and hierarchical structure but ignoring those methods of providing meaning in XML and then stating that elements alone are ambiguous does not contribute to understanding differences between how these two approaches describe resources.
I examine these differences in more detail and offer an explanation that is related to different approaches to defining types: XML can use attributes and structure to provide information about types and meaning while RDF must rely on unambiguous and shared definitions of terms. To use the case described in Peroni, Lapeyre & Shotton, 2012, the JATS element ''article'' is definitely ambiguous. That is why the definition of the element includes an attribute called article-type to clarify the type of the article being described. Details of the importance of attributes for semantics in XML is described in detail by Seligy (2018) along with potential problems that result from ignoring attribute values. Peroni, Lapeyre & Shotton (2012) also acknowledge the requirement for unambiguous definitions that are shared across communities. They state that ''A cornerstone of the Semantic Web is the use of open published ontologies to give precise and universally available definitions to terms, so that RDF statements, whatever else they are, are unambiguous in their meaning.'' While this may be a goal of ontology development efforts, many of the definitions currently used in CodeMeta and schema.org (shown in Tables 5-9) are unclear, non-unique, and ambiguous.
ISO mappings are proposed for sixty-four of the sixty-eight CodeMeta V2 terms. These mappings can be used to create CodeMeta-compliant metadata from existing stores of ISO metadata and to add CodeMeta compliant software citations in the future. This compares to an average of 11.2 mappings for other dialects included in the CodeMeta crosswalk. This disparity reflects the use cases targeted by the dialects. Many of the dialects that have been mapped to CodeMeta are focused on citation or dependency identification and management while ISO and CodeMeta share additional targets that include access, use, and understanding.