Knowledge graph construction and application in geosciences: A review

Knowledge graph (KG) is a topic of great interest to geoscientists as it can be deployed throughout the data life cycle in data-intensive geoscience studies. Nevertheless, compared with the large number of publications on machine learning applications in geosciences, summaries and reviews of geoscience KGs are still limited. The aim of this paper is to present a comprehensive review of KG construction and implementation in geosciences. It consists of four major parts: 1) concepts relevant to KG and approaches for KG construction, 2) KG application in data collection, curation, and service, 3) KG application in data analysis, and 4) challenges and trends of geoscience KG creation and application in the near future. For each of the first three parts, a list of concepts, exemplar studies, and best practices is summarized. Those summaries are synthesized in the challenge and trend analyses. As artificial intelligence and data science are thriving in geosciences, we hope this review of geoscience KGs can be of value to practitioners in data-intensive geoscience studies.


Introduction
Artificial intelligence (AI) has received increasing attention in geosciences in the past decade. In particular, for data-intensive geosciences there has been a significant growth of machine learning (ML) and deep learning (DL) applications in recent years (Lary et al., 2016; Bergen et al., 2019; Karpatne et al., 2019; Reichstein et al., 2019). Besides ML and DL, knowledge engineering, logic, and reasoning are also essential topics in AI (Russell and Norvig, 2021), among which the knowledge graph (KG) has risen as a unique subject. A KG is a graphical representation of structured knowledge from the real world, in which the nodes represent entities of interest and the edges represent relationships between those entities (Hogan et al., 2020). In a data life cycle (Wing, 2019), the work associated with KGs connects the upstream work of knowledge engineering and representation, the midstream work of data curation and integration, and the downstream work of data analysis and result communication. Very recently, Gutierrez and Sequeda (2021) reviewed the interweaving of data and knowledge since the advent of modern computing in the 1950s to reveal the historical roots of today's KGs. They suggested that both statistical and logical methods contribute to the convergent work of data science, and that next-generation scientists should be aware of KG developments in addition to the overwhelming number of ML and DL studies. Earlier publications in geoinformatics and geomathematics have likewise addressed the importance of machine-readable knowledge models in cyberinfrastructure (e.g., Loudon, 2000, 2009) and the flexible application of data-driven and knowledge-driven approaches in data analysis (e.g., Bonham-Carter, 1994; Carranza, 2009). However, compared with the many recent review papers on ML and DL in geosciences, there is a shortage of summaries and reviews of KGs in geosciences. 
Although there has been some progress in geoscience KG construction and application in the past decades, the entrance barrier to KG still seems high to many geoscientists, especially newcomers.
The history of KG can be traced back to ancient people's idea of representing knowledge in a diagrammatic form (Gutierrez and Sequeda, 2021). The Google Knowledge Graph released in 2012, together with similar ideas at Microsoft, Facebook, eBay, and IBM, significantly increased the visibility of KG as an AI approach to researchers and the general public (Noy et al., 2019b). Yet, for KG practitioners in geosciences, it is necessary to realize that KG is rooted in several other areas of computer science. At the 2019 U.S. Semantic Technologies Symposium (Durham, NC), there was an active discussion on the statement that "In the 1990s, we talked about vocabularies; in the 2000s, we talked about ontologies; and in the 2010s, we began to talk about knowledge graphs." There have been several initiatives on building vocabularies, ontologies and KGs in geosciences and applying them in research. The Commission for Geoinformation within the International Union of Geological Sciences (IUGS-CGI) is a facilitator of standardized geoscience vocabularies and schemas for geologic data (Asch and Jackson, 2006). Part of the IUGS-CGI outputs were adapted in the OneGeology, OneGeology-Europe and INSPIRE programs to harmonize geologic data from distributed sources (Laxton, 2017). Federal agencies in the U.S. such as USGS and NASA have also invested effort in KGs for geoscience data management and analysis (e.g., Zhang et al., 2016; USGS NCGMP, 2020). EarthCube, an NSF-initiated program, has led to much recent progress on geoscience vocabularies, ontologies and KGs (e.g., Richard et al., 2014; Gupta et al., 2015; Zhou et al., 2020). Two recent reports released by the World Wide Web Consortium (W3C) summarized the best practices for publishing data on the Web (Loscio et al., 2017; Tandy et al., 2017). Those best practices show a clear trend that KGs will take an essential role in better data services on the Web. 
It is also encouraging to see that a few examples from geosciences were included in the two reports.
Geoscience KG is an interdisciplinary subject. Despite the above-mentioned progress of KG in geosciences, the gap between geoscience and computer science still makes it hard for many real-world practitioners to see a roadmap for incorporating KGs into data-intensive geoscience research. Semantic technologies (Berners-Lee et al., 2001; Bizer et al., 2011) are a key topic of KG in existing studies. Narock and Wimmer (2017) conducted a bibliometric analysis of semantic technologies with literature from the American Geophysical Union (AGU) Fall Meetings (i.e., a representative geoscience conference) and the International Semantic Web Conference (ISWC) series (i.e., a representative computer science conference). Their results show that the overlap between AGU and ISWC is minimal. While computer scientists focus more on the precision of their algorithms and the efficiency of big data processing, geoscientists and geoinformaticians focus on the actual improvement enabled by semantic technologies in their geoscience work (cf. Hogan 2020; Hitzler 2021). Compared with KG construction and application in biology and biomedical studies (e.g., Ashburner et al., 2000; Gene Ontology Consortium, 2019; Nicholson and Greene, 2020), most existing geoscience KGs focus on lightweight semantics, and their applications are limited to data harmonization and integration. Computer scientists can see the potential of deeper applications of KGs in geosciences, but geoscientists would like to see a list of KG technologies that can guide them from simple to sophisticated applications (4D Initiative, 2018; Gil et al., 2019; NASEM, 2020; Wang et al., 2021).
The purpose of this paper is to review the existing work on KGs in geosciences, summarize the best practices, and discuss the trends of KG construction and application. The remainder of the paper is organized as follows. Section 2 summarizes the concepts associated with KG and ways to construct a KG in geosciences. Section 3 focuses on the progress of KG applications in geoscience data collection, curation, and service. Section 4 summarizes KG applications in geoscience data analysis, including the topics of data mining processes, social media and literature data, image analysis, vector data, and integrated applications. Section 5 discusses the trends in the near future. Finally, Section 6 concludes the paper. We hope this review will be beneficial to geoscientists who would like to deploy KGs in their data-intensive studies.

Knowledge graph construction: associated concepts and approaches
A KG, in its broad sense, can be envisioned as a group of nodes connected by edges, where the nodes represent entities in the real world and the edges represent the relationships between those entities. This view is a good way to lower the barrier to entry for geoscientists who want to work on KGs. However, it is important to note that a graphic conceptual map is just the beginning stage. A more functional part of a KG is the logical assertions we can add to the nodes and edges, and the capability of reasoning and inference enabled by them.

A spectrum of knowledge graphs
As introduced in Hogan et al. (2020), Abu-Salih (2021), and Gutierrez and Sequeda (2021), the work on KGs in AI has a close relationship to scientific advancements in the Semantic Web, databases, knowledge engineering, natural language processing, and ML. In the past decades, the approach of an ontology spectrum (Welty, 2002; McGuinness, 2003; Obrst, 2003; Uschold and Gruninger, 2004) has established a roadmap for many researchers to build vocabularies, schemas, and ontologies to meet the needs of various applications. Intuitively, we can adapt that approach to establish a KG spectrum (Figure 1) to guide KG construction in geosciences.
For all the KG types in Figure 1, there are existing examples in geosciences. Here we compare the characteristics of those types by using real-world examples. Catalogs and glossaries are often seen at the end of a book. They are normally an alphanumerical list of keywords for the content of the book. In some glossaries, each keyword is followed by all the page numbers where the keyword appears, which offers readers a quick overview of the major subjects of a book. Some glossaries are also published independently, such as the Glossary of Geology (Neuendorf et al., 2011). Taxonomy is the classification of concepts, which often shows a supergroup-subgroup structure. For example, paleobiologists use the taxonomy of domain, kingdom, phylum, class, order, family, genus, and species in the classification of life. In the geologic time scale, there is a hierarchical structure of eon, era, period, epoch and age. The periodic table arranges chemical elements by their atomic number and electron configuration, and demonstrates the periodic trends in the rows and columns of the table. A thesaurus, sometimes called a controlled vocabulary, is like a mixture of glossary and taxonomy, in which the terminology is organized within a hierarchy. The Glossary of Geology (Neuendorf et al., 2011), although organized in an alphabetical structure, shows such taxonomical information in the annotation of some terms. There are more typical examples of geoscience thesauri (e.g., AQSIQ, 1988; Rassam et al., 1988; Gravesteijn et al., 1995; CCOP and CIFEG, 2006), and an interesting pattern among them is the inclusion of multilingual labels. Recently, many thesauri (e.g., Caracciolo et al., 2013; Stevens, 2019) were also encoded with semantic technologies, such as the Simple Knowledge Organization System (SKOS) (Miles and Bechhofer, 2009).
Conceptual schemas, also called conceptual models, are often seen in the design of data structures for relational databases. Sometimes there is a formal superclass-subclass relationship between two entities in a schema, where a subclass inherits all the properties of the superclass. The Unified Modeling Language (UML) is widely used in the design of conceptual schemas. A good example is the conceptual model for geologic maps in North America (NADM Steering Committee, 2004). There were also conceptual schemas designed for data exchange on the Internet, such as GeoSciML (Sen and Duffy, 2005). The INSPIRE program, a pan-European spatial data infrastructure, is developing data and metadata schemas for 34 subjects in Earth and environmental sciences, with full implementation targeted for 2021 (Bartha and Kocsis, 2011). Ontology with formal logical assertions is the last type on the KG spectrum (Figure 1). Each ontology is the formal specification of a shared conceptualization of a domain (Gruber, 1995). Semantic technologies such as the Resource Description Framework (RDF) (Klyne and Carroll, 2004) and the Web Ontology Language (OWL) (McGuinness and van Harmelen, 2004) are widely used to add logical assertions on classes and properties in an ontology, such as disjoint classes, equivalent classes, transitive properties, and more. A well-known ontology in Earth and environmental sciences is SWEET (Raskin and Pan, 2005). There are also ontologies built for themed geoscience subjects, such as geologic time (Cox and Richard, 2015), hydrology (Brodaric et al., 2019), hydrogeology (Tripathi and Babaie, 2008), structural geology (Babaie et al., 2006), fractures (Zhong et al., 2009), and sensor networks, just to name a few.
As reflected by the spectrum in Figure 1, a KG in real-world geoscience applications is often a mixture of TBox and ABox. The former consists of the classes and properties representing a domain (cf. the right part of Figure 1), and the latter consists of the instances of those classes (cf. the left part of Figure 1). The level of semantic detail needed in a KG is decided by the needs of the research activities.
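To make the TBox/ABox distinction and the role of logical assertions more concrete, the following is a minimal sketch in plain Python, not an implementation of RDF or OWL. The class names, the instance sample_001, and the predicate names (subClassOf, partOf, type) are illustrative assumptions and are not taken from any published geoscience ontology. The sketch stores triples, declares two predicates as transitive, and derives the facts that follow from those assertions.

```python
# TBox: classes and their relationships (illustrative names only).
TBOX = {
    ("IgneousRock", "subClassOf", "Rock"),
    ("Granite", "subClassOf", "IgneousRock"),
    ("Basalt", "subClassOf", "IgneousRock"),
}

# ABox: instances and instance-level facts (hypothetical identifiers).
ABOX = {
    ("sample_001", "type", "Granite"),
    ("Holocene", "partOf", "Quaternary"),
    ("Quaternary", "partOf", "Cenozoic"),
}

# Logical assertion: these predicates are declared transitive.
TRANSITIVE = {"subClassOf", "partOf"}


def transitive_closure(triples, predicates):
    """Add (a, p, c) whenever (a, p, b) and (b, p, c) hold for a transitive p."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for a, p, b in list(inferred):
            if p not in predicates:
                continue
            for b2, p2, c in list(inferred):
                if p2 == p and b2 == b and (a, p, c) not in inferred:
                    inferred.add((a, p, c))
                    changed = True
    return inferred


def infer_types(triples):
    """Propagate each instance's type up the (closed) subclass hierarchy."""
    closed = transitive_closure(triples, TRANSITIVE)
    extra = set()
    for s, p, o in closed:
        if p != "type":
            continue
        for sub, p2, sup in closed:
            if p2 == "subClassOf" and sub == o:
                extra.add((s, "type", sup))
    return closed | extra


kg = infer_types(TBOX | ABOX)
print(("sample_001", "type", "Rock") in kg)
print(("Holocene", "partOf", "Cenozoic") in kg)
```

Neither of the two printed facts is stated explicitly in the input; both follow from the transitivity assertions. In practice, such reasoning would be delegated to an OWL reasoner rather than hand-written loops.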

How to build knowledge graphs
KG construction is an iterative engineering process in which many methods and tools can be applied (Fox and McGuinness, 2008). The existing approaches can be grouped into two clusters: top-down and bottom-up. The top-down approach stems from the modeling process in database construction (Figure 2). First, a subject domain and a list of research needs are identified. Second, a conceptual model is designed to collect the entities of interest, their inter-relationships, and the categories. A useful tool for conceptual modeling is CmapTools (Cmap, 2021). Third, the logical and physical models add logical representation and assertions to the collected entities and relationships. Fourth, the technical development and implementation need to consider the coding language to use (e.g., RDF and OWL), the serialization formats (e.g., RDF/XML, Turtle, and JSON-LD), and the KG development platforms such as Protégé (Tudorache et al., 2008) and DOGMA (Spyns, 2008). The last step is to deploy the KG as a service so that the community can reuse it and provide feedback. In general, this is a process of transforming the knowledge in the domain experts' minds into a machine-readable representation. Many existing geoscience KGs were constructed through this approach, such as the schema for mineral classification (Garvie, 1995), the SWEET ontology (Raskin and Pan, 2005), the GeoCore ontology (Garcia et al., 2020), and the other examples mentioned in Section 2.1. Recently, the Deep-time Digital Earth (DDE) Big Science Program of the International Union of Geological Sciences built its own platform for building and serving KGs (Shi et al., 2020; Wang et al., 2021). KG practitioners can also refer to summaries and reviews of KG development tools (e.g., Corcho et al., 2003; Slimani, 2015; W3C, 2015) to find a good match for their work. The bottom-up approach of KG construction is based on crowd-sourced data, such as social media and the legacy literature. 
Earlier discussions include mining Web content to build knowledge bases (Craven et al., 2000) and using an observation-driven approach in geo-ontology engineering (Janowicz, 2012). The thriving social media and open access to published literature further extend the scope of data sources that can be used in KG construction. The number of publications following this bottom-up approach has increased significantly in recent years. For example, Gao et al. (2017) used Hadoop to process geotagged data in Flickr and successfully built gazetteers in geography. Zhu et al. (2017), Wang et al. (2018b) and Fan et al. (2020) used natural language processing (NLP) and text mining to process geoscience literature (reports, books, journal papers, etc.) and then used the results to guide the process of KG construction. Although the bottom-up approach is able to process a large number of datasets and quickly build a big KG, a remaining challenge is the precise logical representation of, and assertions for, the entities and relationships in the resulting KG. Very often, they still need to be specified by domain experts and knowledge engineers, and existing KGs can be reused in that step.
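The bottom-up idea can be illustrated with a very small sketch: match a seed dictionary against sentences and count which terms co-occur, so that frequent pairs become candidate edges for expert review. This is a deliberately simplified stand-in for the NLP pipelines in the cited studies, not a reproduction of any of them. The seed terms, the example sentences, and the function name candidate_edges are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical seed dictionary; in practice it would come from a
# community vocabulary or thesaurus.
TERMS = {"granite", "quartz", "feldspar", "basalt", "olivine"}


def candidate_edges(sentences, terms):
    """Count how often two dictionary terms appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        found = sorted(tokens & terms)  # sorted so each pair has one canonical order
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Illustrative sentences, not drawn from any real corpus.
docs = [
    "Granite is composed mainly of quartz and feldspar.",
    "The granite sample contains coarse quartz grains.",
    "Basalt commonly contains olivine.",
]

edges = candidate_edges(docs, TERMS)
for pair, n in edges.most_common():
    print(pair, n)
```

The counts only indicate that two terms are related, not how. Deciding that the relationship is, for example, compositional rather than merely associative is exactly the logical specification that still requires domain experts and knowledge engineers.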

Best practices in knowledge graph construction
Researchers have summarized workflows and recommendations for KG construction, and some of them are based on examples from geosciences (Fox and McGuinness, 2008; Kendall and McGuinness, 2019). In particular, they highlighted a use case-driven iterative approach to leverage existing resources and improve the usability of the resulting KG. Figure 3 combines those recommendations with the approaches discussed in Sections 2.1 and 2.2 to present a suggested workflow for building and applying KGs in geosciences. Each use case has a specific topic relevant to the domain, such as discovering datasets with one or a few keywords, recommending algorithms to analyze a certain type of data, or finding researchers who share the same research interests. Domain experts (e.g., geoscientists) work together with knowledge engineers to analyze each use case and obtain a draft list of entities, relationships, categories, and structures. If necessary, the bottom-up approach can also be used to augment the list. Based on the first one or two use cases, a KG prototype can be established and tested. Then more use cases are analyzed in an iterative process to enrich the KG. In this process, some ontology design patterns (Gangemi, 2005; Gangemi and Presutti, 2009; Blomqvist et al., 2016) can be reused and adapted from community standards (e.g., the mineral classification chart, the nomenclature of petrology, and the geologic time scale) as well as existing ontologies and vocabularies (e.g., the SWEET ontology). Ontology design patterns are distinctive and repetitive invariants across the various models, data and processes of a domain. Reusing them will improve the interoperability and usability of the resulting KG. There is a 3C (Correct, Consistent, and Complete) guideline (Asch and Jackson, 2006) to determine an appropriate termination point for the use case analyses. 
Practitioners need to verify that the entities and relationships collected in the KG are correctly defined and annotated, and that they are organized in a consistent structure. Moreover, the established entity and relationship lists and the logical assertions should be complete enough to address the subject areas and research questions proposed at the beginning of the work.
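Parts of the 3C check can be automated. The sketch below is a rough proxy under stated assumptions, not a procedure given in the cited guideline: "correct" is approximated by every entity having a non-empty definition (real correctness needs expert review), "consistent" by every parent being a known entity with no cycles in the hierarchy, and "complete" by every term required by the use cases being present. The function name check_3c and all example data are hypothetical.

```python
def check_3c(entities, parents, required_terms):
    """Report problems under the Correct / Consistent / Complete headings.

    entities:       {name: definition text}
    parents:        {child name: parent name}
    required_terms: set of names that the use cases need
    """
    problems = {"correct": [], "consistent": [], "complete": []}

    # Correct (proxy): each entity carries a non-empty definition.
    for name, definition in entities.items():
        if not definition.strip():
            problems["correct"].append(name)

    # Consistent: parents must exist, and the hierarchy must be acyclic.
    for child, parent in parents.items():
        if parent not in entities:
            problems["consistent"].append(f"{child}: unknown parent {parent}")
        seen = {child}
        node = parent
        while node in parents:
            if node in seen:
                problems["consistent"].append(f"{child}: cycle in hierarchy")
                break
            seen.add(node)
            node = parents[node]

    # Complete: every term required by the use cases is in the KG.
    problems["complete"] = sorted(set(required_terms) - set(entities))
    return problems


# Hypothetical draft KG with one deliberate problem of each kind.
entities = {
    "Rock": "A naturally occurring solid aggregate of minerals.",
    "Granite": "A coarse-grained felsic intrusive igneous rock.",
    "Basalt": "",
}
parents = {"Granite": "IgneousRock", "Basalt": "Rock"}
required = {"Rock", "Granite", "Basalt", "Gabbro"}

report = check_3c(entities, parents, required)
print(report)
```

An empty report would suggest that the current round of use case analysis can stop; a non-empty one points to the entries that need another iteration.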
Once a relatively stable version of the KG is generated, a service can be set up for it, either through an individual server or a community portal (right part of Figure 3). As workflow platforms such as Jupyter (Jupyter, 2021) and RMarkdown (RStudio, 2021) are increasingly used by geoscientists for data-driven discoveries, it is a good practice to develop a Python or R package as the interface to the KG server. Users can then use the KG from workflow platforms together with many other data and model resources in the open science world. They can also provide feedback to the KG developers. As the FAIR (findable, accessible, interoperable, and reusable) data principles (Wilkinson et al., 2016) are widely accepted in the open data endeavors of various disciplines, there have also been discussions on how to build FAIR KGs. For example, Cox et al. (2020) proposed "Ten Simple Rules" towards FAIR vocabularies: 1) Verify the license for repurposing a legacy vocabulary; 2) Determine the governance model and custodian for the legacy vocabulary; 3) Check minimal term definition completeness; 4) Select a domain and service for the Web identifiers; 5) Design a pattern for the identifier scheme; 6) Reuse semantic standards for the vocabulary to increase its interoperability; 7) Add rich metadata to increase reusability; 8) Register the vocabulary to increase findability; 9) Make the Web identifiers resolvable to increase accessibility; and 10) Implement a mechanism for maintaining the FAIR vocabulary.

Knowledge graphs in geoscience data collection, curation and service
Geoscientists have realized the importance of using machine-readable standards in data collection and management since the 1950s, when they began to use digital computers. Many publications have discussed topics associated with KG, such as consensus on data models (Dillion, 1964; Hubaux, 1970, 1972, 1973), semantic symbols and nets (Dixon, 1970; Garvie, 1995), controlled vocabularies (Rassam and Gravesteijn, 1982; Shimomura, 1989), rules for spatial data manipulation (Buttenfield and McMaster, 1991; Chung and Fabbri, 1993), and more. Now, in the era of the Internet and Web, KG still takes an essential role in geoscience data management, and there is new progress on applying KGs for open and FAIR data.

Knowledge graphs and FAIR data
While almost all geoscientists are using computers in their work, many people spend about 80% of their time on data preparation before analysis (i.e., the 80/20 rule) (Press, 2016; Mons, 2018; Fox, 2019). The FAIR data principles (Wilkinson et al., 2016) emphasize the machine-readability and machine-actionability of data, i.e., improving the capacity of computer systems to find, access, interoperate, and reuse data. In that way, manual intervention and operation by human scientists will be reduced to the minimum, thus mitigating or even reversing the 80/20 rule. The FAIR principles have been well received by researchers in various disciplines in the past five years. In particular, the geoscience communities have not only shown support but also analyzed the challenges and drafted action items towards FAIR data in geosciences (Stall et al., 2018, 2019). Here we would like to address the close relationship between the FAIR principles and the theories and technologies of KG (Table 1). Findability and accessibility rely on the cyberinfrastructure for persistent and stable identifiers and on the protocols and interfaces to resolve those identifiers and retrieve the metadata associated with them. Most of the principles under those two themes have light to medium relevance to KG. In comparison, most items under interoperability and reusability can be directly supported by KGs (Mons, 2018; Guizzardi, 2020). The FAIR principles can also be compared to the Five-Star Open Data scheme proposed by Berners-Lee (2009). Hasnain and Rebholz-Schuhmann (2018) conducted a detailed mapping between the FAIR principles and the Five-Star scheme, and showed that they share topics on identifiers, metadata, vocabularies and community standards. Although the FAIR principles were proposed only recently, there have been many earlier efforts working on various items covered in the principles, and some of them highlighted the use of KGs. 
For example, in the Virtual Solar-Terrestrial Observatory (Fox et al., 2009), a set of OWL-based ontologies was developed to represent the concepts, relationships and attributes in the fields of solar physics, space physics and solar-terrestrial physics. The ontologies were then used to reconcile distributed and heterogeneous datasets and present them to the end users in an organized form. In the OneGeology map data portal (Jackson, 2007), a common geologic data schema, GeoSciML (Sen and Duffy, 2005), was used to mediate distributed map services from more than one hundred countries across the world. In OneGeology-Europe (Laxton, 2017), multilingual vocabularies were developed for rock age and type, and were used to support federated data queries sent to map services in different languages. In the EarthCube GeoLink project (Krisnadhi et al., 2015; Cheatham et al., 2018), the method of ontology design patterns was used to develop a modular ontology to support data integration from seven geoscience data repositories. The Google Dataset Search was released in 2018. It is based on Schema.org, which provides metadata schemas to mark up datasets shared on the Web (Noy et al., 2019a). Numerous geoscience datasets can already be discovered through the Google Dataset Search. Researchers in the EarthCube GeoCODES project have been conducting more case studies to adapt and extend Schema.org, with the aim of building best practices to enable cross-domain discovery of and access to geoscience data and research tools (Shepherd et al., 2019). Another interesting line of work is using ontologies to represent the FAIR principles and evaluate the FAIRness of open data. Examples can be seen in Alowairdhi and Ma (2019) and Brewster et al. (2020).
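As an illustration of the Schema.org markup mentioned above, the sketch below builds a JSON-LD description of a dataset with Python's standard json module. The property names (name, description, url, keywords, license) and the Dataset type are part of the Schema.org vocabulary. The helper name dataset_markup, the dataset itself, and the example.org address are hypothetical, and the sketch does not reflect any specific recommendation from the GeoCODES project.

```python
import json


def dataset_markup(name, description, url, keywords, license_url):
    """Build a minimal Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "license": license_url,
    }


# Hypothetical dataset used only to illustrate the structure of the markup.
record = dataset_markup(
    name="Example geologic map dataset",
    description="A hypothetical dataset used only to illustrate the markup.",
    url="https://example.org/datasets/geologic-map",
    keywords=["geologic map", "lithology"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

jsonld = json.dumps(record, indent=2)
print(jsonld)
```

A string like this is typically embedded in the landing page of a dataset inside a script element of type application/ld+json so that crawlers can harvest it. Drawing the keywords from a controlled vocabulary, rather than using free text, is where a KG adds value to such markup.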

Knowledge as a service in open data and open science
When the KGs of a domain are established, one way to continue their maintenance and promote their application is to build a service for them on the Internet and Web. For example, in the field of biology and biomedical studies, the BioPortal provides Web services for various ontologies, which can be used to drive data integration, information retrieval, data annotation, natural language processing, and decision making (Noy et al., 2009; Wetzel et al., 2011). The Web-based concept browsing and graph visualization allow users to quickly see the landscape of a subject domain of interest, while the logical assertions and rules in the KGs can be used in the data integration and analysis processes. The geoscience communities have also taken initiatives to build similar services. For instance, NASA is leading the maintenance and service of the SWEET ontology (Raskin and Pan, 2005) and the GCMD keywords (Stevens, 2019). The former is a foundational ontology that covers more than 200 subject areas and over 6,000 concepts in Earth and environmental sciences. The latter is a hierarchical set of controlled vocabularies covering 14 categories of keywords in Earth science, and has been used in NASA's Earth Observing System Data and Information System (EOSDIS). USGS has been developing and maintaining thesauri with semantic technologies over the past two decades. The current USGS thesaurus service (USGS, 2021) hosts a long list of controlled vocabularies that provide category terms for the data and information products of USGS. IUGS-CGI has also built a website to host the services of the geoscience schemas and vocabularies built by its international working groups (IUGS-CGI, 2021). Researchers have also discussed methods and best practices for building service structures of geoscience KGs (Cox and Richard, 2015; Cox et al., 2020; Ma et al., 2020). 
Very recently, the Semantic Technologies Committee of the Federation of Earth Science Information Partners (ESIP) has established a community ontology repository (COR) (ESIP, 2021) to host KGs from the geoscience communities, coordinate collaboration, and promote best practices.
A recent topic of high interest in the geoinformatics community is Knowledge as a Service (KaaS). Besides the service capabilities mentioned in the above paragraph, another key advantage of KaaS is that it provides context information for data and data science processes. A key work in the Semantic Web community, the Provenance Ontology (PROV-O) (Lebo et al., 2013), has been widely applied in the past years to enable the documentation of context information. Provenance literally means the origin of something. In data science it means chaining up scientific results and findings with the various data, methods, platforms, instruments, people, and organizations involved in a research activity (Groth et al., 2012). For example, in the Global Change Information System (GCIS) of the U.S. Global Change Research Program, a PROV-O-based GCIS ontology was built to capture the provenance of global change research. The collected information was published on the GCIS portal (Tilmes et al., 2013; Ma et al., 2014b). In the work on Essential Climate Variables in Europe, approaches similar to GCIS have also been taken to enable traceability of scientific results (Zeng et al., 2019). The granularity of provenance can go even deeper, to steps in algorithms and data analytics workflows. For instance, the METACLIP R package developed by Bedia et al. (2019) was able to capture the detailed steps in an R workflow (e.g., raw data input, derived data, package imports, functions, and variables) that lead to a resulting image. In the work of Stasch et al. (2014), KGs were used to suggest appropriate steps in spatial statistics for certain structures and patterns in the input data. An increasingly discussed topic in computer science today is explainable AI (Hagras, 2018; Lundberg et al., 2020). Provenance, semantic technologies, and KGs will make solid contributions to that field of work (cf. Goebel et al., 2018; Palmonari and Minervini, 2020).
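The idea of chaining a result back to its inputs can be sketched with a few triples that use PROV-O property names (prov:wasGeneratedBy, prov:used, prov:wasAssociatedWith). The property names are from PROV-O; every identifier (figure_1, plot_run, and so on) and the function name lineage are hypothetical, and the sketch is not how GCIS or METACLIP store or query provenance.

```python
# Provenance triples using PROV-O property names; identifiers are hypothetical.
PROV = [
    ("figure_1", "prov:wasGeneratedBy", "plot_run"),
    ("plot_run", "prov:used", "derived_table"),
    ("plot_run", "prov:wasAssociatedWith", "analyst_a"),
    ("derived_table", "prov:wasGeneratedBy", "regrid_run"),
    ("regrid_run", "prov:used", "raw_observations"),
]


def lineage(entity, triples):
    """Walk backwards from a result and list every data entity it depends on."""
    sources = []
    visited = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s != current or o in visited:
                continue
            if p in ("prov:wasGeneratedBy", "prov:used"):
                visited.add(o)
                frontier.append(o)
                if p == "prov:used":
                    # Only inputs consumed by an activity count as sources.
                    sources.append(o)
    return sources


print(lineage("figure_1", PROV))
```

For the example triples, the walk from figure_1 reaches derived_table and then raw_observations, which is the kind of traceability that a provenance-aware portal exposes to its users.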

Best practices of applying knowledge graphs for data curation in the data ecosystem
Researchers have argued that the power of machine learning and big data processing does not mean we can simply dump all the digital records without any structure or order and rely on machines to find patterns out of the chaos: if the data is the train, then semantics will be the rail. An essential goal of the Web is to promote interconnection, interaction, and intercreation among different people, resources, and facilities (Berners-Lee and Fischetti, 2000). Now, the open data and open science activities have created a data ecosystem on the Internet and Web (Berman, 2008; Wing, 2019). This is a socio-technical system of many interacting factors. The technical part covers many topics relevant to data collection, curation, distribution, analysis, and communication. The social part covers topics of data privacy, license, ethics in data access and reuse, citation guidelines, feedback from data consumers, trustworthiness, informed decision making, and more. Appropriate handling of those issues will help establish a virtuous cycle in the data ecosystem to facilitate data-driven science. The W3C community has summarized a list of best practices for the publication and application of data on the Web and their benefits to the data ecosystem (Loscio et al., 2017). Table 2 puts the list together with the FAIR data principles and shows the relevance of each best practice to KGs. As reflected in the table, the following items have strong relevance to KGs: metadata and annotation, provenance of data source and origin, standards and vocabularies, and data structure and formats. In particular, for data on the Web, vocabularies, models and ontologies enabled by semantic technologies will be a big advantage in increasing machine accessibility and readability. We currently mark a light relevance between KGs and data identifiers. 
However, there are many interacting factors in the data ecosystem, such as platforms and instruments, people, organizations, research programs, models and algorithms, software packages and functions, workflows and model runs, among others. If we want to offer formal definitions for the categories and properties of those factors and then assign unique identifiers to all of them, then KGs will also take a fundamental role in that work.

Knowledge graphs in geoscience data analysis
A good way to envision the role of KG in geoscience data management and analysis is to put it in the context of the data-information-knowledge-wisdom (DIKW) model (Figure 4). Conventionally, people think of DIKW as a one-directional process, in which the steps of knowledge and wisdom rely more on human experience and decision-making. KGs complement the DIKW process by encoding human knowledge in machine-readable formats, which can be applied to aid data management and analysis. Section 3 has given a summary of KGs in geoscience data management. This section focuses on KGs in geoscience data analysis. In geoinformatics and geomathematics, researchers have for decades discussed embedding qualitative AI methods in quantitative data analysis models (e.g., Bugaets et al., 1991; Dimitrakopoulos, 1993). Now, big geoscience data such as literature and crowd-sourced records, remote sensing images, and accumulated digital maps pose both challenges and opportunities for the application of KGs in data analysis.

Figure 4
The role of machine readable knowledge graphs in the data-information-knowledge-wisdom model.

Knowledge graphs and literature and crowd-sourced data analysis
Textual records are a unique type of big data in geosciences, and they are widely distributed in published literature and crowd-sourcing data platforms. KGs such as community-level dictionaries and ontologies have been used to aid NLP and text mining in geoscience literature analysis. Typical use cases include: 1) summarizing and visualizing the key information of a document in a graph; 2) inter-comparison of themes and writing patterns of chapters/sections in a long document; 3) domain-specific gazetteer or corpus construction; and 4) KG augmentation and iterative usage in text mining. Wang et al. (2018b) used community-level standards, including geological dictionaries and terminology classification schemes (AQSIQ, 1988), to build a large corpus, then used it to train word segmentation rules and applied them together for processing geologic reports. The results included word frequency diagrams, word clouds, bigrams showing clusters of key content words, and chord graphs showing inter-relationships between content words. The results uncover the key subjects and structure of a document, and show the potential of KG augmentation based on multi-document analysis. In Qiu et al. (2020a), spatial and temporal gazetteers were built to support information extraction from literature. The spatial gazetteer included place names and spatial relationships well known in geosciences, and the temporal gazetteer included both the geologic time scale and general temporal expressions in the Gregorian calendar form. In Qiu et al. (2020b), a geoscience dictionary matching step was used to guide a bidirectional long short-term memory (LSTM) neural network in text classification.
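The dictionary-guided text mining described above can be illustrated with a minimal sketch. It is not the method of Wang et al. (2018b) or Qiu et al. (2020b): the term list, the example sentences, and the function names are hypothetical, and English word-boundary matching stands in for trained word segmentation. The sketch matches dictionary terms in each sentence and counts pairs of consecutive terms, which is one simple way to obtain the content-word bigrams mentioned above.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary standing in for a community-level geological
# dictionary; real dictionaries contain many thousands of terms.
GEO_TERMS = {"granite", "quartz", "feldspar", "fault", "gold deposit"}

def match_terms(text, terms):
    """Return the dictionary terms found in the text, in order of appearance."""
    lowered = text.lower()
    hits = []
    for term in terms:
        # Word boundaries prevent matches inside longer words.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((m.start(), term))
    return [term for _, term in sorted(hits)]

def cooccurrence_bigrams(sentences, terms):
    """Count pairs of consecutive dictionary terms within each sentence."""
    counts = Counter()
    for sentence in sentences:
        found = match_terms(sentence, terms)
        counts.update(zip(found, found[1:]))
    return counts

# Hypothetical sentences standing in for a geologic report.
report = [
    "The granite contains quartz and feldspar.",
    "A fault cuts the granite near the gold deposit.",
    "Quartz veins follow the fault.",
]
bigrams = cooccurrence_bigrams(report, GEO_TERMS)
```

The resulting counts could feed a chord graph or a candidate list of relationships for KG augmentation, with domain experts reviewing the candidates.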
In the field of geoscience literature mining, the work of GeoDeepDive (Zhang et al., 2013; Peters et al., 2017b) is worth a special note. GeoDeepDive is a machine learning package and digital library for discovering data and knowledge from published literature. Many publishers in the field of geosciences, such as Elsevier, Wiley, Taylor & Francis, USGS, the Society for Sedimentary Geology, the Geological Society of America, Canadian Science Publishing, and PubMed have signed agreements to set up full-text access for GeoDeepDive. By March 2021, GeoDeepDive had preprocessed more than 13.4 million documents, and set up interfaces and guidelines to allow other researchers to use the data. Peters et al. (2014) successfully used GeoDeepDive to extract fossil records and enhance the Paleobiology Database, which in turn has benefited several recent data-driven studies (e.g., Peters et al., 2017a; Muscente et al., 2018). The workflow of GeoDeepDive (Peters et al., 2017b) shows that a good way to rescue dark data from literature is to ingest a structured vocabulary with specific scientific foci. The terms in the vocabulary can then be indexed against the preprocessed literature in GeoDeepDive to create a subset of documents for data extraction.
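The vocabulary-indexing step described above can be sketched as follows. This is an illustration only and does not use the GeoDeepDive interfaces: the vocabulary, the tokenized documents, and the function name `build_index` are hypothetical. The sketch maps each preferred label (and its alternative labels) to the documents that mention it, then derives the subset of documents worth passing to data extraction.

```python
from collections import defaultdict

# Hypothetical structured vocabulary: preferred label -> alternative labels.
VOCABULARY = {
    "stromatolite": ["stromatolites", "stromatolitic"],
    "trilobite": ["trilobites"],
}

# Hypothetical preprocessed documents: identifier -> lowercase tokens.
DOCUMENTS = {
    "doc-1": ["stromatolitic", "limestone", "of", "proterozoic", "age"],
    "doc-2": ["seismic", "survey", "of", "the", "basin"],
    "doc-3": ["trilobites", "and", "stromatolites", "in", "cambrian", "shale"],
}

def build_index(vocabulary, documents):
    """Map each preferred label to the set of documents that mention it."""
    label_of = {}
    for preferred, alternatives in vocabulary.items():
        for label in [preferred, *alternatives]:
            label_of[label] = preferred
    index = defaultdict(set)
    for doc_id, tokens in documents.items():
        for token in tokens:
            if token in label_of:
                index[label_of[token]].add(doc_id)
    return dict(index)

index = build_index(VOCABULARY, DOCUMENTS)
# Documents mentioning at least one vocabulary term form the working subset.
subset = sorted(set().union(*index.values()))
```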
Another type of textual data is collected through crowd-sourcing, such as social media platforms, news reports, and citizen science Web portals. Such data have been increasingly used in hazard mitigation, public health surveillance in space and time, and other themed geoscience studies. A review of social media data analysis (Ravi and Ravi, 2015) shows that lexica are functional in opinion mining and sentiment analysis. In the context of that paper, a lexicon is a controlled vocabulary of sentiment words with their respective sentiment polarity and strength values. Lexica can be used together with ontologies to enable reasoning and inference tasks. A similar technical approach was seen in Wang and Stewart (2015), but on a different scientific topic: hazard information extraction from news reports. In their work, ontologies were used together with natural language gazetteers to improve the quality of hazard event extraction from online news reports. Then, the spatiotemporal patterns (i.e., occurrence and evaluation) of those events were analyzed. In Jayawardhana and Gorsevski (2019), ontologies were used for similarity computation, with the aim of tackling the heterogeneous labels in Tweets and maximizing the detection of influenza. Another interesting example of crowd-sourced data and KG construction and application is Mindat (Mindat, 2021). It is a leading web portal on minerals and their localities, deposits, and mines worldwide. By March 2021, Mindat had more than 55,000 users, about 6,000 of whom had contributor rights. Many Mindat data, such as alternative names of mineral species and literal records of localities, depend on users with local expertise of a certain region to cleanse and reconcile the records. In the meantime, the Mindat team has applied community standards such as nomenclatures in mineralogy and petrology, taxonomy in paleobiology, and terminology in geologic time, and has set up mappings between community standards and the alternative names. Mindat has underpinned a large number of data-driven geoscience studies in recent years.

Knowledge graphs and geographic object-based image analysis
Geographic Object-based Image Analysis (GEOBIA) is a new paradigm for remote sensing image analysis in addition to the conventional "per-pixel paradigm". Here the image-objects are meaningful entities or scene components that are distinguishable in an image, such as a house, a tree, or a vehicle (Blaschke, 2010). Ontologies and semantics are key components in the workflow of GEOBIA as they provide a machine-readable representation of objects in the real world (Figure 5). Blaschke et al. (2014) noted that there are no one-size-fits-all ontology solutions even for the same types of objects in GEOBIA. As reflected in Figure 5, the GEOBIA workflow is normally an iterative process. For the domain of the image-objects, ontologies will be constructed to capture the knowledge of domain experts, and will be used together with a rule set in image analysis. The initially generated image-objects will be classified and enhanced iteratively by applying the ontology and the rule set. In this process, the ontologies can also be extended or updated. Although the focus of Figure 5 is image analysis, the iterative workflow in it can be compared to Figure 3. Another thought is that the KG engineering workflow in Figure 3 can be used to extend the ontology engineering step in GEOBIA.

Figure 5
An overview of the iterative workflow in GEOBIA (adapted from Blaschke et al., 2014).
GEOBIA, the "per-object paradigm", and the methodology of incorporating ontologies and semantics in image analysis have received significantly increasing attention in the past two decades (Liu et al., 2007; Arvor et al., 2013; Blaschke et al., 2014; Gu et al., 2017; Arvor et al., 2019). There have been successful applications of this new paradigm of remote sensing image analysis in many geoscience domains. In Drăguţ and Blaschke (2006), nine classes were built to represent landform elements based on the surface shape and the altitudinal position of objects. The classes were defined using flexible fuzzy membership functions and were successfully used for automated classification of landform elements in two case studies.
To detect and classify off-shore oil slicks, Akar et al. (2011) applied object-based classification with fuzzy membership functions derived from the features of categorized scenes in the ENVISAT Advanced Synthetic Aperture Radar (ASAR) imagery. The parameters of the detection algorithms were tuned for each category to improve the quality of results. In de Bertrand de Beuvron et al. (2013), an ontology was built to represent urban objects and the spatial relationships between them, which came to be a powerful support for object-based image analysis in urban environment studies. Kohli et al. (2012, 2013) built ontologies of slums by using indicators related to the morphology of the built environment, and successfully used them for slum identification from very high resolution imagery (i.e., GeoEye-1). In Belgiu et al. (2014), an ontology was created to represent three classes of building types, and then used in a GEOBIA process to identify buildings extracted from airborne laser scanning data. The Random Forest classifier was applied to select the relevant features for predicting the classes of interest. An interesting finding of their work is using the Random Forest classifier to predict the explanatory power of the input variables (i.e., Variable Importance), which was addressed again in a review article later (Belgiu and Drăguţ, 2016). From our point of view, the Variable Importance can also be used to augment ontology engineering in the iterative GEOBIA process (cf. Janowicz, 2012).
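The role of fuzzy membership functions in the object-based classifications above (Drăguţ and Blaschke, 2006; Akar et al., 2011) can be illustrated with a minimal sketch. The three classes, their attribute ranges, and the function names are hypothetical and are not taken from those studies. Each class is defined by trapezoidal memberships over two object attributes, and the minimum operator is assumed as the fuzzy AND.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership: 0 outside (a, d), 1 on [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge

# Hypothetical class definitions: slope in degrees, and relative elevation
# scaled from 0 (valley floor) to 1 (crest).
CLASSES = {
    "plain": {"slope": (-1, 0, 2, 5), "elevation": (-0.1, 0.0, 0.3, 0.5)},
    "slope": {"slope": (2, 8, 30, 40), "elevation": (0.2, 0.3, 0.7, 0.8)},
    "peak": {"slope": (-1, 0, 10, 20), "elevation": (0.7, 0.9, 1.0, 1.1)},
}

def classify(obj):
    """Score an image-object against every class; return best label and scores."""
    scores = {
        name: min(trapezoid(obj[attr], *params) for attr, params in rules.items())
        for name, rules in CLASSES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical image-object with mean slope and relative elevation.
label, scores = classify({"slope": 12.0, "elevation": 0.5})
```

Because the class definitions live in a plain data structure, they can be revised between iterations of the GEOBIA workflow without touching the classification code.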

Knowledge graphs and digital map analysis
If remote sensing images are the big raster data, then the digital maps and associated databases are the big vector data. In the domain of cartography and GIScience, the incorporation of semantics and KGs into spatial data service and analysis has been an active research topic for decades (Lüscher et al., 2009; Janowicz et al., 2010; Li et al., 2014; Gould and Mackaness, 2016). Many of those studies build on the standards and building blocks established by the Open Geospatial Consortium (OGC) and W3C. In particular, for online spatial data processing, Yue et al. (2007, 2011) have done extensive work to establish spatial data processing service chains by integrating semantic technologies and spatial data services. Now, the FAIR data principles (Wilkinson et al., 2016) and the Five-Star Open Data scheme (Berners-Lee, 2009) are driving spatial data to be made open in more structured and interoperable forms. OGC and W3C are also working on more powerful fundamental KGs for spatial data. For example, GeoSPARQL (Battle and Kolas, 2011) has incorporated spatial topology and the Time Ontology (Cox and Little, 2020) has included temporal topology. Those endeavors together have laid the foundation for more innovative approaches to online spatial data analysis (Varanka and Usery, 2018).
Geologic mapping is fundamental work in geosciences, and has seen many studies on developing and implementing KGs. When GIS software was first introduced to field geologic mapping in the early 2000s, geoscientists already began to use ontologies to maintain consistent data structures and facilitate interoperability between databases (e.g., Brodaric, 2004; De Donatis and Bruciatelli, 2006). As digital geologic maps were increasingly shared online, researchers also began to implement ontologies to mediate multi-source geologic map services, such as those produced by different states in the US (Lin and Ludäscher, 2003). The OneGeology-Europe project (Laxton, 2017) has done some impressive work on data integration. About 20 European states participated in the project to share national geologic map services, but many of those services were originally recorded in the national official language. The project built multi-lingual vocabularies to mediate across those map services. One interesting function is that a user can write a query with English labels of rock age or type, and the vocabularies help translate the query into different languages and send the translated queries to the corresponding services to retrieve records. When the records from multiple services are returned to the user, they are organized in a consistent form as if they were returned from a single European geologic map service. Using the open geologic map services, researchers were able to incorporate data visualization techniques and other open data and knowledge resources to build themed data analysis functions (e.g., Ma et al., 2012; Ma, 2017; Wang et al., 2018a). Similar to the active discussion in cartography and GIScience, KGs in geologic map service and analysis will be a long-lasting research topic (cf. Mantovani et al., 2020).
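The vocabulary-mediated query described above can be sketched in a few lines. This is not the OneGeology-Europe implementation: the concept identifiers, the in-memory "services", and the function names are hypothetical. The sketch resolves an English label to a concept, translates the concept into the label used by each national service, and returns the matched records in one consistent form.

```python
# Hypothetical multilingual vocabulary: one concept identifier, many labels.
VOCAB = {
    "ex:granite": {"en": "granite", "de": "Granit", "fr": "granite", "it": "granito"},
    "ex:limestone": {"en": "limestone", "de": "Kalkstein", "fr": "calcaire", "it": "calcare"},
}

# Hypothetical national map services, each labeled in one language.
SERVICES = {
    "de": [{"unit": "A1", "lithology": "Granit"}, {"unit": "A2", "lithology": "Kalkstein"}],
    "fr": [{"unit": "B1", "lithology": "calcaire"}],
    "it": [{"unit": "C1", "lithology": "calcare"}, {"unit": "C2", "lithology": "granito"}],
}

def concept_for(label, lang="en"):
    """Resolve a label in the given language to its concept identifier."""
    for concept, labels in VOCAB.items():
        if labels.get(lang, "").lower() == label.lower():
            return concept
    raise KeyError(label)

def federated_query(english_label):
    """Query every service in its own language; harmonize the results."""
    concept = concept_for(english_label)
    results = []
    for lang, records in SERVICES.items():
        local_label = VOCAB[concept][lang]  # translate the query term
        for record in records:
            if record["lithology"] == local_label:
                results.append({
                    "service": lang,
                    "unit": record["unit"],
                    "lithology": english_label,  # consistent label for the user
                    "concept": concept,
                })
    return results

hits = federated_query("limestone")
```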

Integrated application of knowledge graphs and machine learning
Compared with KG construction and KGs for geoscience data curation, the application of KGs in geoscience data analysis is still at an early stage, and it is hard to list best practices. However, we are able to summarize some integrated applications of the above-mentioned technologies. A common question from many geoscientists is how KGs and KG-enabled capabilities could be used to drive new discoveries in geoscience, on either scientific or engineering topics. In particular, geoscientists would like to see platforms and applications that lower the barrier to using semantic and AI technologies, such as the Google Dataset Search engine (Noy et al., 2019a) and Question Answering systems (Höffner et al., 2017). The highlights of a few recent examples from both industry and academia are summarized below.
The interweaving of KGs and machine learning has generated successful applications in industry. Marr (2019) listed several recent projects at Google, Oracle, Facebook, Netflix, and Siemens, and described the trends of integrating KGs and machine learning in the field of financial services. In the field of oil and gas exploration, there has been solid progress in using KGs to boost big data processing and aid decision making (Kimbleton and Matson, 2018; Sumbal et al., 2017). Specific examples can be seen in the capabilities enabled by IBM. In Guichet et al. (2019), IBM Watson was used to identify documents relevant to source rock characterization in petroleum exploration. Two types of machine learning algorithms were tested. The first was trained to identify images and charts in literature, and the second was trained to understand the semantic framework of textual records related to source rocks. The two algorithms were applied to extract information from a large number of documents and save the results in a database. Finally, a user interface was built to translate natural language questions into computer queries to the database. The work showed promising performance in finding the most relevant documents. In another work (Bekas and Staar, 2019), a KG was built based on large amounts of geological, physical, and geochemical data. Geoscientists were then able to use the KG to contextualize questions and retrieve relevant information. The work was useful in the identification and verification of alternative exploration scenarios, and can help geoscientists improve decision making.
Putting those examples from industry together with the progress mentioned in the above sections, we can see that the application of KGs in data analysis is often an iterative approach with dual benefits (cf. Ristoski and Paulheim, 2016). KGs can be used to improve data analysis workflows, and in turn KGs themselves can be extended and enhanced when more patterns and information are discovered in data analysis. Recent work on mineral evolution resonates with this approach. Mineral evolution is the study of mineral diversity and distribution through the Earth's long history (Hazen, 2010). Abductive (i.e., exploratory), deductive (i.e., knowledge-driven), and inductive (i.e., data-driven) approaches have all been used in recent studies of this field (Hazen, 2014). A typical example that demonstrates the dual benefits to both KG and data analysis is the natural kind clustering of mineral species. This is a subfield of mineral evolution with the aim of amplifying the current mineral taxonomy. The present mineral classification system is based on idealized major element chemistry and crystal structure; it does not consider time and cannot reflect planetary evolution or formational conditions (Hazen, 2019a,b; Cleland et al., 2020). Natural kind clustering relies on the many attributes of mineral samples to relate each sample to its paragenesis and thereby develop a scheme for classifying the origin of mineral samples when their context is unknown. Two recent studies of natural kind clustering have demonstrated impressive results. The first classified the formational environments of pyrite based on geochemical information, and the second analyzed presolar silicon carbide grains (Boujibar et al., 2020).

A vision for geoscience knowledge graphs in the near future
With data science thriving in geosciences, we anticipate more KGs will be built and implemented. Several recent review and survey articles (Noy et al., 2019b; Hogan et al., 2020; Abu-Salih, 2021; Gutierrez and Sequeda, 2021) have discussed the challenges that KG practitioners face, which are synthesized below:
• KG entity disambiguation and identification, and quality measure: Synonyms, homonyms, and entity types are still active research topics, especially for KG construction from unstructured literature. To sustain KGs in the cyberinfrastructure, the unique, persistent, and Web-resolvable identifier of each entity needs more coordination among different communities. A system of metrics is also needed to measure the quality and usability of KGs.
• Semantic enrichment and reasoning capability: KGs and data are increasingly bound together. A topic worth attention in KGs is the granularity of semantics in the definition and annotation of entities and relationships, as well as how it will address the needs of data curation. Another topic is the reasoning capability enabled by the logic assertions in KGs, which will be necessary to further leverage KG usage in data analysis.
• KG evolution and versioning: Our knowledge is evolving with the progress of scientific discoveries and new understanding of the world. Also, there will be new encoding languages for KGs as well as new KG management systems. Methods and technologies are needed to organize KG evolution and versioning, and to provide KGs as a stable service in the cyberinfrastructure.
• Interconnection among KGs and scaling up in big data applications: The works on KG construction and application are scaling up, and interconnection will be needed between high-level and domain-specific KGs, as well as between KGs of different domains and subjects. Multilingualism is another topic to be addressed when KGs are scaled up and used together with big data analysis.
• Security, privacy and ethics: Similar to the community recommendations and best practices in open data and open science, KGs will also need a system of licenses for sharing and reuse. Also needed are regulations and guidelines for protecting privacy and sensitive information, and recommendations for the ethical operation of KGs.
Sections 2 to 4 in this paper summarized the progress of KG construction and application in geosciences. By incorporating the best practices and exemplar studies from them, this section will discuss the trends of geoscience KGs in the next decade and present a few suggestions for practitioners to address the challenges listed above.

Knowledge graph creation and curation in geosciences
An appropriate workflow for ontology engineering in geosciences is a mixture of the bottom-up and top-down approaches through a use case-driven, iterative process (Figure 3). The bottom-up approach can benefit from the powerful NLP and text mining technologies and the large amounts of accumulated legacy literature and crowd-sourced data. The patterns discovered through big data analysis may reflect interesting rules that are outside the existing human expertise. The top-down approach can bring together researchers sharing the same research interests and leverage existing community standards and ontology patterns. Geoscientists' verification and control can improve the quality and precision of the outcomes from the bottom-up approaches. The adoption of community standards and ontology patterns can reduce inconsistency and duplicated effort in the resulting KGs. The use case-driven, iterative process has proven efficient for facilitating the collaboration between geoscientists and data scientists, as well as increasing the usability of the resulting KGs. The 3C (Correct, Consistent, and Complete) guideline (Asch and Jackson, 2006) and the Ten Simple Rules (Cox et al., 2020) for KG construction were proposed by researchers in the field of geoinformatics, and are applicable to many geoscience topics.
Geoscience KG evolution and curation will need more attention. New entities and relationships can appear in a field of study as our understanding deepens. Also possible is the update and revision to existing definitions and descriptions, as well as the inter-mapping between KGs. Technical approaches are needed to tackle those different situations and take actions to update the KG at different levels, such as numeric and literal attributes, instance records, data properties, object properties, classes, and even the whole KG. The situation can be more complicated as KGs are increasingly bound with steps in the data life cycle (Ma et al., 2014a;BDIWG-NITRD, 2018), such as standardizing the structure of databases and terminology of records, annotating data products, providing precise results in data search and discovery, and enabling innovative operations in data analysis. The ultimate goal is that the updated KGs will benefit the data life cycle, but will that require extra work to update the data and the steps mentioned above? One possible way is to use persistent and resolvable Web identifiers for different types of records in a KG and archive detailed versioning history of any updates. When the content of that KG is used, the identifiers and version codes can be cited.
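The suggestion above, citing a persistent identifier together with a version code, can be made concrete with a minimal sketch. The class name, the `?version=` citation pattern, the example.org identifier, and the definitions are all hypothetical; real KG services may use different versioning schemes.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedTerm:
    """A KG term whose definition history is archived, never overwritten."""
    identifier: str  # hypothetical persistent, Web-resolvable identifier
    history: list = field(default_factory=list)  # (version, definition) pairs

    def revise(self, definition):
        """Append a new definition and return its version number."""
        version = len(self.history) + 1
        self.history.append((version, definition))
        return version

    def definition(self, version=None):
        """Return the definition for a version (latest by default)."""
        if version is None:
            version = len(self.history)
        return self.history[version - 1][1]

    def cite(self, version=None):
        """Return a citable string combining identifier and version code."""
        if version is None:
            version = len(self.history)
        return f"{self.identifier}?version={version}"

term = VersionedTerm("https://example.org/kg/term/basalt")
term.revise("A dark volcanic rock.")            # illustrative first definition
term.revise("A mafic extrusive igneous rock.")  # illustrative revision
citation = term.cite(1)
```

A data product annotated with `citation` keeps pointing at the definition that was in force when the annotation was made, so a later revision of the KG does not silently change the meaning of existing records.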
Community of practice remains an effective way to facilitate the creation, evolution, and curation of geoscience KGs. W3C and OGC have had successful collaborations on large KGs relevant to geosciences, such as GeoSPARQL (Battle and Kolas, 2011) and the Semantic Sensor Network ontology (SSN) (Compton et al., 2012). The Federation of Earth Science Information Partners (ESIP) has created a Community Ontology Repository (COR) (ESIP, 2021) to host many KGs from the geoscience community, such as the SWEET ontology (Raskin and Pan, 2015), the geologic time ontology and vocabularies (Cox and Richard, 2015), the GCMD keywords (Stevens, 2019), and many others. The ESIP Semantic Technologies Committee is also coordinating the revision of a few widely used KGs, such as the SWEET ontology (McGibbney, 2018). The IUGS-CGI continues to lead the creation of geoscience schemas and vocabularies and the coordination of their applications across the world (IUGS-CGI, 2021). The ESIP and IUGS-CGI efforts represent the essential nature of KGs: from the community, by the community, and for the community. Geoscientists in different disciplines have also begun to work with computer scientists to standardize the terminology, data structures, and data formats in their work. A representative example is the PaCTS 1.0 data standard in paleoclimatology, in which both the bottom-up and top-down approaches for KG engineering were applied (Khider et al., 2019). In the United States, academia, industry, and government are jointly promoting a national Open Knowledge Network, with the aim of establishing an open infrastructure that links cross-disciplinary KGs and underpins the cyberinfrastructure ecosystem (Guha and Moore, 2016; BDIWG-NITRD, 2018; Baru, 2018). In that endeavor, community of practice is recommended for increasing the interoperability and reusability of KGs.

Intelligent geosciences underpinned by knowledge graphs
The thriving AI and data science applications are moving geosciences into the "intelligent" stage (Merriam, 2004; Ma, 2018; Gil et al., 2019). As discussed by both computer scientists and geoscientists (Domingos, 2012; USGS, 2021), data alone are not enough to drive scientific discovery. Each data mining, predictive analytics, or machine learning process needs to embody some knowledge or assumptions besides the data that are given. The interaction of data and knowledge in the data science process can be explained with the abductive, deductive, and inductive approaches (Tukey, 1977; Ho, 1994; Hazen, 2014). For example, as illustrated in Figure 6, if there is enough knowledge about the requested attributes of each class, then a deductive approach can be the best option to conduct logic inferences. If not, then the data-driven inductive approach can be applied. The abductive approach is also useful in the open data environment when research is based on other people's data. It means exploring the characteristics of the data and generating assumptions or hypotheses for scientific discovery. Ho (1994) summarized that abduction creates, deduction explicates, and induction verifies. Brodaric (2012) also discussed abduction, deduction, and induction as a virtuous cycle for KG creation and evolution in geosciences.

Figure 6
Inter-comparison of key characteristics of the abductive, deductive, and inductive approaches in data science.
Geoscience KGs need to enrich their embedded semantics to improve the capacity for reasoning, inference, and verification in a data science process. For example, GeoSPARQL (Battle and Kolas, 2011) defines a vocabulary for representing spatial data on the Web. More importantly, it embeds spatial topology in its design and is able to describe various relationships between spatial objects (e.g., point, line, and polygon). On that basis, it supports both quantitative and qualitative queries and spatial reasoning. Similarly, the Time Ontology (Cox and Little, 2020) embeds temporal topology in its design and is able to describe relationships between temporal objects (e.g., instant and interval). Both have been used in many geoscience applications (Ma et al., 2020). For many other subjects in geosciences, such as rock types, mineral species, and fossil species, such detailed semantics are already included in conventional databases, and can be transferred into KGs. Chen et al. (2020) summarized the existing methods of knowledge reasoning into three categories: rule-based reasoning, distributed representation-based reasoning, and neural network-based reasoning. They also listed several applications that can be supported by knowledge reasoning, such as KG completion, question answering, and recommender systems. More specifically, Gil et al. (2019) summarized several geoscience research themes that can benefit from knowledge-rich intelligent systems, including model-driven sensing, trusted information threads, theory-guided learning, and integrative workspaces.
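The kind of qualitative reasoning that temporal topology enables can be sketched with interval relations. The relation names below follow the general Allen interval algebra and are not claimed to match the exact property names of the Time Ontology; the function name is hypothetical, and the boundary ages are approximate values used for illustration only. Ages are written as negative millions of years so that time increases along the number line.

```python
def relation(a, b):
    """Return the qualitative relation of interval a to interval b.

    Each interval is a (begin, end) pair with begin < end.
    """
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if (a0, a1) == (b0, b1):
        return "equals"
    if a0 == b0 and a1 < b1:
        return "starts"
    if a1 == b1 and a0 > b0:
        return "finishes"
    if a0 > b0 and a1 < b1:
        return "during"
    if a0 < b0 < a1 < b1:
        return "overlaps"
    # Every remaining case is the inverse of one of the relations above.
    return "inverse of " + relation(b, a)

# Approximate boundary ages in negative Ma, for illustration only.
TRIASSIC = (-251.9, -201.4)
JURASSIC = (-201.4, -145.0)
CRETACEOUS = (-145.0, -66.0)
MESOZOIC = (-251.9, -66.0)

jurassic_vs_cretaceous = relation(JURASSIC, CRETACEOUS)
jurassic_vs_mesozoic = relation(JURASSIC, MESOZOIC)
```

With relations such as these asserted or derived in a KG, a query like "which units formed during the Mesozoic" can be answered from period names alone, without comparing numeric ages record by record.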
KGs will take active roles in machine learning processes to tackle the challenge of big data. Geosciences are facing a boost of machine learning and deep learning applications (Lary et al., 2016; Bergen et al., 2019; Karpatne et al., 2019; Reichstein et al., 2019), and there is great potential for deploying KGs in those applications. Three types of knowledge-infused learning have been discussed: shallow, semi-deep, and deep. Shallow infusion means using KGs to improve the semantics and conceptual processing of data. Semi-deep infusion means congruent integration of KGs in machine learning techniques, and deep infusion means combining bottom-up statistical intelligence with top-down symbolic intelligence for hybrid intelligent systems. Hogan et al. (2020) presented similar perspectives, and also pointed out that integrated machine learning processes can be a way to update, extend, and improve the KGs. A unique topic in those hybrid, integrated processes is using machine learning to analyze knowledge graphs and/or data in graph forms, which has also been incorporated into the workflow of big data processing (e.g., Li and Chen, 2013; Nickel et al., 2015; Martinez-Rodriguez et al., 2020). The perspectives on knowledge-infused learning and those of Hogan et al. (2020), as well as the recent discussion of AI approaches in GIScience (Li, 2020; Gahegan, 2020), all resonate with the above-mentioned integration of abductive, deductive, and inductive approaches. A few innovative examples of those knowledge-infused intelligent systems have already appeared in geosciences, such as mineral grain recognition (Maitre et al., 2019), rock classification (Ran et al., 2019), petrographic microfacies classification (de Lima et al., 2020), and map service theme classification (Wei et al., 2021). Such systems and applications will significantly increase in the coming years.
KGs are also able to provide support to explainable AI (XAI), which recently has received a lot of attention. In particular, for opaque machine learning processes such as neural networks and genetic algorithms, KGs can help document the provenance of the workflow and improve the interpretability of results. A key feature of KGs is their capability of defining groups or clusters and their associated attributes, which can be leveraged to add a semantic layer to many machine learning algorithms (Lecue, 2020). For example, by explicating typical attributes of instances in a subgroup, KGs can explain the grouping process in a machine learning process and demonstrate the meaning of results (Ristoski and Paulheim, 2016). Geoscientists have used the W3C PROV-O ontology (Lebo et al., 2013) for documenting provenance of data and scientific workflows (e.g., Tilmes et al., 2013;Bedia et al., 2019). Those studies share common topics with XAI. With the wide use of workflow platforms such as Jupyter and RMarkdown in geosciences, there will be more studies of using KGs to improve XAI.
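How provenance statements support the interpretation of results can be shown with a minimal sketch. The property names `prov:wasGeneratedBy`, `prov:used`, and `prov:wasAssociatedWith` come from PROV-O; the `ex:` resources, the in-memory triple list, and the function name are hypothetical, and a real application would typically use an RDF library and a triple store instead.

```python
# Provenance triples (subject, predicate, object) for a hypothetical
# clustering workflow.
TRIPLES = [
    ("ex:clusters", "prov:wasGeneratedBy", "ex:kmeans_run"),
    ("ex:kmeans_run", "prov:used", "ex:clean_table"),
    ("ex:clean_table", "prov:wasGeneratedBy", "ex:cleaning_step"),
    ("ex:cleaning_step", "prov:used", "ex:raw_samples"),
    ("ex:kmeans_run", "prov:wasAssociatedWith", "ex:analyst"),
]

def lineage(entity, triples):
    """Return the upstream activities and entities that led to `entity`."""
    follow = {"prov:wasGeneratedBy", "prov:used"}
    seen, stack = [], [entity]
    while stack:
        node = stack.pop()
        for s, p, o in triples:
            if s == node and p in follow and o not in seen:
                seen.append(o)
                stack.append(o)
    return seen

trace = lineage("ex:clusters", TRIPLES)
```

Walking the graph backwards from a result yields the chain of steps and inputs behind it, which is the documentation an analyst needs when asked why an opaque model produced a given grouping.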

Concluding remarks
Data-intensive geosciences often rely on the collaboration of researchers from different disciplinary backgrounds, such as computer science, statistics, information science, and the various sub-disciplines in geosciences. KGs have proven to be an efficient way to bridge the gap between those disciplines and facilitate communication and collaboration within a team. First, KGs are able to present a quick overview of the major entities, relationships, and structures of the scientific subjects in a research project. Second, there can be smart functions that chain up data, software, research topics, and researchers in the cyberinfrastructure underpinned by KGs, such as those in recommender systems. Third, KGs can be used in data analysis workflows to improve the quality and interoperability of results. Together with the open data environment, advanced data science methods, and innovative data visualization techniques, KGs will make a solid contribution to data-intensive, multi-disciplinary geoscience studies.
This review paper shows that there is a lot of space and flexibility for future work on KG creation and application in geosciences. In the field of the Semantic Web, there is a famous slogan, "A little semantics goes a long way", which is also true for KGs in geosciences. Any KG-based updates to the data life cycle, such as metadata annotation, data discovery, data cleansing and integration, and KG-infused machine learning, will benefit data-intensive geosciences. Usually, researchers need to balance three factors relevant to a KG: expressivity, implementability, and maintainability. Expressivity is the granularity of semantics in a KG; implementability is the usability and usefulness of the KG in real-world applications; and maintainability is the evolution and upgrading of the KG in a long-term perspective.
A higher visibility of KGs in geosciences relies on the appearance of more innovative research results as well as the education of geoscience practitioners, especially students, on this topic. The Living Textbook developed by geoscience researchers and educators (Lemmens et al., 2018) demonstrates several interesting features enabled by KGs. It deploys a concept map to visualize the key knowledge items and their relationships in a course, together with wiki-style text to show the details. Several interactive functions are made available for teachers and students. Teachers can create mind maps to customize the clusters and learning paths of subjects in a course. Students can explore the concept map of the whole course, follow the learning paths created by teachers, and make notes in the text. The Living Textbook not only creates a better learning experience of geosciences but also demonstrates the advantage of KGs to students.
We hope the concept descriptions, exemplar studies, best practices, and trend analyses presented in this paper will be of benefit to both geoscientists and computer scientists, especially those who are working on the creation and implementation of KGs in geosciences.