Biodiversity databases in Russia: towards a national portal

Abstract: Russia holds massive biodiversity data accumulated in botanical and zoological collections, literature publications, annual reports of natural reserves, nature conservation, and monitoring study project reports. While some data have been digitized and organized in databases or spreadsheets, most of the biodiversity data in Russia remain dormant and digitally inaccessible. Concepts of open access to research data are spreading, but the lack of a data publishing tradition and of the use of data standards remains prominent. A national biodiversity information system is lacking, and most of the biodiversity data are not available or the available data are not consolidated. As a result, Russian biodiversity data remain fragmented and inaccessible for researchers. The majority of Russian biodiversity databases do not have web interfaces and are accessible only to a limited number of researchers. The main reason for the lack of access to these resources is that the databases were developed only as local resources. In addition, many sources were developed in desktop database environments, mainly MS Access and, in some cases, earlier DBMSs for DOS, i.e., file-server systems, which lack the functionality to provide access to records through a web interface. Among the databases with a web interface, a few information systems have interactive maps with species occurrence data and allow registered users to upload data. It is important to note that the conceptual structures of these databases were created without taking into account the modern Darwin Core standard; furthermore, some data sources were developed before the first working version of Darwin Core was released in 2001. Despite the complexity and size of the biodiversity data landscape in Russia, interest in publishing data through international biodiversity portals is increasing among Russian researchers.
Since 2014, institutional data publishers in Russia have published about 140 000 species occurrences through gbif.org. The increase in data publishing activity calls for the creation of a GBIF node in Russia, aiming to support Russian biodiversity experts in international data work.


Introduction
Russia plays a key role in world biodiversity conservation, including conservation of Arctic ecosystems: 80% of Arctic species diversity is represented in Russia (Climate Change Impacts in the Russian Arctic, Searching for Ways for Adaptation 2009). More than 17 million km² of the terrestrial area of the Russian territory is comprised of polar deserts, tundra, forest tundra, taiga, mixed and deciduous forests, broad-leaved forests, steppe, semideserts, and subtropics. Mountain regions cover about a quarter of Russia, and significant territories are wetlands. The diversity of ecosystems translates into high species diversity, including more than 12 500 species of vascular plants, over 1500 species of vertebrates, and 100 000 species of invertebrates (The National Strategy for biodiversity conservation in Russia 2002). A few hundred years of the exploration of Russian flora, fauna, and mycota have generated a great body of biodiversity data. These data are found in different museums and herbaria and are reflected in literature data from different countries and researchers. Some data are already digitized and organized in databases, but most of the data are disaggregated and presented in different formats (see below). A central national biodiversity system is missing. The need to create and develop such a resource has been repeatedly discussed, and even though data standards and technology are available, little progress has been observed. The largest international open biodiversity information systems such as the Global Biodiversity Information Facility, GBIF (2016), Encyclopedia of Life, EoL (2016), Integrated Digitized Biocollections, iDigBio (2016), and many others use the international data standards developed by the Taxonomic Database Working Group, TDWG (2016).
Open access technology and data standards allow all interested parties to upload and publish their data through global portals, which improves the discoverability of their data and significantly reduces the cost of working with literature and collections. All of this contributes to the development of international research cooperation.
Published by NRC Research Press 561 Ivanova and Shashkov
Many researchers in Russia remain uninvolved in this activity. However, in recent years, interest in publishing data through gbif.org has emerged, along with efforts to popularize GBIF in the Russian-speaking community. In this paper, we summarize biodiversity data mobilization activities in Russia through the description of biodiversity databases and report progress towards the creation of a national GBIF node in Russia.

Review of Russian biodiversity information systems
Here we summarize descriptions of biodiversity databases using the available information from the literature and from Russian biodiversity information systems accessible via the internet. Even though this summary covers the key biodiversity data resources in Russia, many personal, institutional, and project databases remain unknown to us or are inaccessible through the internet (e.g., Zeltyn and Insarov 1993; Knyazeva et al. 2007; Golub et al. 2009; Kryshen et al. 2009; Chernenkova et al. 2012 and many others). The main reason for the lack of access to these resources is that the databases were developed only as local resources. The authors did not wish to share their data and only announced the existence of the database in publications. Also, many resources were developed in a desktop database environment (most often Microsoft Access), which did not have the functionality to provide access to the data through a web interface. Furthermore, descriptions of such hidden data resources have not been published in the literature. While biodiversity papers do mention databases used for certain analyses, descriptions of the database structure, software, programming languages, and other details are much less visible or missing.
We have reviewed Russian biodiversity databases for the following characteristics: type of data, data standard, number of records, availability of the primary data, and web interface. Based on the content and primary foci of the reviewed systems, they were divided into three groups: occurrence databases (Table 1), taxonomic databases (Table 2), and digital collections (Table 3).

Occurrence databases
A large number of resources differing in volume, quality, and functionality have been developed over the last 20-25 years. In this section, the databases on species distribution are described.
While many databases of various scales exist and operate in isolation, technical specifications of the databases (such as the structure, data formats, and software used) are typically described very poorly.
The analysis of available metadata showed that information about the occurrence of different taxonomic groups of plants and animals is available via the internet (Table 1). Unfortunately, most of these resources have a local data standard even in the case where different resources contain similar data (e.g., Morozova and Borisov 2010 and Dal'ke et al. 2014 or Koropachinsky et al. 1999; Abdrahimov et al. 2011; Biodiversity of Altai-Sayan Ecoregion 2016), and the conceptual structures of the databases were created without taking into account modern international standards (Wieczorek et al. 2012).
Database topics are often repeated. For example, invasive alien species are a very important group and a common target for database creation. One of the most significant initiatives, the information system Alien species of Russia (2016), is maintained by the Institute of Ecology and Evolution of the Russian Academy of Sciences (RAS) and covers plant species, insects, fishes, and mammals. Most of the data on the distribution of alien plant species are presented in the Web-Oriented Geoinformation System of Alien Plant Species of European Russia (Morozova and Borisov 2010). Information on alien plant species can also be found in the information system The Black Data Book of Russian Flora (2016). Furthermore, most of the existing database resources are published as finished, closed systems with little or no updating after their initial release (Table 1); many show no updates for several years. Far too often, databases developed for research projects and hosted by commercial web services become permanently unavailable soon after the completion of the project (Shashkov and Ivanova 2012).
Spatial information on occurrences is often presented as raster images, not through mapping services (Table 1). Undoubtedly, such systems contain important information on biodiversity. Such data are especially important for assessing the biodiversity of insufficiently studied regions. Below, we outline a few key examples of such systems.
The information system on vertebrates in Russia (Vertebrate Animals of Russia 2016) includes information about taxonomic status, distribution, and recordings of voice and acoustic signals. Unfortunately, the web interface of this resource provides only the metadata and does not provide access to the database.
Information Retrieval System for Fauna and Flora in Protected Natural Areas of the Russian Federation (2016) integrates distribution data on fish, amphibians, reptiles, birds, mammals, vascular plants, lichens, mosses, hepaticae, and anthocerotae from Russian protected areas. Information about the biodiversity of Russian protected areas is also available via the portal Protected areas of Russia (2016). This is an ongoing project aiming at a mobilization and generalization of knowledge about the protected areas and providing information support for monitoring these areas.
Information about the biology and distribution of some taxonomic groups is summarized by researchers at the Institute of Biology of Komi. The available data cover the biodiversity of dipteran insects of the "gnus" (midges) complex (parasitic Diptera) of northeastern European Russia (Panyukova et al. 2014). Data about Siberia are available in Flora Baikal Siberia (Abdrahimov et al. 2011), Biodiversity of animals and plants of Siberia (Koropachinsky et al. 1999), and Biodiversity of Altai-Sayan Ecoregion (2016).
Dynamically updated maps (based on the OSM web service) are provided by the Cryptogamic Russian Information System, CRIS (Melechin et al. 2013), one of the most successful developments of its kind. The system has been developed as a tool for convenient storage, organization, integration, visualization, and analysis of data on the biodiversity of cryptogams. Currently, data from the Polar-Alpine Botanical Garden-Institute of N.A. Avrorin RAS herbarium collection (KPABG) and literature data (mainly for the Murmansk Region) are included in CRIS. The system is developed entirely with open-source software on a multiuser platform. Registered users can upload primary data. A special controlled vocabulary is used for the description of species occurrences. Apart from the general taxonomic and georeferencing terms, its terms describe features specific to cryptogams, such as substrate type. Custom queries with different search criteria can be created. Maps of the occurrences are also available to users. Part of this information is published through gbif.org (Table 4: doi: 10.15468/nctfm2, 10.15468/80tu83, 10.15468/yxt7co).
Since the 1990s, biodiversity inventories and surveys, especially of rare species, have been carried out by nongovernmental organizations in Russia. The data from such initiatives are often more easily accessible than data from the RAS institutes. The crowdsourcing project web-GIS Birdwatching (2016) is developed by the Siberian Environmental Center (Novosibirsk). This is an open database: any registered user can upload or download data. Users can upload and store their data on bird species occurrences and also add vector layers to the map system. Data can be uploaded as CSV, KML/KMZ, ESRI shapefile, and MapInfo files and as doc files (reports). The system supports custom queries. Data collected through Birdwatching were used in at least 12 publications in Russian and international journals (Table 1).
Another category of regional biodiversity data sources is spatial information systems, such as the one on animal and plant species of Khanty-Mansi Autonomous Okrug, developed by NextGIS Ltd. (UgraBio. Information system of biodiversity of Ugra 2016). The information system is designed for management tasks, such as checking for the presence of red-listed species in specific areas, and for supporting preliminary scientific inquiries, such as modeling the ranges of rare species and assessing the degree of rarity of a particular species. The main objective of the application is to show species locations and allow visualization and quick editing of data. Besides locations, the system allows automatic creation of species ranges from annotated lists (lists of species for a specific area, not necessarily a point) assigned to grid cells. Currently, the database includes information about occurrences of Protozoa, Fungi, Plantae, and Animalia. Data can be added to the system or downloaded by registered users.
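The grid-based range building mentioned above can be illustrated with a minimal sketch. It is not the UgraBio implementation: the one-degree cell size, the function names `grid_cell` and `species_ranges`, and the sample species names are all assumptions made for illustration. Point occurrences are assigned to the cell that contains them, and annotated lists contribute every listed species to the cell they are attached to.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=1.0):
    """Return the (row, col) index of the grid cell containing a point.

    The regular latitude/longitude grid and its cell size are assumptions.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def species_ranges(point_records, list_records, cell_deg=1.0):
    """Build {species: set of grid cells} from two kinds of input.

    point_records: iterable of (species, lat, lon) tuples.
    list_records:  iterable of (cell, [species, ...]) pairs, where an
                   annotated species list is attached to a whole cell.
    """
    ranges = defaultdict(set)
    for species, lat, lon in point_records:
        ranges[species].add(grid_cell(lat, lon, cell_deg))
    for cell, species_list in list_records:
        for species in species_list:
            ranges[species].add(cell)
    return dict(ranges)


# Hypothetical sample data: one point record and one annotated list.
points = [("Species A", 61.2, 69.7)]
lists = [((61, 70), ["Species A", "Species B"])]
ranges = species_ranges(points, lists)
```

With this layout, a species known only from an annotated list still receives a range, at the resolution of the grid rather than of a point.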
It is noteworthy that while all three open systems support the upload and download of data by registered users, the rules and licenses for the citation of the data are not always described. Each of these information systems uses its own data standards. Apparently, these standards have been developed based on the characteristics of the target taxa and the specific project goals. Use of different standards complicates the interoperability of the systems.
The Darwin Core standard, DwC (Wieczorek et al. 2012), is a leading global standard for biodiversity data. This standard is followed by the major international biodiversity information systems such as GBIF, EoL, ORNIS (2016), and many others.
To the best of our knowledge, only one Russian database has been created using the DwC standard: the database of Lobaria pulmonaria occurrences in Russia (Shashkov and Ivanova 2012; Lobaria pulmonaria in Russia Information System 2016). This online database documents the rare lichen Lobaria pulmonaria in Russia. The database comprises data from the literature, herbarium collections, open databases, the authors' own field data, and personal communications of researchers and is aimed at supporting modeling of the population dynamics of Lobaria pulmonaria. Detailed descriptions are available for field records but are missing for many herbarium- and literature-based records. The database is implemented in the open-source object-relational database management system PostgreSQL (2016). For a detailed description of the Lobaria pulmonaria occurrences, about 60 DwC terms were selected. In addition, a number of non-DwC terms were suggested for detailed description of Lobaria pulmonaria findings (Fig. 1). Both DwC and non-DwC terms were structured into vocabularies and work tables. Five vocabularies were formed: three for administrative divisions (countries, regions, and administrative districts), one for a text description of georeferencing accuracy, and one for a list of host tree species. A few tables were "updatable vocabularies" to which new records are added in the course of working with the database: names of the data sets, collections, and bibliographic references. The detailed description of an occurrence was combined into three logical parts: (1) description of the location, (2) description of the habitat, and (3) description of the host tree and Lobaria pulmonaria population. If exact georeferencing was possible (e.g., with a GPS navigator), one location (point) corresponded to one biotope (habitat) and to one or more occurrences (Fig. 2A).
If the georeferencing was not exact, typically because it was based on a text description without geographic coordinates, a location may correspond to multiple habitats (Fig. 2B).
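The three-part description (location, habitat, occurrence) and the rule linking georeferencing accuracy to the number of habitats can be sketched as a small data model. This is an illustration only, not the schema of the Lobaria pulmonaria database: the class and field names are hypothetical, and treating "exact" as "coordinates are present" is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Occurrence:
    occurrence_id: str
    host_tree: Optional[str] = None  # would come from the host tree vocabulary


@dataclass
class Habitat:
    description: str
    occurrences: List[Occurrence] = field(default_factory=list)


@dataclass
class Location:
    name: str
    latitude: Optional[float] = None   # None when only a text description exists
    longitude: Optional[float] = None
    habitats: List[Habitat] = field(default_factory=list)

    @property
    def is_exact(self) -> bool:
        # Assumption: a location counts as exact when both coordinates are known.
        return self.latitude is not None and self.longitude is not None

    def validate(self) -> None:
        # An exact point maps to one habitat; an inexact location may hold several.
        if self.is_exact and len(self.habitats) > 1:
            raise ValueError("an exactly georeferenced location holds one habitat")


# Hypothetical records: an exact point and a text-only location.
exact = Location("GPS point", 61.5, 34.3,
                 [Habitat("old-growth forest",
                          [Occurrence("occ-1", "Populus tremula"),
                           Occurrence("occ-2")])])
vague = Location("vicinity of a village",
                 habitats=[Habitat("spruce forest"), Habitat("aspen stand")])
exact.validate()
vague.validate()
```

Keeping location, habitat, and occurrence as separate levels is what allows one point to carry several occurrences without repeating the site description.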
The corresponding data set was published through gbif.org (Table 4: doi: 10.15468/uennht). The data set is dynamically connected to the source database through an SQL query, which greatly simplifies work with the data compared with loading CSV files. For the publication of the data contained in the database, the Integrated Publishing Toolkit (IPT) installation of the Institute of Mathematical Problems of Biology RAS (IMPB) was used (Russian GBIF IPT 2016). Through such a setup, all data from the resource database Lobaria.ru (occurrence map and viewing of attribute information of findings) are also available through gbif.org, and all updates in the database are rapidly reflected in the data set on the global portal, which also allows for data downloads and issues digital object identifiers (DOIs) for each download. In 2016, the online version of the system contained data on more than 1200 occurrences of Lobaria pulmonaria.
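The idea of exposing a source database through an SQL query whose columns are named after Darwin Core terms can be sketched as follows. This is a hedged illustration, not the query used for the published data set: the tables `location` and `finding`, their columns, the sample row, and the `example:` identifier prefix are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so that the example is self-contained. The column aliases are standard Darwin Core terms.

```python
import csv
import io
import sqlite3

# Hypothetical source tables, with SQLite standing in for the real DBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (id INTEGER PRIMARY KEY, lat REAL, lon REAL, region TEXT);
CREATE TABLE finding (id INTEGER PRIMARY KEY, location_id INTEGER,
                      found_on TEXT, collector TEXT);
INSERT INTO location VALUES (1, 61.5, 34.3, 'Example region');
INSERT INTO finding VALUES (10, 1, '2012-07-15', 'A. Collector');
""")

# Each local column is aliased to a Darwin Core term.
DWC_QUERY = """
SELECT 'example:' || f.id   AS occurrenceID,
       'HumanObservation'   AS basisOfRecord,
       'Lobaria pulmonaria' AS scientificName,
       f.found_on           AS eventDate,
       f.collector          AS recordedBy,
       l.lat                AS decimalLatitude,
       l.lon                AS decimalLongitude,
       l.region             AS stateProvince
FROM finding f JOIN location l ON l.id = f.location_id
ORDER BY f.id
"""


def export_dwc(connection):
    """Run the mapping query and return the result as CSV text."""
    cur = connection.execute(DWC_QUERY)
    header = [col[0] for col in cur.description]  # the aliases, i.e. DwC terms
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cur.fetchall())
    return buf.getvalue()


dwc_csv = export_dwc(conn)
```

Because the mapping lives in the query, new rows in the source tables appear in the next export without any manual file preparation, which is the advantage over CSV loading noted above.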
Despite some progress in promoting GBIF and data mobilization in Russia, a national biodiversity information system is still lacking, but the existing resource on Lobaria pulmonaria distribution can serve as a prototype of a database component of this system. Only a minor redesign of the Lobaria pulmonaria database structure would allow scaling up to distribution data on other taxonomic or ecological groups. In principle, the schema of three logical blocks, location-habitat-occurrence, is already applicable to other taxa. For compatibility with GBIF, it will be necessary to use the GBIF Backbone Taxonomy (GBIF Secretariat: GBIF Backbone Taxonomy 2016). A vast amount of biodiversity data in Russia is not digitized and is typically restricted in access. There is a significant overlap in the topics of the individual databases, which, combined with the differences in data format, blocks data exchange between different resources. A general lack of maintenance of project databases results in a very short lifetime of these potentially very valuable data products. Commercial web hosting can be considered good practice; the most advanced systems in our review use such an approach for data hosting, improving maintenance and software upgrades of the resource. At the same time, several systems demonstrate successful examples of open multiuser databases.

Taxonomic databases
Among taxonomic databases, "Flora of vascular plants in the Central European Russia" (developed at the IMPB) and the family of information systems of the Zoological Institute RAS based on the ZOOCOD standard are the best known (Table 2).
The database "Flora of vascular plants in the Central European Russia" (Zaugol'nova and Khanina 1996) was developed for generalization and standardization of taxonomic lists used in different regions of central Russia. Species checklists are included for vascular plants from Moscow, Smolensk, Tver', Yaroslavl', Vladimir, Kostroma, Ivanovo, Ryazan', Tula, Kaluga, and Bryansk regions. For the packaging of systematic data in the relational structure, a special code was developed. Each species has a unique code consisting of nine digits: the first three represent the family, the next three the genus within the family, and the last three the species within the genus. This nine-digit code is associated with a table of synonyms and reference tables on the ecology of individual species. More than 120 literature sources were used for the creation of the database. The web interface of the database was developed in 2004 and included a checklist of species according to Cherepanov (1995) and synonyms and biological and ecological characteristics of more than 2300 plant species (Flora of vascular plants in the Central European Russia 2016). The species checklist is available through gbif.org (Table 4: doi: 10.15468/96gqtn).
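The structure of the species code described above (three positions each for family, genus within the family, and species within the genus) can be illustrated with a short parsing sketch. It assumes that all nine positions are decimal digits and that each part is zero-padded; the names `TaxonCode`, `parse_code`, and `format_code` and the sample code value are hypothetical and are not taken from the database itself.

```python
from typing import NamedTuple


class TaxonCode(NamedTuple):
    family: int   # positions 1-3
    genus: int    # positions 4-6, numbered within the family
    species: int  # positions 7-9, numbered within the genus


def parse_code(code: str) -> TaxonCode:
    """Split a nine-digit code into its family, genus, and species parts."""
    if len(code) != 9 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected nine digits, got {code!r}")
    return TaxonCode(int(code[0:3]), int(code[3:6]), int(code[6:9]))


def format_code(taxon: TaxonCode) -> str:
    """Rebuild the zero-padded nine-digit code from its parts."""
    return f"{taxon.family:03d}{taxon.genus:03d}{taxon.species:03d}"


# Hypothetical code value, used only to show the round trip.
taxon = parse_code("012003045")
```

Since the code is positional, all species of one genus share a six-digit prefix, so a prefix match on the code column is enough to select a genus or a family in a relational table.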
The local taxonomic standard ZOOCOD was developed at the Zoological Institute RAS. The construction of ZOOCOD is detailed by Lobanov and Smirnov (1997). Each specimen has a unique code that describes its systematic position. The classifier concept was developed to represent a taxonomic hierarchy at any level of detail in relational databases. All of the taxonomic information systems of the Zoological Institute RAS are based on ZOOCOD. The main resources of the Zoological Institute RAS are the ZOOlogical INTegrated retrieval system, ZooInt, the Russian information system Biodiversity of Animals, ZooDiv (Biodiversity of animals. Russian information system 2016), the information system Biodiversity in Russia (2016), and Taxonomy and collections: Interactive database of world insect fauna (2016). These and other developments of the Zoological Institute RAS summarize data about the taxonomy, biology, and bibliography of different groups of animals, protists, prokaryotes, fungi, and partially plants. For example, ZooDiv unites 32 systematic databases (>90 000 species) (Biodiversity of animals. Russian information system 2016).
The ZOOCOD standard has been successfully used outside of the Zoological Institute RAS: in the Botanical Institute RAS, the Institute of Ecology and Evolution RAS, at Moscow State University, at Nizhny Novgorod State University, and others.
Arctic Science Vol. 3, 2017
Internationally, the Catalogue of Life, COL (2016), is one of the most common basic taxonomic sources for the development of biodiversity databases. However, the Catalogue of Life is not yet complete and covers only 84% of world diversity. Many species recorded in Russia are currently missing from the COL database, especially endemics of Russia and the former USSR. The integration of Russian species checklists, already summarized in relational databases, into the Catalogue of Life (and as a result into the GBIF Backbone Taxonomy) would significantly expand its coverage of species diversity and would provide the critical taxonomic foundation for the development of Russian biodiversity databases.

Digitized collections
The majority of Russian botanical and zoological collections are not digitized. However, a number of digitization projects in the country's largest collections have been launched recently (Table 3; Table 4: doi: 10.15468/nt9emp, 10.15468/cm3n7s). Data on fungi, hepatics, lichens, and mosses from the KPABG collection are available through CRIS (Melechin et al. 2013). Some data on the moss herbarium specimens from different Russian collections are available on the Arctoa web site (Arctoa. Project 'Flora of mosses of Russia' 2016). Last year, work began on digitizing the collection funds of the Zoological Institute RAS. Today, specimens of Pogonophora, Coleoptera, Lepidoptera, Siphonaptera (fleas), Ophiuroidea, Reptilia, and Mammalia are available online (Digitized Research Collections of the Zoological Institute RAS 2016). Generalized label data and the original label images as well as images of specimens are available (Zoological Institute of Russian Academy of Science 2016).
Digitization of Russian botanical and zoological collections is a very important activity for the global assessment of species diversity and distribution. According to the portal Genetic and biological (zoological and botanical) collections of the Russian Federation (2016), 148 herbarium collections from 102 cities were present in Russia in 2004. The collection of vascular plants of the LE herbarium contains more than 6 million specimens. Many Russian universities and scientific organizations have their own herbarium collections. The Herbarium of Tomsk State University (500 000 specimens), the Herbarium of Institute of Biology Komi Scientific Centre (180 000 specimens of vascular plants, 40 500 specimens of mosses, 18 000 specimens of lichens), the KPABG Herbarium (100 000 specimens), and the Herbarium of Institute of Biology of Inland Waters (>33 000 specimens of water and coastal water plants) are among the most significant collections. Almost all Russian nature reserves and some regional museums also have their own herbarium collections. The Zoological Institute RAS has one of the largest zoological collections in the world, with more than 60 million specimens.

Russian data available on gbif.org and Russian GBIF community activities
More than 1.6 million species occurrence records from Russia have been published through gbif.org (1035 data sets). About 95% of these data were published by institutions outside Russia, mostly from the United Kingdom, United States, and Estonia. The first data set from Russia was published through gbif.org by the Zoological Institute RAS in 2011 (Table 4: doi: 10.15468/c9g3nw). Since 2014, Russian publishers have made available about 140 000 species occurrences, not only for Russia but also for dozens of other countries and territories. At the time of writing, about 97 000 records for Russia in 15 occurrence data sets and one checklist data set were published through gbif.org by a few Russian institutions (Table 4).
Most of the data have been published by the largest Russian data holders: the Zoological Institute RAS, A.N. Severtsov Institute of Ecology and Evolution RAS, and Moscow State University.
Data mobilization through GBIF is carried out through four Russian IPT installations, and half of the data sets are published through an IPT installation hosted by the IMPB. Even though the institute is not a large data holder, this is currently the most active technical support hub for data publishing through gbif.org for Russian institutes. This IPT installation (Russian GBIF IPT 2016) is associated with the information web site gbif.ru (2016). This resource contains information about the structure and functioning of the gbif.org portal, Darwin Core standards specification in Russian, and information about events connected with GBIF. gbif.ru is a base for collection and generalization of metadata information about Russian resources on biodiversity.
gbif.ru is a very important source for informing the Russian research community about the use of data standards and data mobilization. Important activities include workshops, which were organized by the IMPB in 2015 and 2016. The publication of Russian-language articles about modern data standards (Ivanova and Shashkov 2014; Grebennikov 2016) and a mostly completed translation of the IPT into Russian (mainly by the staff of the Institute of Biology of Komi Republic) will help mobilize biodiversity data in Russia through gbif.org.

Conclusion
Considerable experience in biodiversity informatics has accumulated in Russia, but a nationwide portal on biodiversity is lacking. This and an earlier review (Ivanova and Shashkov 2014) as well as a review of the information systems used in Russian nature reserves (Grebennikov 2016) suggest that the creation of a national portal is necessary and should be based on the international Darwin Core standard. Creation of a national GBIF node in Russia depends on formal participation of the Russian Federation in GBIF through signature of the GBIF Memorandum of Understanding (2010) and would help protect data authorship and contribute to national and global biodiversity science.