Data Import and Validation in the Inorganic Crystal Structure Database

This paper outlines the input procedures for the Inorganic Crystal Structure Database (ICSD). The input flow of the data is explained. Since the data are excerpted from journal articles, a bibliometric analysis of the relevant literature is presented. The types of data and the form in which they are recorded are discussed. Finally, the importance of data checking is illustrated and the data checking procedures are described in detail.


Introduction
This paper describes how data are selected, obtained, recorded and checked for the Inorganic Crystal Structure Database (ICSD), which was originally developed at the University of Bonn and is now produced by FIZ Karlsruhe and the Gmelin Institute. A general description of the ICSD database itself and of the corresponding retrieval tools has already been given by Prof. E. Fluck in his presentation [E. Fluck, J. Res. Natl. Inst. Stand. Technol. 101, 217 (1996)] and will not be repeated here.
For data to be included in ICSD, the following selection criteria are applied. Data are taken into account from all compounds which
• have no C-C and/or C-H bonds in any residue, and
• contain at least one of the nonmetallic elements H(D), He, B, C, N, O, F, Ne, Si, P, S, Cl, Ar, As, Se, Br, Kr, Te, I, Xe, At, and Rn.
As far as the data themselves are concerned, metadata, bibliographic data, crystal structure data and related parameters, as well as properties are included.
The next sections give details of the input flow paths, some bibliometric considerations concerning the journal articles from which the data are taken, and details of the data recording procedure. Special emphasis is also laid on the validation of the data by automatic as well as manual checking. An earlier description of ICSD in this context can be found in Ref. [1].
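The two selection criteria above can be read as a simple predicate. The following is only a minimal sketch: the function name `is_icsd_candidate` and the two bond flags are hypothetical and are assumed to be supplied by whoever excerpts the article. The sketch does not describe how the actual ICSD software decides.

```python
# Nonmetallic elements named in the selection criteria; "D" is listed
# because the text writes H(D), i.e. deuterium counts as hydrogen.
NONMETALS = frozenset({
    "H", "D", "He", "B", "C", "N", "O", "F", "Ne", "Si", "P", "S", "Cl",
    "Ar", "As", "Se", "Br", "Kr", "Te", "I", "Xe", "At", "Rn",
})


def is_icsd_candidate(elements, has_cc_bond, has_ch_bond):
    """Return True if a compound passes both selection criteria.

    elements     -- iterable of element symbols present in the compound
    has_cc_bond  -- True if any residue contains a C-C bond (hypothetical flag)
    has_ch_bond  -- True if any residue contains a C-H bond (hypothetical flag)
    """
    # Criterion 1: no C-C and/or C-H bonds in any residue.
    if has_cc_bond or has_ch_bond:
        return False
    # Criterion 2: at least one of the listed nonmetallic elements.
    return any(symbol in NONMETALS for symbol in elements)
```

Under these assumptions, a carbonate such as CaCO3 (no C-C or C-H bonds) would be accepted, whereas a purely intermetallic phase such as a Cu-Zn alloy would not.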

Data Flow
The data flow is schematically represented in Fig. 1. Three paths are of importance in this context. The first is the classical path, in which the data stored in the database are taken from journals. It is still the most important one: the overwhelming majority of the input reaches the database in this way. The second is a more modern path in which the data are transmitted electronically by the authors. This path is very new and may become increasingly important in the future. Because the data are transmitted directly by the authors, the data flow from the original measurement is not interrupted by printing and re-keying of the figures. Thirdly, in the near future we will receive data from publishers in electronic form. A first agreement has already been concluded with the IUC in this context.
A more detailed description of the input flow, which shows how the information is obtained for ICSD at present, is given in Fig. 2.
As mentioned before, most of the data are taken from journal articles. Most of these are scanned in-house: the relevant articles are marked up, an ICSD number is assigned to each entry, and the data are excerpted and keyboarded. The data are then checked by computer and manually. Finally the products (CD-ROM, online, magnetic tape) are created. For journals which contain only a few articles with relevant data, searches in bibliographic databases are carried out and the original documents are then ordered. The subsequent procedure is the same as just explained. In some cases users inform us of missing data, which we then add. We also cooperate with the Institute of Crystallography of the Russian Academy of Sciences in Moscow, which delivers data to us in machine-readable form. These data are checked again at FIZ Karlsruhe.
For a number of journals, data which are not printed are deposited at FIZ in electronic form. Details are shown in Fig. 3. The authors transmit these data by e-mail over the Internet to a mailbox at FIZ, where they are stored. The relevant data are selected and converted to the ICSD input format for further processing. Further data, for instance the volume number and pages, are added manually. These data also have to pass the checking procedure mentioned above.

Bibliometrics
As noted above, practically all data originate from journal articles. A more detailed bibliometric analysis of the ICSD content may therefore be of interest to prospective users. Fig. 4 shows the development of the cumulative number of entries over time (publication year of the articles containing the data) on a semilogarithmic scale. We immediately recognize the exponential growth of the total number of crystal structures measured in inorganic chemistry up to now. The doubling time, which is 10.4 years, has nearly the same value as the doubling time for publications in physics and chemistry. At present, about 2000 entries are added to ICSD per year.
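To make the quoted doubling time more tangible, it can be converted into an annual growth factor. The sketch below assumes ideal exponential growth, N(t) = N(0) · 2^(t/T) with T = 10.4 years. The function names are illustrative only and are not taken from the ICSD software.

```python
DOUBLING_TIME_YEARS = 10.4  # doubling time quoted in the text


def growth_factor(years, doubling_time=DOUBLING_TIME_YEARS):
    """Factor by which the cumulative number of structures grows in `years`,
    assuming ideal exponential growth with the given doubling time."""
    return 2.0 ** (years / doubling_time)


def annual_growth_percent(doubling_time=DOUBLING_TIME_YEARS):
    """Equivalent growth per year, in percent."""
    return (growth_factor(1.0, doubling_time) - 1.0) * 100.0
```

Under this assumption a doubling time of 10.4 years corresponds to a growth of roughly 6.9 % per year in the cumulative number of published structures.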
In the next figure (Fig. 5) the Bradford distribution for ICSD is presented. Here, the cumulative number of entries as a percentage of the total is plotted as a function of the number of journals containing the data. The journals are ranked in decreasing order of productivity. The scale is semilogarithmic. The total number of entries at present is 38 869. The following conclusions can be drawn from the Bradford distribution for ICSD.
This means that if one further journal is taken into account,
• for x = 3, 6.95 % of the entries are added to ICSD,
• for x = 20, 1.04 % of the entries are added to ICSD,
• for x = 50, 0.42 % of the entries are added to ICSD.
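The three marginal contributions quoted above can be checked for internal consistency. For a Bradford-type distribution that appears as a straight line on a semilogarithmic plot, the cumulative percentage behaves like y = a + b ln x, so one further journal at rank x adds approximately b/x. The product of x and the added percentage should then be roughly constant. The sketch below performs only this arithmetic. The logarithmic form is an assumption made here for illustration, not a formula taken from the text.

```python
# (x, percentage of entries added by one further journal), as quoted in the text
MARGINAL_CONTRIBUTIONS = [(3, 6.95), (20, 1.04), (50, 0.42)]


def implied_slopes(points):
    """Return x * (added percentage) for each point.

    If the cumulative curve is y = a + b*ln(x), each product approximates b.
    """
    return [x * added for x, added in points]


slopes = implied_slopes(MARGINAL_CONTRIBUTIONS)
mean_slope = sum(slopes) / len(slopes)
```

The three products lie close together, at about 21, which is what one would expect if the quoted percentages were derived from a single logarithmic fit.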
One also immediately recognizes that about 50 % of the entries come from only 10 journals. In Fig. 6, the 15 journals with the largest number of entries are shown as a bar chart for the total content of the database, together with their percentage contributions to the total number of entries; they already represent 61.2 % of the total content. Over the years, however, the contributions of the different journals have changed. Therefore, Fig. 7 shows the same diagram for entries originating from articles of the publication years 1990-1993, in order to describe the present situation; here the journals shown already represent 71.5 % of the total input for these years. At present the input per year for ICSD originates from 100-200 journals. Of these, 21 journals are regularly scanned at FIZ Karlsruhe.
For input considerations it may also be of interest to know the number of entries per journal article. Fig. 8 shows how many articles contain how many entries for the publication year 1992. For example, about 60 % of the articles contain only one entry. In a number of cases one compound has been investigated several times, so the database contains more than one entry for a number of compounds. This is demonstrated in Fig. 9.

Data Recording
How and in what form the data are recorded will be elucidated in the following. This can best be done by explaining the input record structure in detail. Here, all fields are listed which make up an input record. Then the field contents are described, followed by an example for each case. In fields 7 and 9, the numbers following each ± symbol represent the estimated standard deviation. Such an input record consists of the following fields:
xxxxx represents the special COL-number of the entry under consideration. For didactic reasons the contents are decoded in some way. Standard deviations are connected with a + sign only. An example of an input record is given in Table 1.
The following software is used in the context of data recording: for administration (literature acquisition, duplication check, input status) a specially developed program, for keyboarding SPF and Coledit, and for data checking R-Test and Coledit. The software makes the recording easy by applying predefined masks for the fields in which the data must be entered.

Data Checking
Data validation is a very important, even essential point in the whole input procedure. Various careful data checks have to be taken into consideration in this context. Here, in a first step, data checking by computer is applied as far as possible. For this purpose, use is made of formal checking procedures, of plausibility considerations, of constraints following from mathematical and physical laws and of the fact that redundant data have to be consistent. The latter point is illustrated in Fig. 10 where the most important relations which are used for checks are summarised.
• Validity of multiplicity: The multiplicity is adjusted to the coordinates. Then it is checked for consistency.
• Plausibility of interatomic distances: The distances are calculated on the basis of the atomic coordinates and of cell parameters, and are then compared with the distances estimated from the ionic radii of atoms.
• Validity of electroneutrality: The total charge must be zero.
• Validity of molecular formula: The molecular formula is calculated from atomic parameters, site occupation and site multiplicity, and compared with the corresponding formula given by the author.
• Comparison of calculated and measured densities: The density calculated on the basis of molecular formula and unit cell dimension must agree with the measured density within certain limits.
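Three of the checks listed above rest on standard crystallographic relations and can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in R-Test or Coledit: all function names are hypothetical, the distance routine ignores symmetry-equivalent and periodic-image atoms, and the NaCl values in the usage note are illustrative numbers that do not come from the text.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol


def _cosines(alpha, beta, gamma):
    """Cosines of the three cell angles, which are given in degrees."""
    return tuple(math.cos(math.radians(angle)) for angle in (alpha, beta, gamma))


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in cubic angstroms for a general (triclinic) cell."""
    ca, cb, cg = _cosines(alpha, beta, gamma)
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def distance(frac1, frac2, a, b, c, alpha, beta, gamma):
    """Distance in angstroms between two sites given in fractional coordinates.

    Uses the metric tensor of the cell; periodic images and symmetry
    operations are deliberately not considered in this sketch.
    """
    ca, cb, cg = _cosines(alpha, beta, gamma)
    dx, dy, dz = (q - p for p, q in zip(frac1, frac2))
    d2 = ((dx * a) ** 2 + (dy * b) ** 2 + (dz * c) ** 2
          + 2 * dx * dy * a * b * cg
          + 2 * dx * dz * a * c * cb
          + 2 * dy * dz * b * c * ca)
    return math.sqrt(d2)


def net_charge(sites):
    """Total charge per cell; `sites` holds (oxidation state, multiplicity, occupancy)."""
    return sum(ox * mult * occ for ox, mult, occ in sites)


def calculated_density(z, molar_mass, volume):
    """Density in g/cm^3 from formula units per cell, molar mass (g/mol)
    and cell volume (cubic angstroms; 1 cubic angstrom = 1e-24 cm^3)."""
    return z * molar_mass / (AVOGADRO * volume * 1e-24)


# Illustrative rock-salt example (values assumed here, not from the text).
NACL_CELL = (5.64, 5.64, 5.64, 90.0, 90.0, 90.0)
nacl_distance = distance((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), *NACL_CELL)
nacl_charge = net_charge([(+1, 4, 1.0), (-1, 4, 1.0)])
nacl_density = calculated_density(4, 58.443, cell_volume(*NACL_CELL))
```

In this example the Na-Cl distance is half the cell edge, the charges cancel, and the calculated density comes out near 2.16 g/cm^3. A real check would compare such values with distances estimated from ionic radii and with the measured density, using tolerances that the text does not specify.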
In the process of data checking by computer, all errors detected by the checking programs (R-Test, Coledit) are corrected as far as possible. An example of an input record with the corresponding checking diagnostics is shown in [...]. As one can easily see, these test flags contain certain warnings which might be very useful in some cases.
Last but not least, it should also be mentioned in this context that some information stored in the database is automatically generated by computer by making use of the data already entered. These are the so-called implicit descriptors which are listed in the following:
• Crystal system (SYST)
Of course, in the database the implicit descriptors can be searched for in the same way as the other data.
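As an illustration of how an implicit descriptor such as the crystal system can be generated from data already entered, the sketch below derives it from the space group number using the conventional numbering of the 230 space groups. The text does not say which stored field ICSD actually uses for this purpose, so the choice of the space group number as input, and the function name `crystal_system`, are assumptions.

```python
import bisect

# Highest space group number belonging to each crystal system
# (conventional numbering 1-230).
_UPPER_BOUNDS = [2, 15, 74, 142, 167, 194, 230]
_SYSTEMS = [
    "triclinic", "monoclinic", "orthorhombic", "tetragonal",
    "trigonal", "hexagonal", "cubic",
]


def crystal_system(space_group_number):
    """Return the crystal system for a space group number between 1 and 230."""
    if not 1 <= space_group_number <= 230:
        raise ValueError("space group number must be between 1 and 230")
    # bisect_left finds the first range whose upper bound is >= the number.
    return _SYSTEMS[bisect.bisect_left(_UPPER_BOUNDS, space_group_number)]
```

A descriptor derived in this way needs no keyboarding and cannot contradict the space group entry, which is presumably the advantage of generating such fields automatically.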
It is clear that data checking by computer must always be supplemented by manual checking. For ICSD the manual checking consists of
1) checking of the relevance of the special entry
2) checking of the chemical nomenclature (according to the IUPAC rules)
3) checking of the mineralogical nomenclature (according to the IMA convention) and of the phase designations
4) special evaluation of the diagnostics of the checking programs for the following topics
• oxidation state
• space group
• unit cell parameters
• atomic coordinates
5) further checking of
• bibliographic records
• formula structure
• site occupations
• remarks
In conclusion, the prospective user may now have an impression of the input policy of the ICSD database. Input policy always has to find the right balance between cost-effectiveness and the quality of a database as determined by completeness, accuracy and currency. I have tried to demonstrate what kind of efforts are made for the ICSD database.