Evolution of DFT studies in view of a scientometric perspective

Background: This bibliometric study aims to analyze the publications in which density functional theory (DFT) plays a major role. The bibliometric analysis is performed on the full publication volume of 114,138 publications as well as sub-sets defined in terms of six different types of compounds and nine different research topics. Also, a compound analysis is presented that shows how many compounds with specific elements are known to be calculated with DFT. This analysis is done for each element from hydrogen to nobelium.
Results: We find that hydrogen, carbon, nitrogen, and oxygen occur most often in compounds calculated with DFT in terms of absolute numbers, but a relative perspective shows that DFT calculations were performed rather often in comparison with experiments for rare gas elements, many actinides, some transition metals, and polonium.
Conclusions: The annual publication volume of DFT literature continues to grow steadily. The number of publications doubles approximately every 5–6 years while a doubling of publication volume every 11 years is observed for the CAplus database (14 years if patents are excluded). Calculations of the structure and energy of compounds dominate the DFT literature.

Correspondence: r.haunschild@fkf.mpg.de
Individual researchers in the field of DFT have a qualitative overview of the publications related to DFT and the compounds computed with DFT, but a quantitative overview can only be obtained using bibliometric methods. Although there is considerable interest in the evolution of the annual publication volume in the field of DFT [56,57], no detailed bibliometric study of DFT publications has been published so far. We intend to fill this gap with this study.
Bibliometrics, or the broader term scientometrics (both terms are often used synonymously), can be characterized as the discipline that treats science quantitatively [58,59]. Publication and citation numbers are the most important items and have become the basis of bibliometric indicators for research evaluation purposes. In many disciplines, particularly in chemistry, physics, and materials science, chemical compounds (substances) play a major role. In a previous paper [60], we have extended the bibliometric method and defined compound-based (chemical) bibliometrics as a new research field. The method can be applied to analyze large numbers of publications and compounds in combination with the corresponding chemical concepts: we can establish the time evolution of the publications dealing with concepts or methods and reveal the related compounds or compound classes. Furthermore, mapping method-related compounds by establishing element-based landscapes has some potential to illustrate the compound basis of research topics.
Reference Publication Year Spectroscopy (RPYS) is a bibliometric method which can be used to locate seminal papers which are cited most frequently in a certain publication set [61]. The method is based on the analysis of cited references (i.e., the number of times a specific reference is included in the reference lists) in published papers of certain scientific fields. Researchers in the field can answer the question about seminal papers only subjectively. RPYS can answer this question in an objective way by asking all researchers in the field (via the cited references in their publications) with subsequent quantitative analysis. Therefore, RPYS results often provide a different perspective or complement the individual expert's perspective on the field.

Methods
Our analysis is based on the search and retrieval functions of the databases offered by Chemical Abstracts Service (CAS), a division of the American Chemical Society (ACS). The CAS literature database (Chemical Abstracts Plus, CAplus℠) covers scientific publications and patents since around 1900 (including the references cited therein since the publication year 1996). The CAS compound database (Registry℠) contains all chemical species mentioned within the publications in chemistry and related fields, identified and registered by the CAS Registry system. All compound records are associated with a unique CAS Registry number. These items (publications and compounds) are called documents or records. We used both databases via the new platform of the Scientific and Technical Information Network (STN®) International. Both databases are connected to each other via Registry numbers® (RNs). The content of both databases is also accessible with SciFinder®. However, the STN platform provides more detailed search and analysis possibilities.
The CAplus publication records contain index terms (ITs, keywords carefully selected and assigned by the database producer CAS). We searched for the terms "DFT", "density functional theory", "d functional theory", and "TDDFT" in the IT fields of the CAplus database. Occurrences of "TD-DFT" and "time-dependent density functional theory" are also found by our aforementioned search terms. The search term "d functional theory" is not used by scientists using DFT but it is used by CAS indexers. In total, we found 114,138 documents published before the end of the year 2014 (at the date of searching the year 2015 was not completely covered by the database). Throughout this paper, we will refer to this set of 114,138 documents as "all DFT publications". Although indexing takes some time, we can expect that the publication years until 2014 are nearly complete. 102,880 documents (90.1 %) have at least one connection to a Registry compound record. Throughout this paper, we will refer to this set of 102,880 documents as "substance-related DFT publications". The compounds with at least one connection from a Registry to a CAplus record will be referred to as "DFT-related compounds". The remaining 9.9 % of the documents are either concerned with methodological developments or the calculated substances are not a major concern of the document.
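The selection logic described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `Record` type and simple case-insensitive substring matching; the real search was run on the STN platform, whose query syntax and record format are not reproduced here. The Registry numbers in the example are placeholders.

```python
from dataclasses import dataclass, field

# Search terms quoted in the text, lower-cased for case-insensitive matching.
SEARCH_TERMS = ("dft", "density functional theory", "d functional theory", "tddft")


@dataclass
class Record:
    """Hypothetical stand-in for a CAplus record (not the real STN format)."""
    title: str
    index_terms: list[str]
    registry_numbers: list[str] = field(default_factory=list)


def is_dft_record(record):
    """True if any search term occurs in the record's index-term (IT) text."""
    text = " ".join(record.index_terms).lower()
    return any(term in text for term in SEARCH_TERMS)


def split_publication_set(records):
    """Return (all DFT publications, substance-related DFT publications)."""
    dft = [r for r in records if is_dft_record(r)]
    substance_related = [r for r in dft if r.registry_numbers]
    return dft, substance_related


# Toy data; "RN-0001" and "RN-0002" are placeholder identifiers.
records = [
    Record("A", ["TD-DFT calcns. of excitation energies"], ["RN-0001"]),
    Record("B", ["d functional theory; exchange functional development"]),
    Record("C", ["Crystal structure of a zeolite"], ["RN-0002"]),
]
dft, substance_related = split_publication_set(records)
share = 100 * len(substance_related) / len(dft)

# The share reported in the text follows from the two set sizes.
reported_share = round(100 * 102_880 / 114_138, 1)
```

With substring matching, "TD-DFT" is caught by "dft" and "time-dependent density functional theory" by "density functional theory", which mirrors the remark that these variants are found by the listed search terms.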
We used the relationship between CAplus and Registry mainly to elucidate how often which elements are present in the corresponding compounds connected to DFT calculations. An example of the CAplus IT fields is shown in Table 1 using the document in Ref. [62].
For example, the first index term (IT1) shown in Table 1 contains the relevant compounds in the form of their RNs together with the controlled term "Properties (PRP)" in combination with the corresponding abbreviated author vocabulary ("Dewar-Chatt-Duncanson model reversed and bonding anal. of Ni, Pd, and Pt complexes [(PMe3)2M-EX3] with Group IIIA element E halide ligands EX3 from DFT-BP86 calcns."). This indicates that properties were calculated for the substances that correspond to the itemized RNs and are described by the abbreviated author vocabulary. The other IT fields contain additional combinations of controlled terms with abbreviated author vocabulary.
We use controlled terms supplied by the indexer (e.g., "Molecular structure", "Conformation", "Bond") to define sub-fields or topics within the corpus of DFT literature. The topics together with carefully selected index terms are presented in Table 2.
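As a rough illustration of how controlled terms can define topics, the following sketch maps a record's controlled terms to topic labels. The mapping shown is a hypothetical excerpt: the actual assignment is the one in Table 2, and both the placement of "Molecular structure", "Conformation", and "Bond" under "Structure" and the term "Potential energy surface" are assumptions made for the example.

```python
# Hypothetical excerpt of a topic definition; the real one is given in Table 2.
TOPIC_TERMS = {
    "Structure": {"Molecular structure", "Conformation", "Bond"},
    "Energy": {"Potential energy surface"},
}


def topics_for(controlled_terms):
    """Return every topic that shares at least one controlled term with the record."""
    terms = set(controlled_terms)
    return {topic for topic, vocab in TOPIC_TERMS.items() if terms & vocab}


def coverage(documents):
    """Percentage of documents assigned to at least one topic."""
    if not documents:
        return 0.0
    covered = sum(1 for terms in documents if topics_for(terms))
    return 100 * covered / len(documents)


example_topics = topics_for(["Conformation", "Potential energy surface"])
example_coverage = coverage([["Bond"], ["Catalysts"]])
```

A record can belong to several topics at once, so topic counts computed this way may overlap and need not sum to the size of the publication set.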
We also analyze the DFT publications with respect to seminal papers on which the DFT publications are based. Such seminal papers can be located using a bibliometric method called "Reference Publication Year Spectroscopy" (RPYS) [61] in combination with a recently developed tool named CRExplorer (http://www.crexplorer.net) [63]. The analysis of the publication years of the references cited by all the papers in a specific research field shows that (earlier) publication years are not equally represented. Some years occur particularly frequently among the references. The years appear as pronounced peaks in the distribution of the reference publication years (i.e. the RPYS spectrum). The peaks are frequently based on single early publications, which are highly cited compared to other early publications. The highly cited papers are usually of specific significance to the research field in question (here: DFT).
In a first step, the publication set is imported into the CRExplorer and all cited references are extracted. In a second step, equivalent references are clustered and merged. References below a threshold (here: 100 cited references) are removed to reduce the background noise and to sharpen the resulting spectrum. In the third and final step, the reference publication years are analyzed for frequently cited publications. We analyze the reference publication years (RPYs) between 1950 and 1990. RPYs more recent than 1990 are very problematic to analyze, and 1950 is a reasonable choice as the oldest RPY for the topic DFT. Furthermore, older RPYs require a slightly different methodology, i.e., a lower threshold for the number of cited references.
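The three steps can be sketched in a few lines. This is a simplified illustration, not the CRExplorer implementation: references are assumed to arrive as hypothetical `(reference_key, year)` tuples, and the clustering step is reduced to merging identical keys, whereas the real tool also merges variant spellings of the same reference.

```python
from collections import Counter


def rpys_spectrum(cited_refs, min_citations=100, first_year=1950, last_year=1990):
    """Count cited references per reference publication year (RPY).

    cited_refs: iterable of (reference_key, publication_year) tuples,
    one tuple per occurrence in a reference list.
    """
    # Step 2 (simplified): identical keys are treated as the same reference.
    ref_counts = Counter(cited_refs)
    spectrum = Counter()
    for (_key, year), n in ref_counts.items():
        # Drop rarely cited references and years outside the analyzed window.
        if n >= min_citations and first_year <= year <= last_year:
            spectrum[year] += n
    return dict(sorted(spectrum.items()))


# Toy input with hypothetical reference keys and a lowered threshold.
cited = (
    [("Hohenberg1964", 1964)] * 3
    + [("KohnSham1965", 1965)] * 2
    + [("Rare1970", 1970)]
    + [("Old1930", 1930)] * 5
)
spectrum = rpys_spectrum(cited, min_citations=2)
```

In the toy run, the reference cited only once and the reference from 1930 are both excluded, the first by the threshold and the second by the year window.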

Overall growth and growth in terms of topics
The overall annual publication volume since 1980 that is concerned with DFT is shown in Fig. 1. Note that 13 DFT relevant publications (11 substance-related DFT publications) were published prior to 1980.
According to Fig. 1, the annual publication volume has increased strongly since 1995. The curve of all DFT publications (blue line) is nearly parallel to the curve of DFT publications with a connection to an RN (substance-related DFT publications, red line) until 2012. Probably, the indexing of the recent years still needs some time to be completed so the years 2013 and 2014 should be [...]
Nearly all the topic curves in Fig. 2 show a decline or slowed growth rate in the years 2013 and 2014, just as the red curve showed in Fig. 1. This effect is probably also attributable to the delayed indexing for the recent years. The topics "Structure" and "Energy" start to increase before the other topics. As Fig. 2 indicates, index terms related to the topic "Energy" are only included in the record if the determination of the energy plays an essential role in the publication. They are not included if the energy calculation is only necessary to obtain properties of substances. For example, in order to calculate the structure of a substance, one obviously has to calculate the energy first; in such instances, the index terms related to the topic "Energy" are not added to the list of index terms. "Relativity" and "Magnetism" increase at a much slower rate than the other topics. The nine topics comprise 86.5 % of all DFT publications and 95.6 % of the substance-related DFT publications.
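The doubling times quoted in the abstract (about 5–6 years for the DFT literature, 11 years for CAplus, 14 years without patents) can be translated into average annual growth rates. The sketch below assumes idealized exponential growth, which the real publication curves follow only approximately.

```python
import math


def annual_growth_rate(doubling_time_years):
    """Average yearly growth rate implied by a doubling time (exponential model)."""
    return 2 ** (1 / doubling_time_years) - 1


def doubling_time(annual_rate):
    """Inverse relation: years needed to double at a constant yearly growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


# Yearly growth in percent for each doubling time mentioned in the text.
rates = {years: 100 * annual_growth_rate(years) for years in (5, 6, 11, 14)}
```

Under this model, doubling every 5–6 years corresponds to roughly 12 to 15 % growth per year, compared with about 6.5 % for a doubling time of 11 years and about 5 % for 14 years.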

Substance-related analysis of DFT literature
For the substance-related analysis, we extracted all Registry numbers from the publication set of all DFT papers (n = 114,138) and transferred them to the compound database Registry. The records of the compound database include various compound-specific information, in particular the chemical names, molecular formulas, and structure diagrams. The search for the number of compounds indexed in DFT literature and containing specific elements was based on the molecular formula field. We determined how many compounds containing a specific element have been indexed within the DFT publication set. Figure 3 shows a periodic table where, instead of the element symbols, the absolute number of compounds within the DFT literature is given. It is important to note that numbers in the table may overlap, e.g., between C (467,192) and O (274,893), because a compound containing both elements is counted for each of them. By far the most frequently occurring elements in compound-specific publications dealing with DFT calculations are hydrogen and carbon. Oxygen and nitrogen also occur very often in substance-related DFT calculations. The lanthanides and actinides occur about as often in compound-specific DFT calculations as the rare gas elements, with one exception: uranium occurs significantly more often than the other actinides.
Figure 4 shows the percentage of compounds that have DFT-related publications registered relative to all registrations for each specific element. Although the absolute numbers in Fig. 3 are rather low, the percentages of DFT-related compounds are quite high for the rare gases, many actinides, and polonium. Also, some transition metals (e.g., gold, platinum, palladium, rhodium, ruthenium, and osmium) show rather high relative occurrences. Figure 3 shows very high absolute numbers for hydrogen, carbon, nitrogen, and oxygen whereas Fig. 4 shows that their relative share of DFT-related compounds is rather low.
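The element counts behind Figs. 3 and 4 can be approximated with a short sketch that reads element symbols from molecular formulas. It assumes plain formulas such as "C6H6" without parentheses; the regular expression and the toy numbers are illustrative and do not reproduce the Registry search that was actually used.

```python
import re
from collections import Counter

# An element symbol is one capital letter optionally followed by one lowercase
# letter; the trailing digits are the stoichiometric count, which is ignored here.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def elements_in(formula):
    """Set of element symbols occurring in a molecular formula."""
    return {symbol for symbol, _count in ELEMENT.findall(formula)}


def compounds_per_element(formulas):
    """Number of compounds containing each element (one count per compound)."""
    counts = Counter()
    for formula in formulas:
        counts.update(elements_in(formula))
    return counts


def relative_share(dft_counts, all_counts):
    """Percentage of all registered compounds of an element that are DFT-related."""
    return {
        el: 100 * dft_counts[el] / all_counts[el]
        for el in dft_counts
        if all_counts.get(el)
    }


# Toy data: four compounds, hypothetical totals for two elements.
dft_counts = compounds_per_element(["C6H6", "CO2", "XeF2", "C60"])
shares = relative_share(dft_counts, {"C": 300, "Xe": 2})
```

Because a compound is counted once for every element it contains, the per-element counts sum to more than the number of compounds, which is the overlap noted for C and O. The toy shares also show how an element with few compounds overall can reach a high relative value.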
In total, 558,619 DFT-related compounds were found. Figure 5 shows the share of each element relative to the total of 558,619 DFT-related compounds. The color-coding is essentially the same as in Fig. 3.
Figure 6 shows the annual publication volume of DFT studies that investigate compounds containing certain elements, centered on carbon-containing compounds. Only the elements shown are allowed to occur in the sum formula (e.g., in the case of C, no elements other than carbon are allowed in the sum formula; CH indicates pure hydrocarbons, etc.). Organics is the super-set of CH, CHN, CHO, and CHNO. Of course, there are more organic compounds, but this analysis concentrates on pure organic compounds and excludes compounds with less common hetero-atoms. For comparison, the total curve of all substance-related DFT publications is also shown. Most of the compounds contributing to the C curve are fullerenes. Additionally, different oxidation states and isotopes of the carbon atom are registered as different compounds. The curves of CHN, CHO, and CHNO are very similar. Probably, the reason is that O and NH are isoelectronic. Therefore, most CHO compounds can also be calculated when oxygen is substituted by an NH group. The curve "Organics" (according to our definition) covers 37.2 % (n = 38,277 papers) of the substance-related DFT literature. Again, the decline or slowed growth rate in the years 2013 and 2014 is probably caused by the delayed indexing for the recent publication years.
Figure 7 shows the annual publication volume of DFT studies that investigate specific compound groups: inorganic metals, organometallic compounds, transition metal compounds, lanthanides, and actinides. Here, organometallic compounds are defined as compounds with at least one metal, carbon, and hydrogen atom. There is no restriction on additional elements. For comparison, [...]
The largest compounds calculated with DFT in terms of number of atoms are C6000 [65], C5120 [66], and C4860 [65].
All three compounds are fullerenes with icosahedral symmetry. Unfortunately, the Registry database does not provide point groups as additional information for the registered molecules, so one cannot search for the largest asymmetric molecule calculated with DFT. Also, the information about employed basis sets and specific density functionals is often missing in the CAplus database. Therefore, our search strategy cannot identify the computationally most demanding molecule calculated with DFT.

Analysis of seminal DFT papers
[...] peaks (1951, 1955, 1964/1965, 1970, 1972/1973, 1976, 1980, 1986, and 1988) can be located in the spectrum. The publications which are mainly responsible for these peaks are listed in Table 3. The red line in Fig. 8 visualizes the number of cited references per reference publication year. In order to identify those publication years with significantly more cited references than other years, the (absolute) deviation of the number of cited references in each year from the median of the number of cited references in the two previous, the current, and the two following years (t − 2; t − 1; t; t + 1; t + 2) is also visualized (blue line). This deviation from the 5-year median provides a curve smoother than the one in terms of absolute numbers. We used both curves for the identification of the peaks. This is a highly selective method, and many other seminal papers relevant to DFT are not mentioned in Table 3. However, such papers can be identified via the reference table alongside the spectrum in the CRExplorer.
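The deviation from the 5-year median that is used to locate peaks in the RPYS spectrum can be written down compactly. The sketch below makes two assumptions that the text does not settle: the deviation is taken as a signed difference, and at the edges of the spectrum the median is formed from whichever years of the window are available. The counts are invented for illustration.

```python
from statistics import median


def median_deviation(counts_by_year, half_window=2):
    """Deviation of each year's cited-reference count from the median of the
    window (t - 2, t - 1, t, t + 1, t + 2), using only years that are present."""
    deviations = {}
    for year, count in counts_by_year.items():
        window = [
            counts_by_year[y]
            for y in range(year - half_window, year + half_window + 1)
            if y in counts_by_year
        ]
        deviations[year] = count - median(window)
    return deviations


# Invented counts with one pronounced peak.
counts = {1962: 10, 1963: 12, 1964: 50, 1965: 14, 1966: 11}
deviations = median_deviation(counts)
peak_year = max(deviations, key=deviations.get)
```

Years whose count is close to that of their neighbors end up near zero, so a year dominated by a single highly cited publication stands out more clearly than in the raw counts.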
The cited references CR1, CR4, CR5, and CR11-CR14 of Table 3 were mentioned in the Background Section of this study. Four of them (CR11-CR14) propose new density functional approximations or improvements to existing ones. The cited references CR4 and CR5 are the foundational publications for modern DFT by Hohenberg and Kohn (CR4) and Kohn and Sham (CR5). The cited reference CR1 is Slater's approximation to Hartree-Fock exchange. The seven other cited references in Table 3 are not specific to DFT. They are of a more general nature.
In cited reference CR2, Roothaan proposes to construct molecular orbitals as a linear combination of atomic orbitals (LCAO). This proposal was made for Hartree-Fock theory but is used in virtually every widespread program package for post-Hartree-Fock and DFT calculations. In cited reference CR3, Mulliken proposed an electronic population analysis based on Roothaan's LCAO method. Using this methodology, it became possible to calculate partial charges and dipole moments.
In cited reference CR6, Boys and Bernardi proposed a new direct difference method for the computation of molecular interaction energies with reduced errors. Hehre, Ditchfield, and Pople presented new basis sets for the LCAO method in reference CR7. The 6-31G basis set, which became very popular, is among the basis sets presented in this cited reference. The relevance of polarization functions was pointed out by Hariharan and Pople in cited reference CR8, and the popular 6-31G* and 6-31G** basis sets were proposed. Baerends, Ellis, and Ros presented in cited reference CR9 a computational Hartree-Fock scheme using Slater's approximation and Roothaan's LCAO ansatz where Slater-type atomic orbitals do not increase the computational demand compared to Gaussian-type orbitals. Cited reference CR10 by Monkhorst and Pack is the only cited reference in Table 3 concerned specifically with the solid state. They propose a method for generating sets of special points in the Brillouin zone. This method provides a more efficient algorithm to integrate periodic functions of the wave vector in solid state calculations.

Discussion
Most DFT literature is substance-related. Therefore, the publication volumes of the general DFT literature are very similar to the publication volumes of the substance-related DFT literature. In terms of absolute numbers, most compounds calculated by DFT contain hydrogen, carbon, nitrogen, or oxygen. Also, 37.2 % of the substance-related DFT literature is concerned with compounds built from these four elements. 81.6 % of the substance-related DFT literature is covered when broader compound groups (inorganic metals, organometallic compounds, transition metal compounds, lanthanides, and actinides) are considered additionally. However, a relative perspective shows that DFT calculations were performed rather often in comparison with experiments for rare gas elements, many actinides, and polonium as well as some transition metals. Probably, we see rather high activity of DFT research for many actinides and polonium because of industrial interest in combination with interest in their radioactive decay. The interest in platinum, palladium, rhodium, ruthenium, and osmium might be due to their catalytic activity. The highly selective RPYS analysis shows the 14 most influential publications with relevance to DFT published between 1950 and 1990. Seven of these 14 publications were cited in the Background section of this manuscript. The other DFT publications cited in the Background section of this manuscript are newer or older.
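The compound groups behind these shares are defined purely by which elements occur in the sum formula, so they can be expressed as set tests. The following sketch assumes that a group such as CHN requires exactly the listed elements, and it uses a small, hypothetical set of metals; the complete definitions are those used for Figs. 6 and 7.

```python
import re

ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")

# Element sets for the carbon-centered groups (exact match assumed).
CLASSES = {
    frozenset({"C"}): "C",
    frozenset({"C", "H"}): "CH",
    frozenset({"C", "H", "N"}): "CHN",
    frozenset({"C", "H", "O"}): "CHO",
    frozenset({"C", "H", "N", "O"}): "CHNO",
}
ORGANICS = {"CH", "CHN", "CHO", "CHNO"}

# Hypothetical, incomplete list of metals used only for this example.
METALS = {"Fe", "Ni", "Pd", "Pt", "Au", "U"}


def elements_in(formula):
    """Set of element symbols occurring in a sum formula such as 'C2H5NO2'."""
    return frozenset(symbol for symbol, _count in ELEMENT.findall(formula))


def compound_class(formula):
    """Carbon-centered group of a compound, or None if other elements occur."""
    return CLASSES.get(elements_in(formula))


def is_organic(formula):
    """Organics is the super-set of CH, CHN, CHO, and CHNO (pure carbon excluded)."""
    return compound_class(formula) in ORGANICS


def is_organometallic(formula, metals=METALS):
    """At least one metal, carbon, and hydrogen; other elements are allowed."""
    elements = elements_in(formula)
    return bool(elements & metals) and {"C", "H"} <= elements
```

For example, `compound_class("C60")` yields the group C, which is not part of Organics, while `is_organometallic("C10H10Fe")` is true because the formula contains a metal together with carbon and hydrogen.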
We have to mention here the limitations of our study. Our retrieval strategy, based only on index terms, can be seen as a limitation as we obtain fewer publications this way than by a search in title, keywords, and abstract for DFT-related keywords. This strategy is a compromise to gather the publications where DFT plays a major role. Another search strategy would yield too many false positives, i.e. publications where DFT plays only a minor role although DFT-related keywords are mentioned in title, keywords, or abstract. In contrast to previous chemical bibliometric studies, we did not use only hit RNs but all RNs in the records. This is necessary because control tests showed that too few RNs were supplemented with DFT-related terms. However, we can assume that these limitations do not change the picture.
The RPYS analysis has certain additional limitations. There are also seminal DFT papers published before 1950 and after 1990. However, reference publication years more recent than 1990 require a different technical treatment because of the exponential increase in the number of publications and cited references. Seminal papers before 1950 comprise the historical roots of DFT and are an interesting subject for another analysis. We chose to be highly selective in the identification of seminal papers in the DFT literature. A less selective procedure would result in many more seminal papers. Such a detailed analysis is possible but is beyond the scope of our current analysis.
Of course, our search and analysis strategy is not limited to the topic DFT. Similar bibliometric analyses can be performed for other topics where a connection between publications and chemical substances is important.