Exploring CHEMeDATA. An interview with Damien Jeannerat

WHAT IS THE CHEMEDATA MOVEMENT?

CHEMeDATA.org is an extension of the NMReDATA.org initiative, which emanated from the NMR community. In this community, there was a significant challenge in sharing assignment information. There are multiple databases where you can find diverse types of NMR spectra associated with compounds, but no common system. Unlike, for example, X-ray crystallography, where there is a habit of saving data in a format that can be used by everybody, in NMR there is no generally accepted practice or central database. There are rules regarding how to present data for publishing, but for sharing the spectra, for example, there is a lack of standardization. The NMReDATA initiative proposed a format for annotating NMR assignments. It makes the link between the information in the spectra (signals have chemical shifts, a structure, and, often, coupling constants) and the corresponding protons and carbons in the molecule. The challenge is to do this in a way that is not only human readable but also understandable by a computer. The NMReDATA initiative was a success. The community adoption was quite fast, and the most important part was the inclusion of the format by software developers. Now the main NMR processing software platforms enable export and storage in this new format. The next step is for chemists to provide these NMR records (or NMReDATA) as supplementary data.

The IUPAC project on the "Development of a Standard for FAIR Data Management of Spectroscopic Data" is addressing this problem and, as an early implementation, the CHEMeDATA initiative will work on methods to identify and list the content of chemistry archive files and code the relations between their elements. In this manner, when one finds a reference to chemistry data listed on an editor's site or in a university archive, one can determine its content without having to download the whole dataset and look into it. Just displaying a set of smart labels would make it clear to people and computers what is there. This system will also allow exchange between databases. For example, a service specialized in, say, NMR could collect the assignments of organic compounds, and another the IR spectra, provided the license permits it. This will make data truly searchable, which is the basis of the first letter of the "FAIR" principles: "Findability." 2
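
The idea of listing the content of an archive and coding the relations between its elements can be illustrated with a minimal sketch. The manifest layout below, including every field name, file name, and predicate, is a hypothetical illustration and not a CHEMeDATA or IUPAC specification. It only shows how a small set of labels could be read without opening the archive itself.

```python
import json

# Hypothetical manifest describing one archive. All keys, paths, and
# predicates are assumptions made for this sketch, not a published format.
MANIFEST_JSON = """
{
  "archive": "compound_42.zip",
  "license": "CC-BY-4.0",
  "items": [
    {"id": "mol", "type": "structure", "path": "compound_42.mol"},
    {"id": "h1", "type": "nmr_spectrum", "nucleus": "1H", "path": "nmr/10/fid"},
    {"id": "asg", "type": "nmr_assignment", "path": "compound_42.nmredata.sdf"}
  ],
  "relations": [
    {"subject": "asg", "predicate": "assigns", "object": "h1"},
    {"subject": "asg", "predicate": "refers_to", "object": "mol"}
  ]
}
"""


def labels(manifest):
    """Return the sorted set of data types found in the archive."""
    return sorted({item["type"] for item in manifest["items"]})


def related(manifest, item_id):
    """Return (predicate, object) pairs for relations starting at item_id."""
    return [
        (rel["predicate"], rel["object"])
        for rel in manifest["relations"]
        if rel["subject"] == item_id
    ]


manifest = json.loads(MANIFEST_JSON)
print(labels(manifest))          # the "smart labels" a site could display
print(related(manifest, "asg"))  # what the assignment file points to
```

A database specialized in NMR could, under these assumptions, read only the manifest, check the license field, and decide whether the archive contains assignments worth collecting.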

WHAT ARE THE ADVANTAGES FOR RESEARCHERS?
The most obvious advantage for researchers is to increase the impact of their work. This also reflects on the visibility of the researcher. If the work that someone has performed can be used in different ways and be cited outside the circles of specialists, it demonstrates the value of the research beyond the conclusion of the day. Let us consider the satellite images acquired for a weather forecast. They have an immediate use for the weather forecast, but a good archive of these same images can also be used to, say, model the maturity of crops as a function of sun exposure through comparison with previous images. In fact, indirect uses often turn out to have a longer-lasting impact. What will matter in the future is very difficult to anticipate. Some may care only about the 13C satellites of your 1D proton NMR spectra or the unaccounted presence of compounds considered as "artifacts" by today's chemists. Maybe it is all under the noise level. Is that not an exciting prospect? In short, the influence of the research can multiply and be used in secondary applications.
One can also mention a more direct scientific benefit. Consider that you specialize in a narrow field covering a certain type of natural product. When analyzing your NMR spectra, you use computer-assisted structure determination utilizing chemical shift predictions (among other tools). Under the hood, these software packages rely on complex algorithms "learning" from the NMR data available at the time of the software's release. If you work on a new class of compounds, the prediction of chemical shifts may be substandard and cause concern due to discrepancies with the experimental data. Indeed, predicted chemical shifts will have lower precision and accuracy when experimental data are lacking. If you publish your NMR data and the chemical shifts are well reported and computer readable, future releases of the software would include these new data in the training of their algorithms, and YOU benefit the most from sharing your data because it improves the prediction for the type of compounds you study. In short, by sharing your data, you contribute to better chemical shift prediction tools, just to take one example. The same will be true for NMR coupling constants and other spectroscopic information. 3

WHAT DO YOU THINK ARE THE CHALLENGES FOR THE COMMUNITY IN ACHIEVING THESE GOALS? WHAT IS REQUIRED BY THE COMMUNITY TO ACHIEVE THESE GOALS?
The community has to think about how to support the FAIR principles, in particular how to make the type of data we use every day more "Findable." The key point will be to define relevant "chemistry objects" in an electronic manner. The questions to answer are: What are their relevant parameters, and what are their outcomes? For example, one could say about an NMR spectrum that the Larmor frequency is one of the parameters and a peak at a given chemical shift is an outcome. An NMR assignment should probably be a separate entity, where a peak at a given chemical shift is the input and the relation between the peak and a hydrogen atom in the chemical structure is the outcome. NMR is well covered, but other spectroscopies and chemical information need some input. At present there is a relatively small group of people interested in Open Data. Sometimes the expertise is not in the right hands; for example, you have very good experts on metadata who may have little exposure to the chemical problem. It is difficult to get the right combination of people. Currently we rely on the goodwill of a few who can make work-case and demonstration examples. At some point, this work will need financial support to provide tools to evaluate recommended practice, generate and validate the newly defined chemistry metadata, etc. It will follow the needs of the community, update recommendations according to new possibilities, etc.
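
The distinction made above between a spectrum (parameters in, peaks out) and an assignment (peak in, peak-to-atom relation out) can be sketched as a small data model. The class names, field names, and the atom indexing convention below are assumptions chosen for illustration only; they are not taken from the NMReDATA format or from any CHEMeDATA recommendation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Peak:
    """Outcome of a spectrum: a signal at a given chemical shift (ppm)."""
    peak_id: str
    shift_ppm: float


@dataclass
class NMRSpectrum:
    """A spectrum object: acquisition parameters and the resulting peaks."""
    nucleus: str
    larmor_frequency_mhz: float                      # parameter
    peaks: List[Peak] = field(default_factory=list)  # outcome


@dataclass(frozen=True)
class Assignment:
    """Separate entity: a peak is the input, a peak-to-atom link the outcome."""
    peak_id: str     # input: which peak is being assigned
    atom_index: int  # outcome: atom in the structure (hypothetical 1-based index)


def atoms_for_shift(spectrum, assignments, shift_ppm, tol=0.01):
    """Return the atoms assigned to peaks within tol ppm of shift_ppm."""
    ids = {p.peak_id for p in spectrum.peaks
           if abs(p.shift_ppm - shift_ppm) <= tol}
    return sorted(a.atom_index for a in assignments if a.peak_id in ids)


# Made-up example values, used only to exercise the model.
spectrum = NMRSpectrum("1H", 500.13, [Peak("p1", 7.26), Peak("p2", 2.17)])
assignments = [Assignment("p1", 3), Assignment("p2", 5), Assignment("p2", 6)]
print(atoms_for_shift(spectrum, assignments, 2.17))
```

Keeping the assignment outside the spectrum object means that, in this sketch, an assignment can be corrected or replaced later without touching the recorded spectrum.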

WHAT CAN WE LEARN FROM DISCIPLINES SUCH AS THE CRYSTALLOGRAPHIC COMMUNITY IN ACHIEVING THE MISSION OF THE CHEMEDATA MOVEMENT?
The crystallography community is inspiring because it succeeded very well. I do not think it is fair to say that it was easier for them, but the kind of data they produce allowed more direct access to the underlying information than, say, NMR. With the generation of three-dimensional structures, there are fewer variables and less error-prone human intervention. Access to the electronic information for the crystallographic community appeared early enough in the development of the web that having a centralized place made sense, and this community has continued on that trajectory. 19 Should all chemistry information be stored in a central database on the model of The Cambridge Crystallographic Data Centre (CCDC)? Right now, the tendency is clearly towards multiple initiatives that coexist at multiple locations. My expectation is to see a range of databases and services of diverse shapes and sizes. Some will have a broad range of data types, for example, institutional repositories or archive services such as Zenodo. Others will specialize in a specific field and include, say, the NMR spectra of organic compounds. Some will probably attempt to embrace the entire domain of chemistry. Such a service would probably only include metadata and forward the user to the actual location of the data. They may be used as search tools and links towards horizontal and vertical sources. Others may focus on the chronological evolution of chemistry information. Indeed, correcting and complementing data requires a powerful versioning system that allows archiving, ranking, and evaluating contributions by possibly very diverse contributors, including robots.

WHAT DO YOU SEE AS THE NEXT STEPS FOR THE CHEMEDATA MOVEMENT?
I think the next step is to communicate the need to simply provide chemistry data at the time of publication. One should not wait for perfectly annotated data to start the good habit of sharing chemistry data. Sure, the metadata associated with the chemistry data will greatly increase visibility and reusability, but future progress in artificial intelligence may be quite able to fill the gap. In parallel, we should be cognizant of the initiatives undertaken in this space by other fields of science. This will help determine what is viable for the chemistry community and show what recommendations IUPAC can synthesize from this prior work. The biggest error would be to do something completely new; we should take what exists, push towards the broadly accepted formats, and see how the community reacts. We should keep in mind that companies will only implement recommendations involving simple changes to their products unless the new data opens new business opportunities.

HOW CAN WE, AS A COMMUNITY, GET INVOLVED?
First, better organize your data to prepare them for sharing when needed. Don't forget the structure files, and don't worry about redundancy and uneven quality of the data: computers don't determine the fate of your soul. If your favourite journals are not requesting data, store them on third-party archive services. It may well be that you will be the first to use them a few years later, after you have changed computers four times and locations three times! One last thing: if you ever post one of these unusable PDFs including images of your NMR spectra (because this is what people do), please include the raw spectra from the NMR instrument. The latter are the file structures with numbers, including the fid, parameters, etc. Secondly, include the results of your hard assignment work! So, provide the file generated by your favorite NMR assignment software. Don't waste it!