Curating and extending data for language comparison in Concepticon and NoRaRe

Language comparison requires user-friendly tools that facilitate the standardization of linguistic data. We present two resources built on the basis of a standardized cross-linguistic format and show how the data are curated and extended. The first resource, the Concepticon, is a reference catalog for standardized concepts from linguistic research. While curating the Concepticon, we found that a variety of studies in distinct research fields collected information on word properties. However, until recently, no resource existed that contained these data to enable the comparison of the different word properties across languages. This gap was filled by the Database of Norms, Ratings, and Relations (NoRaRe), which is an extension of the Concepticon. Here, we present the major release of both resources, Concepticon Version 3.0 and NoRaRe Version 1.0, which represents an important step in our data development. We show that extending and adapting the data curation workflow in Concepticon to NoRaRe is useful for the standardization of cross-linguistic datasets. In addition, combining datasets from different research fields enables studies grounded in language comparison. Concepticon and NoRaRe include lexical data for various languages, tools for test-driven data curation, and the possibility for data reuse. The first major release of NoRaRe is also accompanied by a new web application that allows convenient access to the data.


Introduction
The comparison of languages is made possible by standardizing data from various sources. To facilitate this standardization, we need tools that help systematically unify the data and provide them in a FAIR format, i.e., the data need to be findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). Striving for this standard is especially difficult when dealing with linguistic data, since languages vary greatly and language scientists structure their data differently. Therefore, it is necessary to create a standardized format that applies to all languages and, at the same time, to provide the tools for effortless standardization. A community-led initiative has recognized this need and developed the Cross-Linguistic Data Formats (CLDF, Forkel et al., 2018), which provide specifications on how to format a given dataset to comply with the FAIR principles (Wilkinson et al., 2016). The CLDF format specifically targets interoperability and reusability, while the storage of the data on Zenodo (zenodo.org) accounts for findability and accessibility. The advantage of this framework is that cross-linguistic data from diverse languages become comparable by converting them into a standardized tabular format, and thus adding new data becomes straightforward. The data curation workflows established through CLDF also allow the curation and extension of data for language comparison. We apply a test-driven approach to data curation, i.e., specific tests automatically match the formal requirements of the data against the specifications developed in CLDF.
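The test-driven idea can be illustrated with a minimal sketch: raw list rows are converted into a uniform tabular shape and then validated automatically. The raw input format and the checks below are illustrative assumptions for the purpose of this example, not the actual CLDF specification or our toolchain.

```python
# Minimal sketch of test-driven standardization: a raw semicolon-separated
# list is converted to uniform rows, then automated checks validate them.
import csv
import io

# Hypothetical raw input with dataset-specific column names.
RAW = "no.;word\n1;tree\n2;bank\n"

def standardize(raw_text):
    """Convert raw rows into a uniform tabular representation."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    return [{"NUMBER": int(r["no."]), "ENGLISH": r["word"].strip()}
            for r in reader]

def run_checks(rows):
    """Automated consistency checks in the spirit of test-driven curation."""
    errors = []
    numbers = [r["NUMBER"] for r in rows]
    if len(set(numbers)) != len(numbers):
        errors.append("duplicate NUMBER values")
    if any(not r["ENGLISH"] for r in rows):
        errors.append("empty elicitation gloss")
    return errors

rows = standardize(RAW)
assert run_checks(rows) == []  # this toy list passes all checks
```

In the real workflows, such checks run automatically on every contribution, so formal mistakes are caught before the data enter the catalog.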
To compare lexical data across diverse languages, we use the CLDF format to curate resources such as the Concepticon (List et al., 2016a) and the Database of Norms, Ratings, and Relations (NoRaRe, Tjuka et al., 2022a). The goal of the Concepticon is to equip linguists with a reference catalog of "comparative concepts" (Haspelmath, 2010) through linking concept lists to standardized concept sets. As the Concepticon continues to develop, data curation workflows have proven useful in adding new data and improving existing data. From the beginning, the Concepticon contained a small amount of metadata on word properties, including age-of-acquisition ratings (e.g., Kuperman et al., 2012), naming tests (e.g., Ardila, 2007), and links to other databases such as BabelNet (Navigli & Ponzetto, 2012). However, the data were not continually enriched, and a variety of different types of data seemed to be accumulating from different research fields, for example, psychology and natural language processing. We therefore decided to construct a new resource, building on the Concepticon but using a customized workflow for the available data. This led to the creation of NoRaRe (Tjuka et al., 2022a). The goal of NoRaRe is to facilitate exchange between different research fields in order to answer big-picture questions using cross-linguistic comparison. The data in Version 0.2 (Tjuka et al., 2021) included norms, ratings, and relations from studies in linguistics and psychology, offering information on word properties such as word frequencies (e.g., Brysbaert & New, 2009), sensory modality ratings (e.g., Lynott et al., 2020), and similarity estimations (e.g., Hill et al., 2015), among others.
Here, we introduce the major release of Concepticon Version 3.0 (List et al., 2022a) and NoRaRe Version 1.0 (Tjuka et al., 2022b). Apart from new data and improvements, the releases include refinements to the accompanying Python packages, the publication of Concepticon and NoRaRe as CLDF datasets, and the publication of NoRaRe in a web application built on Cross-Linguistic Linked Data (CLLD, clld.org). The improvements represent an important step in the development of both resources, and we illustrate the data curation workflows implemented in Concepticon Version 3.0 and NoRaRe Version 1.0 below. Due to the scope of the data note, we cannot present the entire background information on both resources (for detailed overviews, see List et al., 2016a; Tjuka et al., 2022a). The interested reader may find additional information on technical details and tutorials on how to use the tools presented here on our blog Computer-Assisted Language Comparison in Practice (calc.hypotheses.org). Table 1 contains a glossary of relevant terminology.

Amendments from Version 2
In this version, we implemented minor changes based on the reviewer's suggestions. The changes include two additions: details on the ontological categories and a definition of the term "typed data". Any further responses from the reviewers can be found at the end of the article.

Concept and word lists
The Concepticon began with the collection of concept lists from studies in historical linguistics using cross-linguistic comparisons to create language family trees. These concept lists include basic vocabulary and cross-linguistically comparable concepts such as HAND, TREE, YOU, or GIVE. Historical linguists have used different versions of these lists to elicit the glosses for the concepts across languages and determine cognates indicating language relatedness. At the time one of the most commonly used lists of concepts was created (Swadesh, 1955), there was a lack of standardization efforts, so subsequent studies expanded or adapted the original list as they saw fit (List, 2018). The Concepticon was the first resource to include various concept lists and make them comparable. It enables researchers to find and use the available concept lists for their studies.
On the surface, concept lists look like lists of words. However, words represent concepts in the mind, and in the case of language comparison, there may not be a translational equivalent for a given concept in a language. Similarly, word lists used in psychology elicit word properties of concepts to determine whether they are perceived as abstract or concrete, positive or negative, etc. These studies offer additional data on a given word or concept in an individual language and are also integrated into the Concepticon (List et al., 2016a). While studies in linguistics comprise a small set of items, the rise of large-scale data collection has led psychologists to publish word lists that include thousands of words. In order to incorporate these data, the Concepticon was extended by establishing the NoRaRe database (Tjuka et al., 2022a).
The different contents of the concept lists in Concepticon are accounted for by giving them tags, which makes it more straightforward to search for a particular kind of data (List, 2018). Table 2 shows the 22 tags we use in the Concepticon (List et al., 2022a). For example, the tag areal comprises lists that are used to elicit concepts in a particular geographic area such as Vanuatu (e.g., Walworth & Shimelman, 2018). The lists with the tag body parts include studies that elicit body part terms in various languages (e.g., Majid & van Staden, 2015).

elicitation gloss
In linguistic fieldwork, an "elicitation gloss" is used to denote a concept in a metalanguage. For example, when English is used as a metalanguage, a researcher would use the elicitation gloss tree to elicit the expression for the concept TREE in another language, while a Spanish researcher would use the elicitation gloss árbol to elicit the same concept.

concept
We define "concept" as a non-linguistic psychological representation of an object in the world. The standardized concepts in Concepticon are built on the theoretical framework of "comparative concepts" proposed by Haspelmath (2010). These comparative concepts are defined by the researcher as a tool for language comparison. In psychology, there is a longstanding debate about what constitutes a "concept" (cf. Barsalou, 2017; Malt & Majid, 2013; Murphy, 2002). In comparison, the discussions in linguistics focus mainly on the term "word meaning" (cf. Bolognesi, 2020; Riemer, 2014). We do not propose a solution for these theoretical discussions.
concept identifier
A "concept identifier" is a unique number that is used to identify a concept set.

concept label
A "concept label" is a unique word or phrase describing a concept set.

concept set
The concept identifier and the concept label together form a "concept set", for example, 906 TREE. Concept sets in Concepticon are described with additional metadata: a definition, a semantic field, and an ontological category. The concept sets are defined to standardize the concepts which are used for language description and comparison.

concept list
The term "concept list" refers to a compilation of concepts in the form of elicitation glosses. Concept lists are used by linguists who want to elicit a concept in a particular language. In contrast to dictionaries, the lists are based on questionnaires and are compiled for language comparison or documentation (List, 2018). The list is usually a table of elicitation glosses such as tree, you, what, and bring, which represent the concepts TREE, YOU, WHAT, and BRING.

CLLD
Cross-Linguistic Linked Data (clld.org): The overarching project structure under which all of the data are published.

CLDF
Cross-Linguistic Data Formats (Forkel et al., 2018): The format used to standardize the information provided in linguistic research.

JSON
JSON (JavaScript Object Notation): A lightweight text format for the exchange of structured data (json.org/json-en.html).

SQL
Structured Query Language (SQL): A programming language for querying and managing relational databases.

Data curation workflows
The Concepticon also contained links to other resources such as OmegaWiki (OmegaWiki, 2020). At this point, the available metadata were slightly hidden, and we noticed that for words and concepts, there is a whole range of other data on norms, ratings, and relations to be found. Therefore, we decided to launch a satellite project that builds on the established workflows in Concepticon, making the available data from linguistics and psychology FAIR (Wilkinson et al., 2016), especially interoperable and reusable. This is how NoRaRe (Tjuka et al., 2022a) got started. Since 2019, we have continuously added new data to Concepticon and NoRaRe. This is the result of our data curation workflows, which are straightforward and can be managed by external and internal collaborators (e.g., student assistants). A list of all contributors to date can be found here: github.com/concepticon/concepticon-data/blob/v3.0.0/CONTRIBUTORS.md. The data curation of Concepticon is organized around the collaboration platform GitHub (concepticon-data repository: github.com/concepticon/concepticon-data). New lists or improvements start out as issues that can be triaged and addressed by the team of editors. If a list is selected for inclusion, the original data are transformed into a tabular format (if necessary), described by a metadata file, and additional metadata are added to the catalog. The concepts or words in the list are then mapped to the Concepticon concept sets (a description of the concept mapping is given below). To facilitate the process for internal and external contributors, we offer documentation under github.com/concepticon/concepticon-data/blob/v3.0.0/CONTRIBUTING.md and blog posts that provide step-by-step tutorials (Tjuka, 2020a). All improvements and new concept lists are added by creating a pull request (PR) so that the changes can be reviewed by the Concepticon editors. The review process is described in detail in a blog post (see Tjuka, 2021a). The editors are a group of expert linguists who not only check the formal correctness of the contribution but also discuss questions of content, for example, about concept mappings or additions of new concepts. Integrated into the PR are automated checks of the data that fail if a contribution is flawed. These checks have proven extremely useful, since accidental mistakes such as spelling errors or missing concept set identifiers tend to creep into the contribution workflow. This is the advantage of a test-driven data curation workflow; in addition, we have been able to identify mistakes in the original concept lists with this process.
The NoRaRe database is an extension of the Concepticon and thus uses similar workflows. Established in 2020 with two minor releases (Tjuka et al., 2021), the first major release of NoRaRe Version 1.0 includes 113 datasets across 39 languages (Tjuka et al., 2022b). The NoRaRe database is also curated on GitHub (norare-data repository: github.com/concepticon/norare-data). To ensure the same quality of data as in Concepticon, improvements or new lists are added by creating a PR and are reviewed by one of the editors. Currently, the editor team of the NoRaRe database is small but will likely grow in the future. Documentation (see github.com/concepticon/norare-data/tree/v1.0.1) and a step-by-step guide in the form of a blog post (Tjuka, 2021b) are provided as references for internal and external contributors. The data curation is also test-driven in that consistency checks are integrated to identify mistakes easily. In addition, the NoRaRe database offers predefined scripts that allow a quick correlation analysis of the data. The scripts are included in the examples folder: github.com/concepticon/norare-data/tree/v1.0.1/examples.
A blog post describes the use of the scripts and how NoRaRe datasets are compared (Tjuka, 2021c). For example, we compared word frequencies across English, German, and Chinese, and the results showed that word frequencies were similar across the three languages (Tjuka, 2020b). Another study compared sensory modality ratings across English, Dutch, and Italian, showing subtle differences in each sensory modality (Tjuka, 2022). Thus, the database can be used to compare word properties and reveal both cross-linguistic similarities and differences.
The result of our data curation efforts is 413 concept lists across 56 languages in Concepticon Version 3.0 (List et al., 2022a). Concepticon currently has 3,914 concept sets, with an average of 231.76 concept sets mapped in a given list. In total, 20,878 unique elicitation glosses are mapped to the Concepticon concept sets. NoRaRe Version 1.0 (Tjuka et al., 2022b) includes 113 datasets (25 norms, 65 ratings, and 38 relations) across 39 languages and data on 75 properties. Table 3 provides an overview of the descriptive statistics.

Manual and automated mapping
Detailed descriptions of our workflows are provided in the articles introducing Concepticon (List et al., 2016b) and NoRaRe (Tjuka et al., 2022a) as well as in the tutorials on our blog (Tjuka, 2020a; Tresoldi, 2019a; Tresoldi, 2019b). Here, we summarize the basic steps of the manual and automated mapping workflows.
The first step in the mapping workflow in Concepticon is to generate a mapping to the Concepticon concept sets. The Concepticon concept sets include an identifier, a description, and elicitation glosses linked from each list. Table 4 shows a small subset of the 3,914 Concepticon concept sets. Each concept set has a unique identifier (ID) and a Concepticon gloss. In addition, the semantic field, a description, and the ontological category are provided. The ontological categories are based on those used by the World Loanword Database (Haspelmath & Tadmor, 2009). However, they have been modified with regard to the specific terms used, because we are dealing with concepts rather than words that can be divided into parts of speech (List et al., 2016a: 2394). An algorithm based on previous mappings generates a list of pre-selected concept sets which are possible for a particular elicitation gloss (examples of elicitation glosses are illustrated in Figure 1). The contributor then checks whether the elicitation gloss represents the proposed Concepticon concept set. For example, depending on the information in the source, the elicitation gloss bank needs to be mapped either to the concept set 1284 BANK or 3463 RIVERBANK. It is important to note that we try to map as many elicitation glosses in a list as possible while, at the same time, improving the mappings by not mapping an elicitation gloss to a concept set if the meaning cannot be disambiguated. For example, if a list contains the elicitation gloss smoke without any further information about whether the verb or noun meaning is intended, we do not map it to one of the concept sets 778 SMOKE (EXHAUST) or 1689 SMOKE (INHALE). Elicitation glosses which are not assigned to a concept set are automatically linked to the concept set 0 NA. The mappings found in the Concepticon are the basis for the automatic mapping workflow used in NoRaRe.
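The pre-selection step can be sketched as follows; the actual algorithm lives in the pyconcepticon toolchain and may differ in detail. The mapping counts below are invented, while the concept-set IDs (1284 BANK, 3463 RIVERBANK, 906 TREE, 0 NA) are the ones given in the text.

```python
# Sketch of candidate pre-selection: concept sets are proposed for an
# elicitation gloss based on how often that gloss was mapped to each
# concept set in previously curated lists; unknown glosses fall back to 0 NA.
from collections import Counter

# gloss -> counts of (concepticon_id, concepticon_gloss) over earlier lists
PREVIOUS_MAPPINGS = {
    "bank": Counter({(1284, "BANK"): 7, (3463, "RIVERBANK"): 3}),
    "tree": Counter({(906, "TREE"): 25}),
}

def propose(gloss):
    """Return candidate concept sets, most frequently used first."""
    counts = PREVIOUS_MAPPINGS.get(gloss.lower())
    if not counts:
        return [(0, "NA")]  # no previous mapping: leave for manual decision
    return [concept_set for concept_set, _ in counts.most_common()]

assert propose("bank")[0] == (1284, "BANK")  # ranked suggestion for review
```

The contributor then accepts, reorders, or rejects the suggestions, which is how manual review and the algorithm cooperate.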
The automated mapping workflow in NoRaRe is used when it is not feasible to manually check each concept set mapping. Concept and word lists included in the Concepticon are small lists with about 100 to 1,000 items, whereas lists in NoRaRe can comprise thousands of words. The datasets in the NoRaRe database are generated by a Python script that automatically downloads the data from the source, i.e., an open repository or web page. Then the raw data are transformed into a tabular format and the words in the list are automatically mapped to the Concepticon concept sets.
For this procedure, the algorithm checks the elicitation glosses that are mapped to a given concept set in Concepticon (e.g., tree, árbol) and compares them with the words in a given list (e.g., tree, forest, wood). If there is a match, the word is mapped. Since the mapping is idempotent, the output is unchanged even if the algorithm is run multiple times on the same data. The result is a reduced list that includes only the words from the original list that were mapped to the Concepticon concept sets. This procedure decreases the file size and avoids storing an unnecessary amount of data. NoRaRe also includes mappings of typed data, for example, entries in Wikidata (wikidata.org) and BabelNet (babelnet.org), which is an added benefit compared to the data in Concepticon. The information on the websites is retrieved via an API and then stored in a JSON file with the corresponding metadata on data types and relations in the original dataset, i.e., typed data.
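A minimal sketch of this matching logic, assuming exact (case-insensitive) gloss comparison; the real implementation in pynorare may apply additional normalization. The gloss-to-concept table below is a toy stand-in for the existing Concepticon mappings.

```python
# Sketch of automated matching: words from a large word list are kept only
# when they match an elicitation gloss already linked to a concept set.
GLOSS_TO_CONCEPT = {"tree": 906, "árbol": 906}  # toy stand-in for Concepticon

def automap(words):
    """Map words to concept-set IDs; unmatched words are dropped."""
    return {w: GLOSS_TO_CONCEPT[w.lower()]
            for w in words if w.lower() in GLOSS_TO_CONCEPT}

mapped = automap(["tree", "forest", "wood"])
assert mapped == {"tree": 906}      # reduced list: unmatched words dropped
assert automap(mapped) == mapped    # idempotent: re-running changes nothing
```

The second assertion illustrates the idempotence noted above: running the mapping on its own output leaves the result unchanged.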

Python package pyconcepticon
The Python package pyconcepticon (pypi.org/project/pyconcepticon/3.0.0) supports data curation in Concepticon. To use pyconcepticon, a copy of the Concepticon data must be locally accessible. When pyconcepticon is installed with pip, the integrated commands can be called through the command line. Typing concepticon -h will give a list of the functionalities, for example, create_metadata, which automatically creates the metadata JSON file. Guides on how to use the functionalities can be found on our blog (Tjuka, 2020a; Tresoldi, 2019a; Tresoldi, 2019b).
The pyconcepticon package stores a number of tests that allow for consistency checks of the Concepticon data. Especially when a new concept list or improvements to existing data are added, the tests that run with the command concepticon test can spot inconsistent mappings, missing files, incorrect numbering of the concept sets, and many other mistakes. To check the consistency of an individual concept list, one can use the command concepticon check. The command concepticon validate inspects the availability of metadata for all concept lists. The pyconcepticon package includes several more commands that simplify the addition of new lists as well as the inspection of the available data.
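Conceptually, checks of this kind can be sketched as follows. This is not pyconcepticon's actual code; the column names follow the conventions used in the concept lists, and the known concept sets are a toy subset built from IDs mentioned in the text.

```python
# Conceptual sketch of consistency checks over a mapped concept list:
# every referenced concept-set ID must exist, and ID and gloss must agree.
KNOWN_CONCEPT_SETS = {906: "TREE", 1284: "BANK", 0: "NA"}  # toy subset

def check_list(rows):
    """Return a list of human-readable problems, empty if the list is clean."""
    problems = []
    for i, row in enumerate(rows, start=1):
        cid, cgl = row["CONCEPTICON_ID"], row["CONCEPTICON_GLOSS"]
        if cid not in KNOWN_CONCEPT_SETS:
            problems.append(f"row {i}: unknown concept set {cid}")
        elif KNOWN_CONCEPT_SETS[cid] != cgl:
            problems.append(f"row {i}: gloss {cgl!r} does not match ID {cid}")
    return problems

rows = [{"CONCEPTICON_ID": 906, "CONCEPTICON_GLOSS": "TREE"},
        {"CONCEPTICON_ID": 1284, "CONCEPTICON_GLOSS": "RIVERBANK"}]
assert check_list(rows) == ["row 2: gloss 'RIVERBANK' does not match ID 1284"]
```

Checks like these are what make accidental mistakes, such as a gloss pointing at the wrong identifier, surface immediately during review.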

Python package pynorare
Similar to the pyconcepticon package, we created a Python package for the curation of the NoRaRe data collection, called pynorare (pypi.org/project/pynorare/1.0.1). The pynorare package can be installed with pip once a local copy of the NoRaRe data repository has been downloaded. The command line is used to access the commands, and a list of the functionalities can be retrieved by typing norare -h. Guides on how to use the functionalities can be found on our blog (Tjuka, 2021b; Tjuka, 2021c).
The consistency of the data in NoRaRe can be tested with the command norare check, which checks the entries in the norare.tsv file (github.com/concepticon/norare-data/blob/v1.0.1/norare.tsv). The norare.tsv file includes information on all the variables in the NoRaRe datasets. It is important that these entries are consistent and that mistakes are immediately identified. Otherwise, the comparison across datasets would become unfeasible. Individual datasets can be checked for internal consistency by using norare validate. Furthermore, the command norare stats creates a summary statistic across all NoRaRe datasets and calculates the number of concept sets that have at least one link to a NoRaRe dataset as well as the number of dataset links available for each concept set.
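The kind of summary such a stats command reports can be sketched with toy link data; the dataset identifiers below are hypothetical, and only the concept-set IDs come from the text.

```python
# Sketch of a summary statistic over concept-set/dataset links: how many
# concept sets have at least one link, and how many links each one has.
from collections import Counter

# Toy (concept_set_id, dataset_id) link pairs; dataset IDs are invented.
links = [
    (906, "ds-frequency-en"),
    (906, "ds-arousal-de"),
    (1284, "ds-frequency-en"),
]

links_per_concept = Counter(cid for cid, _ in links)
n_linked = len(links_per_concept)  # concept sets with at least one link

print(f"{n_linked} concept sets linked; links per concept: {dict(links_per_concept)}")
```

Aggregations like this make it easy to see which concept sets are well covered across the NoRaRe datasets and which are not.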

Increasing reusability through Cross-Linguistic Data Formats
Both Concepticon and NoRaRe are published as CLDF (Forkel et al., 2018) datasets. An advantage of these CLDF datasets is that they include the information needed to load the data from Concepticon or NoRaRe into a relational database and perform various queries. The Python package pycldf (pypi.org/project/pycldf/1.29.0) can convert any CLDF dataset into an SQLite database that allows queries with SQL. This process replicates the construction of the web applications described below. For Concepticon, queries could include listing the Concepticon concept sets in a given concept list or showing concept set relations, such as plotting all the narrower concept sets connected to 1262 BROTHER, i.e., 559 BROTHER (OF MAN), 560 BROTHER (OF WOMAN), 1759 OLDER BROTHER, 1760 YOUNGER BROTHER, etc. The relations broader versus narrower indicate concept sets that refer to specific parts of a more general concept. For example, we established the broader concept set 3626 KNOW and related it to the narrower concept sets 1410 KNOW (SOMETHING) and 2248 KNOW (SOMEONE) because we found that several concept lists did not provide a clear distinction between the two concepts. It is important to note that the Concepticon does not provide a full ontology; rather, the relations are established bottom-up and mainly serve as an exploration of the data instead of providing a basis for inferences. For NoRaRe, one could query word frequencies for words expressing the same concept. Another possibility would be to compute correlations by assembling data from different datasets, for example, arousal ratings from different languages. The Python packages pandas (pypi.org/project/pandas/1.5.1) and seaborn (pypi.org/project/seaborn/0.12.1) allow the creation of a dot plot for the correlation. Note that variables may have multiple values assigned to the same concept set because different words were mapped to the same concept set. Although these cases are rare, researchers need to inspect the mapped words before deciding whether or not they want to include them in a correlation study.
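Such a relation query can be sketched with SQLite from the Python standard library. Here a hand-built toy table stands in for a real pycldf export, so the table and column names are illustrative assumptions; the concept-set IDs are the ones given above for 1262 BROTHER and its narrower concept sets.

```python
# Sketch of querying concept-set relations in an SQLite database, using an
# illustrative relations table instead of an actual pycldf conversion.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE relations (source INTEGER, relation TEXT, target INTEGER)")
con.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
    (1262, "narrower", 559),   # BROTHER (OF MAN)
    (1262, "narrower", 560),   # BROTHER (OF WOMAN)
    (1262, "narrower", 1759),  # OLDER BROTHER
    (1262, "narrower", 1760),  # YOUNGER BROTHER
])

# All narrower concept sets connected to 1262 BROTHER.
rows = con.execute(
    "SELECT target FROM relations "
    "WHERE source = 1262 AND relation = 'narrower' ORDER BY target"
).fetchall()
assert [t for (t,) in rows] == [559, 560, 1759, 1760]
```

Against a real pycldf-generated database, the same idea applies, with the table layout determined by the CLDF metadata of the dataset.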

CLLD web applications
The CLDF datasets described above are the input for the clld applications developed in the Cross-Linguistic Linked Data (CLLD) project (clld.org; documentation under clld.readthedocs.io/en/latest). The CLLD project allows the curation and development of lexical and grammatical databases. The clld toolkit (pypi.org/project/clld/10.0.0) is a Python package that integrates functions for building and maintaining CLLD web applications. These web applications (short: web apps) can be conveniently accessed via a web browser. They are also a good way to check the consistency of data and are a form of data reuse. The CLDF datasets of Concepticon and NoRaRe include all the information to create clld web apps for each data collection. Each new data release is accompanied by an update of the web applications. The most significant change for the Concepticon web app (concepticon.clld.org), apart from the data update, is the integration of a link to the NoRaRe data and a summary statistic indicating the number of links to variables and datasets for each concept set, which replace the metadata box (see Figure 2).
The clld web app for NoRaRe (norare.clld.org) was introduced for the first time with the major release of NoRaRe Version 1.0 (Tjuka et al., 2022b). The NoRaRe web application has a similar structure to the Concepticon web application, while at the same time, the features of NoRaRe, including a list of all datasets and the variables for each concept set, are highlighted. The word cloud on the front page is automatically generated based on the tags used for each variable in NoRaRe (see Figure 3). The font size represents the frequency of the individual tags across all datasets in NoRaRe. Most NoRaRe datasets include multiple variables, which become apparent by clicking on the link for a given dataset. The web application also shows which glosses are mapped to a Concepticon concept set. A map illustrates the distribution of languages associated with each value (see Figure 4). Since many datasets containing norms, ratings, and relations come from psychological studies, the bias toward Central European languages is obvious. However, once cross-linguistic data from linguistics are added, the distribution of languages will extend to areas such as Africa and New Guinea.

Conclusion
The present article introduced the major release of the cross-linguistic databases Concepticon Version 3.0 (List et al., 2022a) and NoRaRe Version 1.0 (Tjuka et al., 2022b). We discussed the contents of both resources and their underlying data curation workflows. The Concepticon includes standardized concept sets that allow comparison across many languages. NoRaRe offers data on norms, ratings, and relations for words and concepts and is an extension of the Concepticon. With the major release, new data were added, the data were published in CLDF format, and for NoRaRe, a web application was created.
Our data curation workflows have proven applicable for the advancement of data for language comparison, and we envision that other researchers will use the two resources for their studies. The availability of the data as CLDF datasets and the Python packages makes it possible to explore and test the data more conveniently. Web applications for Concepticon and NoRaRe offer an additional overview of the available data. At this point, Concepticon includes 413 concept lists across 56 languages and 3,914 concept sets. NoRaRe contains 113 datasets with 75 word properties across 39 languages. We intend to further expand these data collections in the future.

Ethical approval and consent
Ethical approval and consent were not required.
Open peer review

The authors provide adequate information about their methods and materials, which enables researchers to reproduce the results of studies that utilize the data provided by these resources. Additionally, this information allows interested researchers to add new data to the resources. It is worth mentioning that several technical issues are discussed in blog posts authored by the editors, which may further encourage the use of these resources (and which can be seen as complementary to this data note).
In light of the growing need for standardized cross-linguistic datasets and their importance in enabling reproducible research in linguistics, the newer versions of Concepticon and NoRaRe are a timely and valuable contribution. Therefore, I recommend accepting this data note (with minor changes; see my comments below).

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Historical Linguistics; Lexical Semantics; Lexical Typology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer comment: "… improving the mappings by not mapping an elicitation gloss to a concept set if the meaning cannot be disambiguated." Providing an example where the described limitation made it impossible to create a mapping would be helpful for the reader.

Author response: We integrated the suggestion of the reviewer by adding the following sentence: "For example, if a list contains the gloss smoke without any further information about whether the verb or noun meaning is intended, we do not map it to one of the concept sets 778 SMOKE (EXHAUST) or 1689 SMOKE (INHALE)."

Both datasets represent huge compilation and standardization efforts that are very useful for the community and allow for much larger-scale research than was possible before. Furthermore, the way the authors have made the data public follows FAIR principles: the data are easily findable, accessible, interoperable, and reusable. I particularly like that they have built web apps that make browsing the datasets easy. Another great aspect is that the authors have worked towards a workflow that allows external people to contribute to the datasets. This is an aspect that merits more discussion in the paper.
While the work itself seems to be excellent (as far as I can tell; see below for missing information, which I need in order to properly evaluate it) and is certainly relevant for the research community, the reporting in the article needs quite a bit of work. In particular, it should provide more information and less technical detail, as well as structure the information better, as follows.

ASPECT TO IMPROVE #1: MORE INFORMATION
The article should be self-contained and provide sufficient information about the dataset for the reader to be able to understand both how it was constructed and the information it contains. The following needs to be addressed: Please include a table with relevant descriptive statistics of the datasets in their 3.0 and 1.0 releases (minimally n. of datasets, concepts, words, languages, language families).

○ The main notions around which the data are organized should be more clearly explained, with examples. The notion of "concept set" was particularly hard for me to understand; I had to check the websites, and even there it took some digging: The example on the homepage of the CLLD web app (https://concepticon.clld.org/) talks about the concept set SIBLING, so it looks like the other concepts in the network shown (OLDER SIBLING, YOUNGER BROTHER, ...) are members of this set. However, if one queries the interface, OLDER SIBLING and YOUNGER BROTHER are themselves concept sets. It turns out that "concept set" does not refer to sets of concepts in Concepticon, but to sets of concepts defined in the primary datasets. Please explain and illustrate how you use the following terms: "concept", "concept set", "concept label", "concept list". Also explain how these terms relate to words as commonly defined by linguists, and what the place of words is in the Concepticon (including improving the claim that "concept lists look like a list of words. However, words represent concepts in the mind" -- check e.g. G. Murphy's chapter on word meaning in The Big Book of Concepts).

○ Also provide information about the ontological categories: how was the set of categories defined, and how were they assigned to specific concepts?

○ The mapping between the primary datasets and Concepticon includes an automatic step: "the algorithm checks the elicitation glosses that are mapped to a given concept set in Concepticon and compares them with the word in a given list. (...) If there is a match, the word is mapped." Please explain how the checking and comparing is made, and what the criteria are to decide whether there is a match, at an adequate level of abstraction for the present report. Also, explain "elicitation glosses" -- I thought they were definitions but Figure 1 suggests they are words. Maybe it depends on the dataset?
○ Also please provide information on how many datapoints from the primary sources are left unlinked at present (after the manual and automatic mappings you have done so far).
○ Please explain the data curation workflow better, clearly specifying the different steps before moving to a description of each. CLDF is presented in different places and I am still not sure I understand how it is used for the present datasets -- again, examples would help. Also please distinguish clearly in the text between the format and the functionalities it affords.

○ Please also clarify why you release datasets in two formats (CLDF, another -- which?).
○ Also please include a brief note on how CLDF relates to other standardization efforts such as TEI.
ASPECT TO IMPROVE #2: LESS TECHNICAL DETAIL

Please remove technical details that belong in the documentation, not the article. Two examples:

○ 1) In the description of the Python packages, details such as "Typing concepticon -h will give a list of the functionalities." Important: Make sure that the information that is relevant to understand the functionalities of the packages remains.

○ 2) You include links to many specific files in the repositories. In the article, you should provide only the main links and the information that is relevant to understand what they contain. For where to find which specific kind of information, you should rely on the documentation in the repositories.

○ Are you sure Table 1 is needed? If so, expand the information you provide -- e.g. line 2 is not understandable ("Concept lists that contain further annotations which exceed the complexity of ranks"?).

ASPECT TO IMPROVE #3: STRUCTURE
The way the information is presented in the current version makes it difficult for the reader to understand the material. I list here what struck me the most, but the article needs to be thoroughly revised in this respect.

○ The abstract mixes specific considerations about the historical development of the dataset that is being presented with general motivation and with relevant aspects of the dataset. The line "The Concepticon is based..." talks about the Concepticon as if it were already familiar to the reader.

○ In the introduction, after a 3-sentence general motivation, the article jumps into details without giving the general picture: "For this reason, a community-led initiative developed the Cross-Linguistic Data Formats (CLDF, Forkel et al., 2018) [...] The CLDF format specifically targets interoperability and reusability..."

○ "test-driven" is mentioned in the abstract and the keywords, but then it's not introduced explicitly in the article: the first mention is "This is the advantage of a test-driven data curation workflow" -- please explain what a test-driven data curation workflow is beforehand.
○ Please revise the structuring into sections: e.g., should the section "Consistency and transparency through CLDF" really be at the same level as Intro, Materials and Methods, etc.? Can you use a more transparent title? (This section is particularly unclear.) How about the subsections?
○ More generally, please remove the many redundancies in the text (including two long links to the markdown file containing the list of contributors!).

CONCLUSION
While the work has merit, I cannot approve the article in its present form. I am looking forward to a revised version.
---

Typos/other:

○ Figure 1 really helps! The "arbitrarité" part was confusing until I understood it was the logo for Concepticon. It's not the most transparent logo. :) (By the way, on the https://concepticon.clld.org/ website it says "arbitraire".)

○ The text in Figure 2 is too small to be read.

Please explain and illustrate how you use the following terms: "concept", "concept set", "concept label", "concept list". Also explain how these terms relate to words as commonly defined by linguists, and what the place of words is in the Concepticon (including improving the claim that "concept lists look like a list of words. However, words represent concepts in the mind" -- check e.g. G. Murphy's chapter on word meaning in The Big Book of Concepts).

The reviewer is correct, and we noticed that the terminology used in linguistics does not always match the terminology used in psychology. Therefore, some translation work is required. To avoid confusion about the terminology and abbreviations used in our manuscript, we have included a glossary (Table 1) of relevant terms and their definitions.
○ Also provide information about the ontological categories: how was the set of categories defined, and how were they assigned to specific concepts?

The ontological categories are based on the categories used by the World Loanword Database (Haspelmath and Tadmor, 2009). However, they were modified regarding the concrete terms used, in order to make clear that we do not talk about concrete parts of speech since we are dealing with concepts and not with words (see List et al., 2016: 2394).
○ The mapping between the primary datasets and Concepticon includes an automatic step: "the algorithm checks the elicitation glosses that are mapped to a given concept set in Concepticon and compares them with the word in a given list. (...) If there is a match, the word is mapped." Please explain how the checking and comparing is made, and what the criteria are to decide whether there is a match, at an adequate level of abstraction for the present report. Also, explain "elicitation glosses" -- I thought they were definitions but Figure 1 suggests they are words. Maybe it depends on the dataset?

We added a definition of "elicitation gloss" in the glossary (Table 1) at the beginning of the manuscript. We hope that the definition makes it clear that the automatic check compares word forms. The examples in the sentence should make the comparison more transparent.
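The kind of word-form comparison described here can be sketched in a few lines. This is an illustrative toy version, not the actual pyconcepticon algorithm, and the linked glosses below are hypothetical examples following the tree/árbol illustration:

```python
# Toy sketch of the automated mapping step: entries of a new concept list
# are compared against elicitation glosses already linked to concept sets,
# and an entry is mapped when its normalized form matches a known gloss.
# (Illustrative only; not the actual pyconcepticon implementation.)

def normalize(gloss):
    """Lowercase and strip a gloss for a simple string comparison."""
    return gloss.strip().lower()

def map_list(linked_glosses, new_list):
    """Return a mapping from list entries to concept-set identifiers."""
    mappings = {}
    for entry in new_list:
        for concept_set, glosses in linked_glosses.items():
            if normalize(entry) in {normalize(g) for g in glosses}:
                mappings[entry] = concept_set
                break
    return mappings

# Hypothetical linked glosses; "wood" finds no match and stays unmapped:
linked = {"TREE": {"tree", "árbol"}, "FOREST": {"forest"}}
print(map_list(linked, ["tree", "forest", "wood"]))
```

In the real workflow, unmatched entries such as "wood" would remain for manual mapping.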
"For this procedure, the algorithm checks the elicitation glosses that are mapped to a given concept set in Concepticon (e.g., tree, árbol) and compares them with the word in a given list (e.g., tree, forest, wood)."

○ Also please provide information on how many datapoints from the primary sources are left unlinked at present (after the manual and automatic mappings you have done so far).

We added an example illustrating our decision to unlink a given concept. However, we would like to refrain from stating the number of unlinked data points. As the Concepticon and NoRaRe are constantly being developed, we add new links in each version and also update the links, for example, when a new concept set is introduced. We update our summary statistics with each release and the most recent numbers are provided here: https://github.com/concepticon/concepticon-data/tree/v3.0.0/concepticondata. Furthermore, data points which are not linked can also be inspected on our web page directly: when looking up the concept set "0", one can see that we have 202 concept lists in which elicitation glosses are not linked to concept sets. These amount to more than 20,000 elicitation glosses (https://concepticon.clld.org/parameters/0). Given that Concepticon by now comprises 120,000 elicitation glosses in total, we succeed in linking more than 80% of the data.
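The linking rate quoted here can be double-checked from the round figures given in the response (both counts are approximate):

```python
# Sanity check of the reported linking rate, using the approximate counts
# from the text: roughly 20,000 of 120,000 elicitation glosses remain
# unlinked, so more than 80% are linked.
unlinked = 20_000
total = 120_000
linked_share = 1 - unlinked / total
print(f"linked: {linked_share:.1%}")  # about 83.3%
assert linked_share > 0.80
```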
"For example, if a list contains the elicitation gloss smoke without any further information about whether the verb or noun meaning is meant, we do not map it to one of the concept sets 778 SMOKE (EXHAUST) or 1689 SMOKE (INHALE). Elicitation glosses which are not assigned to a concept set are automatically linked to the concept set 0 NA."

○ Please explain the data curation workflow better, clearly specifying the different steps before moving to a description of each. CLDF is presented in different places and I am still not sure I understand how it is used for the present datasets -- again, examples would help. Also please distinguish clearly in the text between the format and the functionalities it affords.

We recognize the reviewer's concern about the structure and presentation of our descriptions. We, therefore, moved the description of the Python packages pyconcepticon and pynorare below "Manual and automated mapping" in the "Materials and Methods" section. In this way, the information about the CLDF format is not interrupted by the description of functionalities. Due to the scope of the Data Note, we cannot provide an extensive description of the workflows but we added a note at the beginning of the section "Manual and automated mapping" where interested readers can find more details. Furthermore, we added references to our blog posts in which we provide tutorials on how to use the Python packages (see also comments below and the comment by Reviewer 2).
"Detailed descriptions of our workflows are provided in the articles introducing Concepticon (List et al., 2016b) and NoRaRe (Tjuka et al., 2022a) as well as in the tutorials on our blog (Tresoldi, 2019a; Tresoldi, 2019b; Tjuka, 2020a). Here, we summarize the basic steps of the manual and automated mapping workflows."

"Typing concepticon -h will give a list of the functionalities, for example, create_metadata which automatically creates the metadata JSON file. Guides on how to use the functionalities can be found on our blog (Tresoldi, 2019a; Tresoldi, 2019b; Tjuka, 2020a)."

○ CLDF, CSVW, CLLD, ... it gets confusing, please help the reader.

We added all abbreviations in the glossary (Table 1) at the beginning of the manuscript.
○ Please also clarify why you release datasets in two formats (CLDF, another -- which?).

We clarified the reasons for releasing CLDF versions of the data in the first paragraph of the section "CLDF datasets".
"With the major release of Concepticon Version 3.0 (List et al., 2022a) [...]"

○ Two examples: 1) In the description of the Python packages, details such as "Typing concepticon -h will give a list of the functionalities." Important: Make sure that the information that is relevant to understand the functionalities of the packages remains.

We added an example of a functionality and further references to tutorials on our blog. "Typing concepticon -h will give a list of the functionalities, for example, create_metadata which automatically creates the metadata JSON file. Guides on how to use the functionalities can be found on our blog (Tresoldi, 2019a; Tresoldi, 2019b; Tjuka, 2020a)."

○ 2) You include links to many specific files in the repositories. In the article, you should provide only the main links and the information that is relevant to understand what they contain. For where to find which specific kind of information, you should rely on the documentation in the repositories.

We thank the reviewer for the suggestion but we decided to keep the links in the manuscript. As mentioned in the comment above, the Data Note is intended to provide a reference point so that researchers find all relevant information in one place. Linking to the various files in the GitHub repository and documentation is a way for us to bundle the information that already exists. We hope that researchers can use the Data Note as an entry point to explore our resources in more detail.
○ Are you sure Table 1 is needed? If so, expand the information you provide -- e.g. line 2 is not understandable ("Concept lists that contain further annotations which exceed the complexity of ranks"?).

We included Table 1 (now Table 2) because the tags have been updated since the last publication in List (2018) and we wanted to incorporate the new tags to make it easier for researchers to find the data they are looking for. In addition, the table provides descriptive statistics and examples of lists available in Concepticon and NoRaRe. However, we recognise that some of the descriptions were not sufficient and improved them accordingly.
○ The way the information is presented in the current version makes it difficult for the reader to understand the material. I list here what struck me the most, but the article needs to be thoroughly revised in this respect. The abstract mixes specific considerations about the historical development of the dataset that is being presented with general motivation and with relevant aspects of the dataset. The line "The Concepticon is based..." talks about the Concepticon as if it were already familiar to the reader.

We revised the abstract according to the reviewer's suggestion.

Abstract: "Language comparison requires user-friendly tools that facilitate the standardization of linguistic data. We present two resources built on the basis of a standardized cross-linguistic format and show how the data is curated and extended. The first resource, the Concepticon, is a reference catalog for standardized concepts from linguistic research. While curating the Concepticon, we found that a variety of studies in distinct research fields collected information on word properties. However, until recently, no resource existed that contained these data to enable the comparison of the different word properties across languages. This gap was filled by the Database of Norms, Ratings, and Relations (NoRaRe), which is an extension of the Concepticon. Here, we present the major release of both resources - Concepticon Version 3.0 and NoRaRe Version 1.0 - which represents an important step in our data development. We show that extending and adapting the data curation workflow in Concepticon to NoRaRe is useful for the standardization of cross-linguistic datasets. In addition, combining datasets from different research fields enables studies grounded in language comparison. Concepticon and NoRaRe include lexical data for various languages, tools for test-driven data curation, and the possibility for data reuse. The first major release of NoRaRe is also accompanied by a new web application that allows convenient access to the data."

○ In the introduction, after a 3-sentence general motivation, the article jumps into details without giving the general picture: "For this reason, a community-led initiative developed the Cross-Linguistic Data Formats (CLDF, Forkel et al., 2018) [...] The CLDF [...]"

○ p. 3, "preface" ==> "surface".

We changed the typo. "On the surface, concept lists look like a list of words."

○ p. 4, "The different contents of the lists", not clear what "the lists" refers to here.

We added further information to specify which lists were meant. "The different contents of the concept lists in Concepticon are accounted for by giving them tags which make it more straightforward to search for a particular kind of data (List, 2018)."

○ p. 5, "These correlation studies not only show interesting patterns arising from the data such as similarities in word frequencies across diverse languages (Tjuka, 2020b) but also illustrate that cross-linguistic comparable datasets for particular word properties such as sensory modality are not available yet" ==> ?

We paraphrased the sentence. "For example, we compared word frequencies across English, German, and Chinese, and the results showed that word frequencies were similar (Tjuka, 2020b). Another study compared sensory modality across English, Dutch, and Italian showing subtle differences in each sensory modality (Tjuka, 2022). Thus, the database can be used to compare word properties and reveal both cross-linguistic similarities and differences."

○ p. 7, "NoRaRe also allows the mapping of typed data such as networks, which is an added benefit compared to the data in Concepticon": what do you mean with "typed data such as networks"? Allows, or has been done already?

We specified the data type. "NoRaRe also includes mappings of typed data, for example, entries in Wikidata (www.wikidata.org) and BabelNet (www.babelnet.org), which is an added benefit compared to the data in Concepticon."

○ p. 7, "the fact that CLDF datasets can be converted to relation databases turning relations between tables into foreign key constraints implies that valid CLDF datasets will have unique identifiers for each row of a table": please rephrase in non-technical terms and explain why this is good.

Figure 1. Illustration of the content of the Concepticon (List et al., 2016a) on the left and NoRaRe (Tjuka et al., 2022a) on the right.
The clld web application for Concepticon was already introduced with the first release of Concepticon Version 1.0 (List et al., 2016b) in 2016. The major release of Concepticon 3.0 (List et al., 2022a) and NoRaRe 1.0 (Tjuka et al., 2022b) brings a number of updates that affect the presentation of data. In the previous version of the Concepticon web application, different kinds of metadata on word frequency, concreteness ratings, links to WordNet (Miller et al., 1990), etc. were represented in a box beside the elicitation glosses linked to a given concept set.

Figure 2. Replacement of the metadata box with a link to NoRaRe for each concept set.

Figure 3.

Figure 4. Distribution of languages in the NoRaRe datasets.


Table 1. Glossary of relevant terms and abbreviations occurring in the text. The table is adapted from Tjuka et al. (2022a).

Table 2. Tags for the Concepticon concept lists (List et al., 2022a). The tags in the table are repeated from a blog post (List, 2018). Some lists receive multiple tags, so the total number is higher than the number of lists in the Concepticon.

Tag | Description | Number of lists | Example
[...] | [...] | 13 | Dunn et al. (2017)
naming test | A list designed for a naming test in neurology or psycholinguistics to assess the linguistic capability of children and adults. | 5 | Ardila (2007)
proto-language | A list illustrating the concepts in a proto-language that can be reconstructed with high certainty. | 5 | Bodt & List (2019)
questionnaire | A questionnaire for linguistic fieldwork. | 51 | Buck (1949)
ranked | A list that shows items in a ranked order and has one column reflecting the rank. The ranks are based on, for example, phylogenetic analyses or analysis of borrowing frequencies. | [...] | [...]

Table 4. Examples of Concepticon concept sets across the six ontological categories.

(Forkel & List, 2020). CLDF's extensibility makes it possible to add custom tables to such a package transparently while keeping the semantics of the fully standardized parts of the data, like the table of languages, intact. But the strict consistency promises that CLDF makes and tools like pycldf (pypi.org/project/pycldf/1.29.0) enforce also serve as quality control during the data curation workflow to improve reusability. CLDF datasets follow the data model of relational databases, where tables can link to each other using foreign keys. This means that CLDF datasets can automatically be converted to relational databases to facilitate efficient data access. It also means that CLDF datasets follow the recommendations for "tidy" data (Wickham, 2014; Wilson et al., 2017), in particular by providing unique identifiers for each row of a table. In the context of NoRaRe, another feature of CLDF, or rather the underlying CSVW specification (pypi.org/project/csvw/3.1.3), also helps with quality control. CSVW metadata can specify datatypes for data in CSV files, and thus augment the "raw" text data with well-defined conversions to typed data. For NoRaRe, this turns out to be particularly important, because NoRaRe variables provide many different types of data, ranging from continuous numbers from a limited range to categorical variables with string values from a controlled vocabulary. The corresponding CSVW datatype descriptions then serve as documentation of valid assumptions for data reuse, but also as specifications for data consistency checks, which are built into CLDF validation.

(Tjuka et al., 2022c). While the raw data is updated and improved consistently in the respective GitHub repositories, the CLDF datasets allow for the reuse and exploration of the data in a different way. The data in the CLDF datasets are represented in tabular format and the corresponding metadata in JSON format. The bundling of the data makes it possible to represent relations between the tables. The data model is available here: github.com/concepticon/concepticon-cldf/blob/v3.0.0/cldf/README.md. By releasing the data in Concepticon and NoRaRe as CLDF datasets, tools such as csvkit (pypi.org/project/csvkit/1.1.1) can be used to process and analyse the data conveniently. Examples of how to use the Concepticon data from the CLDF dataset can be found under github.com/concepticon/concepticon-cldf/tree/v3.0.0/doc and for NoRaRe under github.com/concepticon/norare-cldf/tree/v1.0.0/doc.
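The role that CSVW datatype descriptions play in validation can be illustrated with a small sketch. The column description below is hypothetical (not taken from the actual NoRaRe metadata), and the checking function is a simplified stand-in for what the csvw and pycldf libraries do during CLDF validation:

```python
# Simplified illustration of CSVW-style datatype checking: a declared
# datatype turns raw CSV strings into typed values and rejects values
# outside the declared range. (Hypothetical column description; real
# validation is performed by the csvw and pycldf libraries.)
from decimal import Decimal

# A CSVW-like datatype description for a hypothetical 1-7 rating column:
datatype = {"base": "decimal", "minimum": 1, "maximum": 7}

def read_cell(raw, dt):
    """Convert a raw CSV cell and enforce the declared range."""
    value = Decimal(raw)
    if not (dt["minimum"] <= value <= dt["maximum"]):
        raise ValueError(f"{value} outside [{dt['minimum']}, {dt['maximum']}]")
    return value

print(read_cell("4.5", datatype))  # a valid rating
# read_cell("9.2", datatype) would raise ValueError
```

In this way the datatype description doubles as documentation (what values a reuser may expect) and as a machine-checkable constraint.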

(Wilkinson et al., 2016) on Zenodo. In addition, they adhere to the Cross-Linguistic Data Formats initiative (CLDF; see Forkel et al., 2018; https://cldf.clld.org), which aims at representing lexical data in straightforward tabular formats. As a result, the datasets are presented in a useable and accessible format. Overall, all datasets comply with the FAIR principles, according to which the data should be findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). Finally, these CLDF datasets are also available on GitHub and serve as input for the CLLD web applications for Concepticon (concepticon.clld.org) and NoRaRe (norare.clld.org).

patterns and language variation: Word frequencies across English, German, and Chinese. In: Michael Zock, Emmanuele Chersoni, Alessandro Lenci, and Enrico Santus, editors, Proceedings of the Workshop on the Cognitive Aspects of the Lexicon. Barcelona (Online), Association for Computational Linguistics, 2020b; 23-32. Reference Source

Tjuka A, Forkel R, List JM: NoRaRe. A database of cross-linguistic norms, ratings, and relations for words and concepts (Version 0.2). Max Planck Institute for the Science of Human History, Jena, Germany, 2021.
○ Providing an example where the described limitation made it impossible to create a mapping would be helpful for the reader.

○ The authors make reference to csvkit (pypi.org/project/csvkit/1.0.7), a tool that enables the processing and detailed analysis of the data in Concepticon and NoRaRe. It is worth noting that a newer version (1.1.0) of this tool is available.

○ For readers who are not acquainted with the Concepticon, the term "narrow concept set" may be unfamiliar. Could you please provide further explanation of the distinction between broad and narrow concepts (even though the content becomes obvious with the examples that follow)?
○ p. 5: The authors write that "Concepticon currently has 3,914 concept sets with an average of 231.76 concept sets mapped in a given list." It would be informative to learn which lists have the highest and lowest coverage of concept sets. A table featuring the top 5 and bottom 5 lists would be sufficient.

○ p. 5: In the paper, the term 'elicitation gloss' is used. The authors can provide further clarification on the term (and the rationale behind its use). My understanding is that the term "elicitation gloss" is equivalent to "word" in NoRaRe, whereas in Concepticon, it is equivalent to "concept." This difference could potentially be confusing for researchers who intend to use both resources. As was done in Tjuka et al. (2022a), it would be helpful if the authors could elaborate on the distinction between concepts and word forms and the reasons why one could use them interchangeably.

○ p. 6: The authors write that "It is important to note that we try to map as many elicitation glosses in a list as possible while at the same time, improving the mappings by not mapping an elicitation gloss to a concept set if the meaning cannot be disambiguated."

○ p. 7: The URL for the Python package pyconcepticon (pypi.org/project/pyconcepticon/3.0.0) leads to the Concepticon collection, whereas the URL for the Python package pynorare (pypi.org/project/pynorare/1.0.1) does not provide direct access to the NoRaRe collection.

We thank the reviewer for the thorough check of our link collection. We decided to reference a link to the PyPI page rather than GitHub for the Python packages. Now that data access works via CLDF, pynorare and pyconcepticon are only implementation details. We have stopped archiving releases on Zenodo for Python packages since PyPI is the authoritative archive. That is why we no longer make releases on GitHub (which only served to move the Python package to Zenodo).

○ p. 7: The authors write that "the fact that CLDF datasets can be converted to relation databases turning relations between tables into foreign key constraints implies that valid CLDF datasets will have unique identifiers for each row of a table." Could you offer additional clarification on the term "foreign key constraints" as used in this context? As far as I understand, it pertains to a set of rules that ensure that the values in one table correspond to the values in another table. Nevertheless, it may be beneficial to provide further explanation here.

Reviewer 1 had similar concerns about the sentence so we paraphrased it to make the terminology clearer. "CLDF datasets follow the data model of relational databases, where tables can link to each other using foreign keys. This means that CLDF datasets can automatically be converted to relational databases to facilitate efficient data access. It also means that CLDF datasets follow the recommendations for "tidy" data (Wickham, 2014; Wilson et al., 2017), in particular by providing unique identifiers for each row of a table."

○ p. 7: The authors make reference to csvkit (pypi.org/project/csvkit/1.0.7), a tool that enables the processing and detailed analysis of the data in Concepticon and NoRaRe. It is worth noting that a newer version (1.1.0) of this tool is available.

We updated the link to the newest version of csvkit as suggested by the reviewer. "Tools such as csvkit (pypi.org/project/csvkit/1.1.1) allow the processing and in-depth study of the available data in Concepticon and NoRaRe."

○ p. 7: For readers who are not acquainted with the Concepticon, the term "narrow concept set" may be unfamiliar. Could you please provide further explanation of the distinction between broad and narrow concepts (even though the content becomes obvious with the examples that follow)?

We added further information on the broader/narrower concept distinction. "The relations broader versus narrower are used to indicate concept sets that refer to specific parts of a more general concept. For example, we established the broader concept set 3626 KNOW to relate it to the narrower concept sets 1410 KNOW (SOMETHING) and 2248 KNOW (SOMEONE) because we found that several concept lists did not provide a clear distinction between the two concepts. It is important to note that the Concepticon does not provide a full ontology but rather that the relations are established bottom-up and mainly serve as an exploration of the data instead of providing a basis for inferences."

○ p. 7: The authors write: "1,262 BROTHER, i.e., 559 BROTHER (OF MAN), 560 BROTHER [...]"

The commas were included in the typesetting of the manuscript. We thank the reviewer for noticing them. We deleted all additional commas in the identifiers of concept sets in the revision. "For Concepticon, queries could include listing the Concepticon concept sets in a given concept list or showing concept set relations such as plotting all the narrower concept sets connected to 1262 BROTHER, i.e., 559 BROTHER (OF MAN), 560 BROTHER (OF WOMAN), 1759 OLDER BROTHER, 1760 YOUNGER BROTHER, etc."

○ p. 8: The authors may refer to the newer version of the clld toolkit (clld 10.0.0).
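The relational conversion and foreign-key behaviour discussed above can be sketched with Python's built-in sqlite3 module. The table and column names below are illustrative, not the actual CLDF schema:

```python
# Sketch of CLDF-style tables as a relational database: a foreign key
# constraint guarantees that every word row points at an existing concept
# set. (Table and column names are illustrative, not the CLDF schema.)
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
con.execute("CREATE TABLE conceptset (id TEXT PRIMARY KEY, gloss TEXT)")
con.execute("""CREATE TABLE word (
    id INTEGER PRIMARY KEY,
    form TEXT,
    conceptset_id TEXT REFERENCES conceptset(id))""")

con.execute("INSERT INTO conceptset VALUES ('1262', 'BROTHER')")
con.execute("INSERT INTO word (form, conceptset_id) VALUES ('brother', '1262')")

try:
    # A link to a non-existent concept set violates the constraint:
    con.execute("INSERT INTO word (form, conceptset_id) VALUES ('x', '9999')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

This is exactly the guarantee described in the revised sentence: rows have unique identifiers, and cross-table links can only point at rows that exist.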

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Partly
Are sufficient details of methods and materials provided to allow replication by others? No
Are the datasets clearly presented in a useable and accessible format? Yes

Competing Interests: No competing interests were disclosed.

I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.
We added a footnote in which we refer readers interested in the major ideas behind CLDF and how they contrast with TEI to our original publication introducing CLDF from 2018 (Forkel et al., 2018). Footnote: "We refer readers interested in the major design principles behind CLDF and the contrasts with other standardization efforts to the original publication introducing CLDF by Forkel et al. (2018)."

○ Please remove technical details that belong in the documentation, not the article.

We thank the reviewer for the suggestion but we decided to keep relevant technical details in the manuscript. The reason for the decision is that we consciously chose the category "Data Note" instead of "Research Article" so that we could provide a text that includes all necessary information in one place. While we also provide documentation on GitHub, we noticed that researchers tend to rely on published papers. Further details on how to use specific functionalities can also be found on our blog Computer-Assisted Language Comparison in Practice (https://calc.hypotheses.org/). The Data Note was intended to bring these different resources together and provide a reference point for researchers who are interested in using our resources. The links to the documentation in the GitHub repository: https://github.com/concepticon/norare-data/blob/master/RELEASING.md
○ p. 7, "the fact that CLDF datasets can be converted to relation databases turning relations between tables into foreign key constraints implies that valid CLDF datasets will have unique identifiers for each row of a table": please rephrase in non-technical terms and explain why this is good.

We rephrased the sentence with less technical terminology. "CLDF datasets follow the data model of relational databases, where tables can link to each other using foreign keys. This means that CLDF datasets can automatically be converted to relational databases to facilitate efficient data access. It also means that CLDF datasets follow the recommendations for "tidy" data (Wickham, 2014; Wilson et al., 2017), in particular by providing unique identifiers for each row of a table."

○ p. 7, is "careless" really what you mean here?

We deleted the adjective from the sentence. "Especially, if a new concept list or improvements on existing data are added, the tests that run with the command concepticon test can spot inconsistent mappings, missing files, incorrect numbering of the concept sets, and many more mistakes."

No competing interests were disclosed.