A comparative wordlist for the languages of The Gran Chaco, South America

Home to more than twenty indigenous languages belonging to six linguistic families, the Gran Chaco has raised the interest of many linguists from different backgrounds. While some have focused on finding deeper genetic relations between different language groups, others have looked into similarities from the perspective of areal linguistics. In order to contribute to further research of areal and genetic features among these languages, we have compiled a comparative wordlist consisting of translational equivalents for 326 concepts — representing basic and ethnobiological vocabulary — for 26 language varieties. Since the data were standardized in various ways, they can be analyzed both quantitatively and qualitatively. In order to illustrate this in detail, we have carried out an initial computer-assisted analysis of parts of the data by searching for shared lexicosemantic patterns resulting from structural rather than direct borrowings.


Introduction
The Gran Chaco is a South American eco-region that extends through north-central Argentina, eastern Bolivia, western Paraguay and southern Brazil. It is located north of the Salado river, east of the Andes mountains, south of the Amazon, from which it is separated by the Chiquitania, and west of the Paraguay and Paraná rivers. Apart from languages that have entered the region through conquest and colonization, such as Spanish, German and Paraguayan Guaraní, the region is home to indigenous languages of six different families: Guaicuruan, which includes Toba, Western Toba, Pilagá, Mocoví, Kadiwéu and extinct Abipón; Matacoan or Mataguayan, which includes Wichí, Maká, Nivaclé, and Chorote; Enlhet-Enenlhet, which includes Enlhet, Enxet, Enenlhet, Guaná, Sanapaná and Angaité; Zamucoan, which includes Ayoreo and Chamacoco; Lule-Vilela, which includes only Lule and Vilela; and Tupi-Guarani, which in the Gran Chaco includes Tapiete, Ava, and Guaraní Izoceño but which also extends all through South America (Campbell & Grondona, 2012; Durante, 2018; Fabre, 2005; Golluscio & Vidal, 2010). For many of these languages there are also different geographic varieties.
The linguistic diversity of the Gran Chaco and the striking similarities in the features of some apparently unrelated languages have attracted the attention of numerous linguists, who have approached the topic from various theoretical and methodological frameworks. On the one hand, much research has focused on genetic relations among the languages. Recently, for instance, it has been stated that Vilela and extinct Lule are related and the family has been named Lule-Vilela (Viegas Barros, 2001), or that Guaicuruan and Matacoan languages have a common genetic origin and belong to one family, termed Guaicuruan-Matacoan (Viegas Barros, 1993; Viegas Barros, 2013a).
Previous work had proposed even greater language family groupings (Kaufman, 1990; Mason, 1950). On the other hand, similarities among Chaco languages, not only Guaicuruan and Matacoan, have been analysed from the perspective of areal linguistics. Such similarities include phonological traits such as the presence and absence of certain phonemes, as well as grammatical features like the presence of possessive classifiers and noun determiners (Comrie et al., 2010).
Fewer studies, however, have focused on shared semantic features that are visible in the lexicon in the form of similar lexical motivation patterns (Campbell & Grondona, 2012; Messineo et al., 2010). In that sense, we consider a large-scale dataset for further comparison of the Gran Chaco languages a necessary tool that has been lacking so far. Even though there have been many valuable works that compare different languages of the region, some of the criteria are inconsistent, and they seldom deal with the entirety of the indigenous languages of the Gran Chaco in a human- and machine-readable way. Such an enterprise should be a starting point for a project that includes genetic comparison and concrete investigation of both lexical and pattern borrowing across Chaco languages of different families.

Materials
Two different datasets were first compiled individually and later combined for this study. The first one comprised a list of 502 concepts reflecting basic vocabulary terms translated into 23 language varieties spoken in the Chaco area and two language varieties from other regions. The second one consisted of 825 ethnobiological concepts translated into 16 Chaco varieties. While the coverage for the basic dataset was rather high, with most languages showing word forms for 80% or more of the data, the coverage for the ethnobiological dataset was rather low, since the terms are highly specific and it was often difficult to find translations for all terms in the resources available for the respective varieties. In order to allow for a more targeted comparison of the languages with respect to lexical structures, we then decided to combine them. Although previous research showing interesting cases of pattern borrowing in flora and fauna vocabulary had sparked our interest in that domain, this decision was motivated by our realization that the lexical motivation for the formation of individual terms still depends to a large degree on words and morphemes that are primarily found in the realm of basic vocabulary. Thus, a combined list, albeit imperfect, permits a detailed study of pattern borrowing while taking lexical motivation patterns into account. For this purpose, we selected 224 concepts from the basic vocabulary lists and 100 ethnobiological concepts, resulting in a total of 324 concepts for 23 language varieties (see Table 1), which are geographically distributed across and around the Chaco area (see Figure 1).
The collection of basic words was compiled from various sources, mainly dictionaries, but in some cases also from grammatical descriptions. One of the largest contributors was the Intercontinental Dictionary Series (IDS; Key & Comrie, 2021).

Amendments from Version 1
In this revised version, we have not modified the data, but rather tried to take the suggestions of the reviewers into account. As a result, the text contains some additional paragraphs in which we try to be a bit more transparent regarding the shortcomings of the data collection procedure we used in order to collect this dataset.
Any further responses from the reviewers can be found at the end of the article.

Column F refers to the forms in the data, column C refers to the concepts that are covered, columns B and E refer to the number of concepts covered from basic and ethnobiological vocabulary, and column Co refers to the coverage (number of attested concepts divided by number of concepts in the whole wordlist).

Other material came from individual sources available for the respective varieties, mainly dictionaries, wordlists, and compilations of different Chaco languages. In these cases, translational equivalents for the basic words were identified manually. The collection of ethnobiological terms was typically compiled from specific lists of ethnobiological vocabulary, taken from articles and books dedicated to the topic, but in some cases, unified resources for basic vocabulary and ethnobiological terms were available and could be used.
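
The coverage measure mentioned above (number of attested concepts divided by the number of concepts in the whole wordlist) can be illustrated with a minimal Python sketch. The function name `coverage`, the variety names, and the toy forms are assumptions made for this illustration and are not taken from the dataset.

```python
from collections import defaultdict


def coverage(entries, concepts):
    """Return, per variety, the share of concepts with at least one form.

    entries  -- iterable of (variety, concept, form) tuples
    concepts -- the full concept list of the wordlist
    """
    concept_set = set(concepts)
    attested = defaultdict(set)
    for variety, concept, form in entries:
        # Count only non-empty forms for concepts that belong to the wordlist.
        if form and concept in concept_set:
            attested[variety].add(concept)
    return {v: len(c) / len(concept_set) for v, c in attested.items()}


# Toy data: hypothetical placeholders, not taken from the dataset.
concepts = ["LIP", "BEAK", "WATER", "FIRE"]
entries = [
    ("VarietyA", "LIP", "form1"),
    ("VarietyA", "BEAK", "form2"),
    ("VarietyA", "BEAK", "form3"),  # a variant: the concept is counted once
    ("VarietyA", "WATER", "form4"),
    ("VarietyB", "LIP", "form5"),
    ("VarietyB", "FIRE", ""),       # empty form: not counted as attested
]
result = coverage(entries, concepts)
print(result)
```

Counting concepts rather than forms means that a variety with several variants for one concept does not appear better covered than it is.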

Methods
In creating our resource, we had two major goals in mind. On the one hand, we wanted to create a resource that is both human- and machine-readable at the same time, allowing us to analyse the data and annotate particular findings step by step in future work (this process is ongoing and might be featured in studies to be published in the future). On the other hand, we wanted to create a resource that can be easily compared with other lexical resources, both on a world-wide and a regional scale. This allows us to make use of additional information or to compare our findings with those reported for other areas of the world in our future work. In order to achieve the first goal, we used an internal representation of the data for analysis and annotation, based on the Etymological Dictionary Editor (EDICTOR, Version 2.0, List, 2021a), in which we curate the data manually, annotating the data for various aspects, such as cognacy, borrowings, or borrowed patterns (loan translations) shared across the Chaco languages. In order to achieve the second goal, we converted our data to Cross-Linguistic Data Formats (CLDF, Forkel et al., 2018), using the Lexibank workflow for the curation of lexical data in CLDF (List et al., 2022a).
While data curation and annotation with the help of the EDICTOR tool were largely done in a manual fashion, the conversion to CLDF was mostly done automatically, providing additional steps that helped us to identify potential problems in our data.

Data curation with EDICTOR
Basic vocabularies and ethnobiological vocabularies were first collected separately. Only later, when we realized that both can be better analyzed in combination, did we decide to combine them. For this purpose, we settled on a combined list of 324 items, with 224 basic vocabulary items and 100 ethnobiological items in total. Both datasets were combined to form a single TSV file in the format required by the EDICTOR tool and converted to an SQLite database, using the PyEdictor package (List, 2021b, Version 0.4), which we use to allow for the convenient online editing of the data.
Our main intention for the analysis was to annotate structural borrowings, that is, cases of borrowings in which it is not the word form that is being transferred, but rather the lexical motivation by which certain objects can be denoted. As an example, consider the English term "(computer) mouse", which is reflected as ratón de computadora (literally "mouse or rat of the computer") in Spanish.
In order to annotate structural borrowings in the Chaco data, we made use of existing annotation schemes that were developed for the handling of partial cognates (Hill & List, 2017), later extended to handle more complex cases of language-internal cognates and semantic shift (Schweikhard & List, 2020), and ultimately implemented in Version 2.0 of the EDICTOR tool (List, 2021a). The main idea of these annotation schemes is to provide what we call 'morpheme glosses' for each word form in the data and combine these with identifiers for partial cognates (see List et al., 2016).
As an example, consider the words for "beak" and "lip" for Maká and Chorote (both from the Matacoan language family) and Pilagá (from the Guaicuruan family) in Table 2. As can be seen from the table, all three language varieties express the word for "beak" by using the entire word or a part of the word for "lip". Since Pilagá is not related to Chorote and Maká, and the form that expresses the concept "lip" in Pilagá ([a s e p], according to our annotation) is not cognate with the form [p a s] in Chorote and Maká, we assign these forms different cognate set identifiers (2 for [a s e p] and 4 for [p a s]). But since we judge the pattern as identical, consisting of a possessive marker (marked as :poss in our morpheme glosses) and the reuse of the form "lip" to denote the concept "beak", we assign them the same pattern identifier, indicating that we have a shared structure here. Whether this structural commonality is due to language contact or due to independent processes of lexical change cannot be said at this point, since the pattern annotation is work in progress and has not been done for all of the data. Assembling more of these patterns in our data, however, will eventually allow us to find out whether these scenarios might result from contact or not.
Table 2 shows words for "beak" and "lip" across three varieties from two language families. While word forms are not cognate across the two language families, and also not borrowed directly, we find structural similarities with respect to the motivation. In all three varieties, our annotation assumes that the word for "beak" is derived from the word for "lip". We indicate this structural commonality with the help of identifiers that reflect the abstract structure (column Structure) and with the help of morpheme glosses that provide an analysis of the underlying motivation (column Morpheme Glosses). Note that our analysis is not the only possible one for the given data. One could likewise argue or speculate that the word for "beak" was primary and that the word for "lip" was derived from it. In this case, the morpheme glosses would have to be modified. Finally, in order to avoid being forced to make a decision on the primary word form, one can also use neutral morpheme glosses like "beak/lip", which explicitly avoid making any judgment regarding primary or secondary word forms in the data.
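
The distinction between cognate set identifiers and pattern identifiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` class and the function `cross_family_patterns` are hypothetical and do not reproduce the EDICTOR data model; the identifiers 2 and 4 for the "lip" morphemes follow the example above, whereas the identifiers for the possessive markers (1 and 3), the pattern identifiers, and the last entry are placeholders invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entry:
    variety: str
    family: str
    concept: str
    cognate_ids: tuple  # one identifier per morpheme (partial cognacy)
    glosses: tuple      # one morpheme gloss per morpheme
    pattern_id: int     # identifier of the abstract structure


def cross_family_patterns(entries):
    """Return pattern IDs that recur in two families sharing no cognate set."""
    cognates = defaultdict(lambda: defaultdict(set))
    for e in entries:
        cognates[e.pattern_id][e.family].update(e.cognate_ids)
    result = []
    for pattern_id, by_family in sorted(cognates.items()):
        pairs = combinations(sorted(by_family), 2)
        # Same structure, but no shared word material between the families.
        if any(by_family[a].isdisjoint(by_family[b]) for a, b in pairs):
            result.append(pattern_id)
    return result


entries = [
    # Cognate IDs 2 and 4 follow the text; 1, 3 and 5 are placeholders.
    Entry("Pilagá", "Guaicuruan", "BEAK", (1, 2), (":poss", "lip"), 1),
    Entry("Chorote", "Matacoan", "BEAK", (3, 4), (":poss", "lip"), 1),
    Entry("Maká", "Matacoan", "BEAK", (3, 4), (":poss", "lip"), 1),
    # Hypothetical pattern attested in a single family only.
    Entry("Chorote", "Matacoan", "OTHER", (5,), ("other",), 2),
]
shared = cross_family_patterns(entries)
print(shared)
```

Keeping the two kinds of identifiers in separate fields allows a query of this kind to find shared structures without confusing them with shared word forms. Neutral glosses such as "beak/lip" would fit the same layout, since only the gloss strings change.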

Data Sharing with CLDF
Whenever substantial changes to the data have accumulated and we decide to release a new version, we export the dataset and convert it automatically to CLDF. In doing so, we carry out several consistency checks of the data and make sure that the individual datapoints are maximally comparable across datasets from different sources. The CLDF conversion is carried out with the help of the CLDFBench toolkit, which offers a command line interface that facilitates the conversion of language data to CLDF formats (Forkel & List, 2020, https://pypi.org/project/cldfbench). Since we are working with lexical data, we additionally use the PyLexibank plugin for CLDFBench (Forkel et al., 2021), which offers extended functionality (see List et al., 2022a). The conversion to CLDF makes sure that our concepts are regularly linked to the most recent version of the Concepticon reference catalogue (List et al., 2022b), that all languages, where possible, are linked to Glottolog (Hammarström et al., 2022), and that the transcriptions follow the standards proposed by the Cross-Linguistic Transcription Systems reference catalogue (CLTS, List et al., 2021). Since the CLDF standard does not yet offer standards to annotate structural borrowings, we define custom formats for now (see Table 2), which we will propose for inclusion in future versions of CLDF. In the following, we discuss the integration of our data with the three reference catalogs (Concepticon, Glottolog, and CLTS) in more detail.

Concept linking.
The concept list underlying our study was linked to the Concepticon reference catalogue (Version 2.6, List et al., 2022b). Concepticon offers unique identifiers for various concepts that are frequently used in questionnaires for language documentation and historical language comparison. Since Concepticon is by now more and more often used as a common standard reference for lexical datasets, also underlying large collections such as the Database of Cross-Linguistic Colexifications (CLICS) (Rzymski et al., 2020) or the Lexibank repository of standardized wordlists in CLDF formats (List et al., 2022a), we also made sure to link the concepts in our data to Concepticon, where possible. For the very specific plant and animal names in our data, however, the Concepticon does not offer concept identifiers. Here, we therefore linked our data to the Global Biodiversity Information Facility (GBIF).

Language mapping.
Another way of linking the data with already existing sources consists in the linking of language varieties to the Glottolog project (Hammarström et al., 2022). Glottolog provides unique identifiers for language varieties, including dialect points and ancient varieties, along with additional information regarding the language families to which the respective languages belong. For two varieties in our data, no Glottocode could be found. These are Manjui, which is a variety of Chorote spoken in the territory of Paraguay, and Toba de Cerrito, also spoken in the Paraguayan Chaco. These have not been identified as separate varieties on Glottolog yet, but might be added in future versions.
Most of the languages in our dataset are spoken in the Gran Chaco region of South America, in the territories of Argentina, Bolivia, Brazil, and Paraguay. In addition, we have chosen three languages spoken in adjacent regions, which we hope to use as control cases in future analyses, namely Mapudungun (Araucanian), spoken in southern Chile and Argentina, Mbyá (Tupí-Guaraní), spoken in Argentina, Brazil and Paraguay, and Quichua Santiagueño (Quechuan), spoken in north-central Argentina. Although we are aware that these languages are spoken in the vicinity of the Gran Chaco, we included them as control languages because we intend to find shared semantic patterns that are not even found in adjacent territories. However, while some patterns have been observed in our data only in the Gran Chaco languages, others also appear in the control languages. While it is true that areal influence does not end abruptly, and thus those coincidences could also be due to language contact exceeding the limits of the Gran Chaco, this could also be explained by the fact that not all shared semantic patterns are equally ubiquitous, with some patterns being more likely shared due to common typological traits in the world's languages. This point, and the need for a hierarchy on pattern borrowing in order to rank the evidence by strength, is discussed in the conclusion. Even so, future studies should include control languages spoken in additional locations (in and out of South America) in order to render the results more robust. Finally, Paraguayan Guaraní is usually not considered a Chaco language in origin, but it has an undeniable influence on indigenous communities of the Gran Chaco, especially in the territory of Paraguay, where it is the second and sometimes the first language of many indigenous people who are multilingual in other languages.
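
The role of the control languages described above can be made concrete with a small sketch that separates patterns attested only in Chaco varieties from patterns that also occur in a control language. The three control language names follow the text; the function `split_patterns`, the pattern identifiers, and the attestations are hypothetical and serve only to illustrate the comparison.

```python
CONTROL = {"Mapudungun", "Mbyá", "Quichua Santiagueño"}


def split_patterns(attestations, control=CONTROL):
    """Split pattern IDs into Chaco-only patterns and patterns also found
    in at least one control language.

    attestations -- dict mapping a pattern ID to the set of varieties
                    in which the pattern is annotated
    """
    chaco_only, also_in_control = [], []
    for pattern_id, varieties in sorted(attestations.items()):
        if varieties & control:
            also_in_control.append(pattern_id)
        else:
            chaco_only.append(pattern_id)
    return chaco_only, also_in_control


# Hypothetical attestations, invented for the illustration.
attestations = {
    1: {"Pilagá", "Chorote", "Maká"},
    2: {"Wichí", "Mapudungun"},
    3: {"Ayoreo", "Toba", "Mbyá"},
}
chaco_only, also_in_control = split_patterns(attestations)
print(chaco_only, also_in_control)
```

A pattern that ends up in the second list is not necessarily weak evidence, since areal influence may extend beyond the Chaco, but such a split would be a first step towards ranking patterns by how informative they are.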
When searching for the translational equivalents of individual concepts in our concept lists in the different sources for the varieties we included in our sample, it is often difficult to decide which word corresponds best to a given concept, specifically in cases where one has to choose from several variants. Variants may arise for several reasons. On the one hand, two translations for the same concept may correspond to different varieties that have been included in the same resource. For example, we have added a document for a variety of Toba spoken in Paraguay, Toba de Cerrito. However, this variety has two subvarieties, one spoken in the village of Rioverde and the other spoken in the village of Rosario. In those cases in which these subvarieties display different forms, we indicate in a comment which form corresponds to which variety. In future versions of the database, we plan to find more principled ways of handling this kind of dialectal variation. On the other hand, different resources may give different forms for the same concept but no indication in which regard the forms differ (e.g., regarding their usage, specific semantic nuances, etc.). In these cases we indicated the different sources in our comments, but hope to find a more principled way to handle these cases of variation in future versions of our database.
This study includes Lule and Abipón, two extinct varieties of which no speakers are known to have survived until today. The original sources of these varieties were written by missionaries in the eighteenth and nineteenth centuries. Since transcription practices differed largely in the past, we cannot fully vouch for the accuracy of the transcriptions we used. Including the varieties in the study has proven useful, however, since it allowed us to check whether certain kinds of semantic patterns already existed 200 to 300 years ago.

Phonetic transcriptions.
After having compiled the vocabulary in the corresponding sheets, the forms were converted into a broad version of the International Phonetic Alphabet, called B(road)IPA, the central transcription system underlying the five transcription systems provided in the CLTS reference catalog. For the initial conversion, we made use of orthography profiles (Moran & Cysouw, 2018), which are integrated into the Lexibank workflow for the curation of lexical data, which we used for our study (List et al., 2022a). In this workflow, original forms are preserved, and for the target phonetic transcriptions used for cross-linguistic comparison, automatic tests are carried out to make sure they only reflect sounds defined in the CLTS reference catalog.
The conversion of transcription systems used by individual scholars to standardized transcriptions that conform to CLTS can be considerably tedious, especially when a different transcription system underlies the data from every source. The conversion therefore required an intensive study of the phonological descriptions of all language varieties in our sample, for which information could often only be found in broader grammatical descriptions. Inspecting the data also revealed that our initial conversion to phonetic transcriptions with orthography profiles was at times not optimal or contained occasional errors, which we then had to correct manually by modifying the data in the EDICTOR application. For the two extinct languages in our collection, Lule and Abipón, no reliable phonological descriptions are available. In the case of Abipón, we followed the description of the phonology in Viegas Barros (2013b), based on comparison with other Guaicuruan languages. For Lule, we followed Zamponi's analysis from 2008.
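
The basic idea behind an orthography profile, a table that maps graphemes of a source orthography to target sounds, can be sketched as a greedy longest-match conversion. The following Python code is a simplified re-implementation for illustration and not the code used in the Lexibank workflow; the function `segment`, the toy profile, and the example word are assumptions that do not describe the orthography of any variety in the sample.

```python
def segment(form, profile):
    """Convert an orthographic form into a list of target sounds.

    At each position the longest grapheme listed in the profile is matched
    first, so that a digraph like "ch" wins over the single letter "c".
    """
    graphemes = sorted(profile, key=len, reverse=True)
    segments, i = [], 0
    while i < len(form):
        for grapheme in graphemes:
            if form.startswith(grapheme, i):
                segments.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # Grapheme missing from the profile: flag it for manual checking.
            segments.append("<" + form[i] + ">")
            i += 1
    return segments


# Hypothetical profile: grapheme -> target sound.
profile = {"ch": "tʃ", "c": "k", "a": "a", "o": "o", "'": "ʔ"}

converted = segment("chaco", profile)
flagged = segment("chaxo", profile)
print(" ".join(converted))
print(" ".join(flagged))
```

Flagging unknown graphemes instead of silently dropping them mirrors the observation above that an initial automatic conversion needs to be inspected and refined by hand.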

Implementation
Having set up the data in its current form, our workflow for data curation and analysis now consists of two steps. In a first step, the data is analyzed using the EDICTOR tool. Figure 2 shows how the data appear in the Wordlist panel of the EDICTOR interface.
In order to share the data publicly, we then used the Lexibank workflow (List et al., 2022a) to convert the data automatically into Cross-Linguistic Data Formats, which can be triggered from the command line. The conversion automatically checks various aspects of the data, including the transcriptions as reflected in a given version of the CLTS reference catalog, the mapping to a given Glottolog version and a given Concepticon version, and the formal correctness of currently available annotations.
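
The kind of consistency check carried out during the conversion can be illustrated with a short sketch that reads a table of forms and reports rows whose language, concept, or sounds are not known. This is not the code of the Lexibank workflow: the function `check_forms` is hypothetical, the column names are assumed to follow the default layout of a CLDF form table, and the rows and reference sets are toy data.

```python
import csv
import io


def check_forms(rows, languages, concepts, sounds):
    """Collect (form ID, message) pairs for rows that fail a basic check."""
    problems = []
    for row in rows:
        if row["Language_ID"] not in languages:
            problems.append((row["ID"], "unknown language: " + row["Language_ID"]))
        if row["Parameter_ID"] not in concepts:
            problems.append((row["ID"], "unknown concept: " + row["Parameter_ID"]))
        # Segments are stored as a space-separated string of sounds.
        for sound in row["Segments"].split():
            if sound not in sounds:
                problems.append((row["ID"], "unknown sound: " + sound))
    return problems


# Toy table (hypothetical rows, assumed column names).
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form,Segments
f1,variety_a,lip,pas,p a s
f2,variety_a,beak,pasX,p a s X
f3,variety_c,lip,asep,a s e p
"""

rows = list(csv.DictReader(io.StringIO(FORMS_CSV)))
problems = check_forms(
    rows,
    languages={"variety_a", "variety_b"},
    concepts={"lip", "beak"},
    sounds={"p", "a", "s", "e"},
)
for form_id, message in problems:
    print(form_id, message)
```

In the actual workflow, the reference sets would correspond to the chosen versions of Glottolog, Concepticon, and CLTS rather than to hand-written sets.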

Conclusion
Although we consider the collection of the dataset reported here as preliminary, it has reached a stage where we can start with the concrete analysis of individual patterns in the data (Brid et al., 2022). In the future, we plan to enhance the current dataset further and also extend the annotation of cognate words and structural borrowings.
Although we consider the dataset good enough to publish at this point, we should make clear that we are not fully content with all decisions we made in the past when collecting our data. By explicitly pointing to these points of dissatisfaction, we hope to help readers of this study avoid our mistakes when conducting similar work.
Firstly, we warn future researchers against mixing multiple sources for the same language varieties with no overt indication. For instance, our Chorote, Wichí, and Ayoreo data come from different sources. Although it may be important to include multiple sources, it would be advantageous to include a reference to the source in the database, perhaps in a separate column. This would make a discussion of the data and the underlying decisions which led to their creation more transparent. Also, it may turn out that a source differs from another source because it is based on a different language variety, perhaps more in contact with another language of the region. In such a case, having that information at one's disposal would be highly relevant for the results.
Even if sources are overtly indicated, a future reader would have to find the entries in the source. However, at present our data is not visible in its original orthography. For that reason, we encourage similar projects in the future to keep the original transcription in a separate column. This would enable users to copy and paste the original form in order to look it up in the original source. We plan to solve these two issues in the future, but at this stage, our data curation process had advanced too far to allow us to handle these problems efficiently.
Finally, it would also be desirable to rank the evidence for borrowing by strength. This means that, in order to address the topic of areal influence on shared semantic patterns, one would like to be able to tell the difference between patterns that may be shared due to typological traits common to the world's languages and patterns that are more likely shared due to areal influence. This requires a theoretical and methodological apparatus that makes it possible to posit some kind of hierarchy on pattern borrowing. Since, to the best of our knowledge, such an apparatus does not exist at the moment, we can only hope for future research to provide us with additional tools to enhance the analysis of our datasets.

The word list consists of basic vocabulary (the largest part) and another part which is ethnobotanical vocabulary. The paper mentioned that the authors realized that these two parts were best combined into one. Can you say more about the reason behind this, why is it better to combine them? And did or do you have any expectations wrt borrowability of semantic structure of basic vocabulary versus more peripheral (ethnobotanical) vocabulary. When it comes to form borrowing, the received wisdom is that basic vocabulary is more resistant to borrowing than flora and fauna vocabulary, but is there an equivalent expectation for structural borrowing?

2. Semantic coding
I understand that, in the example given, the words for lip and beak are both assigned to the concept LIP. The important pattern here seems to be that words for beak and lip are connected to one and the same concept, and it wasn't entirely clear to me what happens if two languages have one and the same underlying concept for both lip and beak, but in slightly different ways. I'll sketch two hypothetical scenarios. One (admittedly unlikely) scenario is that there might be independent evidence that in fact BEAK is the original meaning and that it is more truthful to connect both the words for lip and beak to the concept BEAK. Would that count as a full mismatch with the languages that connect both words to the concept of LIP? Another scenario would be that the word for lip is in fact connected to the concept MOUTH, e.g. the word for lip might semantically be something like OUTER MOUTH, this in turn may be extended to the word for beak, but in this case both words are connected to a third concept MOUTH. Is that also a full mismatch with a language that extends LIP to the word for beak?

3. Types of semantic structural isomorphisms
Related to the previous point, do I understand correctly that the obligatory presence of a possessive prefix with words for lip and beak in the example count as much for a match on structural borrowing as the fact that both words are connected to the concept LIP? It seems to me that one match is more indicative about past contact than the other (in general: the more specific and unusual, the more informative). Or do you have ways to differentiate between different types of structural matches?

4. You mention a number of control languages. These are all spoken in the immediate vicinity of the Gran Chaco, and I don't think it can be excluded that there were contacts between speakers of the control languages and the target languages. So I wonder to what extent are these control languages. Are they meant to show a diminished number of commonalities, or are they meant to give a baseline of accidental commonalities? If the latter, I think the control language are not the best choices. In any case, it is good to make this clear.

5. I wonder if it is not a little too early to publish this paper. The concluding remarks suggest that the coding scheme can still change considerably. I cannot tell how much it may change and to what extent it would make the present publication obsolete, so I leave this for the consideration of the authors.

Are the datasets clearly presented in a useable and accessible format? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: My research interests include South American languages, reconstructing the social history in South America, language typology. I feel confident to assess the conceptual set up of the paper, but not to assess the technical details of the implementation.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
the introduction to the Materials section of the study, where we added more information on the advantages of the combined collection of basic vocabulary along with ethnobotanical terms.
"One (admittedly unlikely) scenario is that there might be independent evidence that in fact BEAK is the original meaning and that it is more truthful to connect both the words for lip and beak to the concept BEAK.Would that count as a full mismatch with the languages that connect both words to the concept of LIP? Another scenario would be that the word for lip is in fact connected to the concept MOUTH, e,g, the word for lip might semantically be something like OUTER MOUTH, this in turn may be extended to the word for beak, but in this case both words are connected to a third concept MOUTH.Is that also a full mismatch with a language that extends LIP to the word for beak?" ○ This is a very good remark which emphasizes the importance of taking historical pathways of semantic change into account when trying to match patterns of lexical motivation.We agree that one could definitely argue that the underlying patterns have different origins, while our current annotation practice points to a very specific direction of change.In order to avoid this, however, we can also employ an annotation of the partial colexification patterns that does not make any decisions regarding the direction of semantic change and lexical motivation processes.In such an annotation, we would leave it open, which form (BEAK or LIP) we take as the primary one, and we would indicate this by using a gloss BEAK/LIP in both cases.While this is a very simple solution to account for the problems raised here, it is clear that it may not be satisfying.However, we assume that the reviewer will agree with us that it is in any case difficult to judge which direction of change would be more probable.We have added a statement in our example that emphasizes that there are different solutions than the ones we propose and which also points to the "neutral" solution of morpheme glossing as an alternative."Related to the previous point, do I understand correctly that the obligatory presence of a possessive prefix 
with words for lip and beak in the example count as much for a match on structural borrowing as the fact that both words are connected to the concept LIP?It seems to me that one match is more indicative about past contact than the other (in general: the more specific and unusual, the more informative).Or do you have ways to differentiate between different types of structural matches?" ○ This is a very good point which we have not really thought through so far.It is clear that in theory, one should be able to rank the evidence.In this way, one could distinguish more surprising types of structural matches from less surprising ones and use this to indicate which one we consider as more likely than the others.However, at this stage in our analysis, where we are still trying to figure out the most transparent ways to analyze the data, we are not able to provide [4] We have, however, added a short paragraph in the final outlook of our study, where we indicate that it would be desirable to a) rank the evidence by strength, and to b) come up with some kind of a hierarchy on pattern borrowing that could guide the ranking process.any solutions for the ranking or for a systematic comparison of different types of commonalities and their respective force to provide strict evidence for pattern borrowing.
[4] We have, however, added a short paragraph in the final outlook of our study, where we indicate that it would be desirable to a) rank the evidence by strength, and to b) come up with some kind of a hierarchy on pattern borrowing that could guide the ranking process.

"You mention a number of control languages. These are all spoken in the immediate vicinity of the Gran Chaco, and I don't think it can be excluded that there were contacts between speakers of the control languages and the target languages. So I wonder to what extent these are control languages. Are they meant to show a diminished number of commonalities, or are they meant to give a baseline of accidental commonalities? If the latter, I think the control languages are not the best choices. In any case, it is good to make this clear."

○ We agree that the control languages are not a good choice to serve as a baseline for chance commonalities. Instead, the hope was to show that the closeness of the languages in the Chaco area leads to more commonalities among Chaco languages than with languages which are still spoken in South America but are not in direct contact (thus corresponding to the first scenario mentioned). Adding control languages that might serve to illustrate accidental commonalities is an idea that we should discuss in the future.

[5] For now, we have tried to clarify that the control languages in the current study were included as examples of South American languages that are not spoken in the Chaco, in order to see to what degree a possible Sprachbund in the Chaco area might have facilitated the large number of pattern borrowings that can be found there.
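The comparison described in this response (commonalities among Chaco languages versus commonalities between Chaco languages and the control languages) can be illustrated with a minimal sketch. All language names and pattern labels below are invented placeholders, not data from the study, and the simple pairwise count is an assumption about how such a comparison could be set up. As the response itself notes, this kind of contrast does not provide a baseline for chance commonalities.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical toy data: each language maps to the set of lexical
# motivation patterns attested in it. Names and labels are invented.
patterns = {
    "ChacoA": {"BEAK<LIP", "FINGER<HAND", "TEAR<EYE+WATER"},
    "ChacoB": {"BEAK<LIP", "FINGER<HAND"},
    "ChacoC": {"BEAK<LIP", "TEAR<EYE+WATER"},
    "ControlX": {"FINGER<HAND"},
    "ControlY": set(),
}
chaco = ["ChacoA", "ChacoB", "ChacoC"]
controls = ["ControlX", "ControlY"]


def shared(a, b):
    """Number of patterns attested in both languages."""
    return len(patterns[a] & patterns[b])


def mean_shared(pairs):
    """Average number of shared patterns over a collection of language pairs."""
    return mean(shared(a, b) for a, b in pairs)


# Pairs of Chaco languages versus Chaco/control pairs.
within_chaco = mean_shared(combinations(chaco, 2))
chaco_vs_control = mean_shared(product(chaco, controls))

print(f"within Chaco:     {within_chaco:.2f}")
print(f"Chaco vs control: {chaco_vs_control:.2f}")
```

A higher average within the Chaco group than across the two groups would be in line with the first scenario mentioned by the reviewer, but it cannot by itself separate contact from inheritance or chance.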
In general, sufficient detail is provided in the text and in the references cited to replicate the workflow. However, one issue that is worth mentioning is that some forms are attributed to multiple sources, e.g. the entries for Ayoreo and Chorote. When combined with the absence of page numbers for where particular forms are located in their respective sources, and an absence of the original orthographic transcription of the form in its source, identifying the provenance of certain forms is somewhat more difficult than need be. The presence of the original transcription of the forms in the source material would further make the phonological retranscription procedure carried out by the authors more transparent and replicable. It may be advantageous to consider each record in the dataset as a particular instance of a documented form attributed to a particular semantic concept in a particular source, a 'docunym'. While this may produce multiple forms attributed to the same concept for a single language, adding further complexity to computational work, this would address some of the issues discussed by the authors regarding having "to decide which word corresponds best to a given concept" across different sources and different language varieties.

The dataset as presented is useable and accessible to the target user, either as a reference for lexical information on the languages of the Gran Chaco or as a starting point for comparative analyses. Furthermore, the protocols adopted for the creation of this dataset make use of a suite of workflows, tools and reference catalogs that are not only appropriate, but help to define a standard for comparative lexical work within the field.
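The 'docunym' idea suggested in this report (one record per documented form, per concept, per source) could be sketched as a small data structure. The field names, the placeholder values, and the grouping helper are all assumptions made for illustration; they do not describe the format of the published dataset.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Docunym:
    """One documented form for one concept in one source (hypothetical layout)."""
    language: str
    concept: str
    source: str           # bibliographic key of the source
    page: Optional[str]   # page in the source, None if not recorded
    original_form: str    # spelling exactly as printed in the source
    transcription: str    # standardized phonological retranscription


# Invented placeholder records, not actual data from the wordlist.
records = [
    Docunym("LanguageA", "LIP", "SourceX", "12", "form-a", "forma"),
    Docunym("LanguageA", "LIP", "SourceY", None, "phorm-a", "forma"),
    Docunym("LanguageA", "BEAK", "SourceX", "40", "form-b", "formb"),
]


def group_by_language_concept(recs):
    """Collect all documented forms for each (language, concept) pair."""
    grouped = defaultdict(list)
    for rec in recs:
        grouped[(rec.language, rec.concept)].append(rec)
    return dict(grouped)


grouped = group_by_language_concept(records)

# Records whose provenance cannot be traced to a page.
missing_pages = [rec for rec in records if rec.page is None]
```

Keeping the original spelling next to the retranscription makes each conversion step checkable, while grouping by language and concept exposes the cases where several sources compete for the same slot.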

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: My areas of research are the documentation and description of South American indigenous languages, language typology, and historical linguistics.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. Languages covered in our study (with the exception of Mapudungun, which is located further south).

Figure 2. Curating the data with the help of the EDICTOR interface. The screenshot shows the Wordlist panel view of the EDICTOR tool. Word forms are rendered by coloring speech sounds according to their major sound class.

Las lenguas del Chaco desde la perspectiva de la semántica léxica. Análisis preliminar de patrones léxicos compartidos en el dominio etnobiológico [The languages of the Chaco from the perspective of lexical semantics. Preliminary analysis of shared lexical patterns in the ethnobiological domain].
Toba dictionary. In: Key, Mary Ritchie and Comrie, Bernard (eds.), The Intercontinental Dictionary Series. Leipzig: Max Planck Institute for Evolutionary Anthropology. 2021.