In recent years, there has been a surge of interest in tasks targeting Natural Language Understanding (NLU) and reasoning. Most recently, Large Language Models (LLMs) such as ChatGPT have received immense attention and have made such NLU tasks seem more tangible than ever. To create these models, many research efforts have focused on building massive datasets and training huge, deep models that reach human performance, cf. ChatGPT, PaLM 2, LLaMA, Falcon, but also Liu et al. (2019) and Pilault et al. (2020). The world knowledge encapsulated in such models and their robust nature enable them to deal with large and diverse amounts of data in an efficient way. However, it has been repeatedly shown that such models fail to solve basic human inferences and lack generalization power. When presented with differently biased data (Poliak et al., 2018; Gururangan et al., 2018; Kalouli et al., 2023), with smaller datasets containing fewer or less diverse phenomena (Bender et al., 2021), or with inferences involving hard linguistic phenomena (Dasgupta et al., 2018; Nie et al., 2018; Naik et al., 2018; Glockner et al., 2018; Richardson et al., 2020; McCoy et al., 2019; Bernardy and Chatzikyriakidis, 2019; Yanaka et al., 2020, to name only a few), they struggle to reach the baseline. Explicitly detecting and solving these weaknesses is only partly possible, e.g., through appropriate datasets, fine-tuning, or appropriate prompting, because such models act like black boxes with low explainability. At the same time, another strand of research has continued to target more traditional approaches to reasoning, employing some kind of logic or semantic formalism. Such approaches excel in precision, especially on inferences with hard linguistic phenomena, e.g., negation, quantifiers, modals, etc. (Bernardy and Chatzikyriakidis, 2017; Yanaka et al., 2018; Chatzikyriakidis and Bernardy, 2019; Hu et al., 2019; Abzianidze, 2020, to name only a few). However, they suffer from inadequate world knowledge and lower robustness, making it hard for them to compete with state-of-the-art models. Thus, a third research direction has lately sought to close the gap between the two by exploring how their strengths can be combined and their weaknesses mitigated, e.g., through hybrid approaches.

Attempts to combine distributional and symbolic representations to tackle NLU tasks have been pursued in three main directions. One strand of research has used linguistic or formal semantic features as additional input to systems that create distributional representations, e.g., Padó and Lapata (2007), Bjerva et al. (2014), Levy and Goldberg (2014), Bowman et al. (2015), Chen et al. (2018). Another strand of research has attempted the opposite: to use distributional features as input to systems that create symbolic representations, e.g., May (2016), van Noord et al. (2018), Oepen et al. (2020). Both of these research directions have focused on one of the frameworks and have only used the other in a complementary manner. The third research direction has attempted to place equal focus on the two frameworks by combining symbolic and distributional aspects in the final representation, e.g., Lewis and Steedman (2013), Beltagy et al. (2016), Kalouli et al. (2019), Krishna et al. (2022), by marrying traditional reasoning paradigms with neural approaches, e.g., Liang et al. (2017), Ebrahimi et al. (2021), or by aiming at explainable Artificial Intelligence (AI) (Calegari et al., 2020). We see such hybrid research efforts as promising not only for overcoming the described challenges and advancing the field but also for contributing to the symbolic versus deep learning “debate” that has emerged in the field of NLU.

Indeed, hybrid approaches have been pursued in several sub-fields of NLU, such as Natural Language Inference (NLI), Question-Answering (QA), Sentiment Analysis, and Dialog. Concerning NLI, recent research by Kalouli et al. (2020) proposes a hybrid approach where a trained classifier learns whether the symbolic or the deep learning component of the system should be trusted, based on the nature of the pair, i.e., on whether it involves complex linguistic phenomena and thus requires precise reasoning, or whether robustness and world knowledge are necessary. Within medical NLI, Wu et al. (2019) present an ensemble model based on one symbolic and two deep learning encoders. The symbolic encoder is a syntax encoder, capturing the structural information of the sentences, while the deep learning encoders are responsible for converting the text into distributional representations and injecting domain knowledge into the model. The QA field has attracted similar interest in hybrid methodology. Yi et al. (2018) propose a neural-symbolic visual question-answering system, which first recovers a structural scene representation from the image and a program trace from the question, and then executes the program on the scene representation to obtain an answer. Honda and Hagiwara (2019) employ a combination of deep learning models, Neural Machine Translation and Word2Vec training, to learn the symbolic processing performed by a Prolog system and use it to build a QA system. Within the field of Sentiment Analysis, Hu et al. (2017) propose a framework that enhances various types of neural networks (e.g., CNNs and RNNs) with declarative first-order logic rules by transferring the structured information of the rules into the weights of the networks. They show that their approach is able to outperform the state of the art in Sentiment Analysis and Named Entity Recognition. More recent research in hybrid Sentiment Analysis has been conducted by Cambria et al. (2020), who implement a new version of SenticNet (Cambria et al., 2018), a knowledge base used for sentiment analysis, by employing a top-down (symbolic) and a bottom-up (subsymbolic) approach. They use logic and semantic networks to encode meaning and deep learning architectures to implicitly learn syntactic patterns from the data. Finally, Kalatzis et al. (2016) and Eshghi et al. (2017) combine reinforcement learning with a symbolic dynamic model of syntax (Dynamic Syntax) and demonstrate the effectiveness of such an approach in bootstrapping dialog systems from very minimal data.
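To make the component-selection idea behind such hybrid NLI systems more concrete, the following minimal Python sketch routes a premise/hypothesis pair either to a symbolic or to a neural component. It is purely illustrative and not the implementation of any of the cited systems: the trigger list, the two stub components, and the hand-written routing heuristic are all invented for the example (Kalouli et al. (2020), for instance, train a classifier for the routing decision rather than using a word list).

```python
# Illustrative sketch of a hybrid NLI "router": pairs that show hard linguistic
# phenomena go to a precise symbolic component, all other pairs to a robust
# neural component. All parts below are toy stand-ins, not the cited systems.

HARD_PHENOMENA_TRIGGERS = {"no", "not", "every", "all", "some", "few", "most", "without"}

def symbolic_nli(premise: str, hypothesis: str) -> str:
    """Stub for a logic-based prover (precise on quantifiers and negation)."""
    # A real component would parse both sentences and run a proof search.
    return "entailment" if hypothesis.lower() in premise.lower() else "neutral"

def neural_nli(premise: str, hypothesis: str) -> str:
    """Stub for a neural classifier (robust, rich in world knowledge)."""
    # A real component would call a trained model, e.g., a fine-tuned transformer.
    overlap = set(premise.lower().split()) & set(hypothesis.lower().split())
    return "entailment" if len(overlap) >= 3 else "neutral"

def route(premise: str, hypothesis: str) -> str:
    """Decide which component to trust for this pair (toy heuristic)."""
    tokens = set((premise + " " + hypothesis).lower().split())
    if tokens & HARD_PHENOMENA_TRIGGERS:
        return symbolic_nli(premise, hypothesis)
    return neural_nli(premise, hypothesis)

if __name__ == "__main__":
    print(route("No dog is barking.", "Some dog is barking."))          # routed to the symbolic stub
    print(route("A man is cooking pasta.", "A man is preparing food.")) # routed to the neural stub
```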

Against this backdrop of hybrid approaches in NLU, we have promoted this research direction and fostered fruitful dialog between the two disciplines by establishing the NALOMA (Natural Logic Meets Machine Learning) workshop series in 2020 and continuing it since then. The workshop series, which started out with a focus on NLI, aims to bring together researchers working on hybrid methods in any subfield of NLU, including but not limited to NLI, QA, Sentiment Analysis, Dialog, Machine Translation, and Summarization. The workshops have also attracted researchers working in one of the two disciplines who are interested in moving into the hybrid direction. Topics that have been part of the workshops include: NLU systems that integrate logic-based/symbolic methods with neural networks, explainable NLU models, opening the “black box” of deep learning in NLU, downstream hybrid NLU applications, and comparison and contrast between symbolic and deep learning work on NLU, among others.

With this special issue, we bring together extended versions of several selected contributions to the NALOMA series. Specifically, the issue contains the following four articles.

Assessing the Strengths and Weaknesses of Large Language Models The opening paper by Shalom Lappin offers a clear, calm, and insightful discussion of the strengths and weaknesses of Large Language Models (LLMs). It provides a very timely and welcome contribution to the current, potentially overhyped, discussion on LLMs. The article carefully examines the arguments against the use of LLMs and takes a balanced stance according to which LLMs are far more than stochastic parrots, while, at the same time, the question of whether these models have anything to say about human language learning and linguistic representation remains largely unanswered.

Capturing the Varieties of Natural Language Inference: A Systematic Survey of Existing Datasets and Two Novel Benchmarks Gubelmann et al. focus specifically on the task of NLI. After providing an extensive survey of the types of inference (from a theoretical point of view) and of the current landscape of neural NLI models and datasets, they lay the foundations for discussing the generalization problem of these models based on the theoretical notions of inference. In particular, they use a dataset from the field of argumentative writing to evaluate and critically examine the strengths of neural NLI models on such kinds of inference. In their second experiment, they create their own dataset, which focuses on quantifiers and deductively valid inferences. Again, they probe LLMs on this dataset and draw conclusions about their capabilities in this area. Lastly, the authors explore options for fine-tuning and optimizing the models, also in comparison with a symbolic NLI system.

Monotonicity Reasoning in the Age of Neural Foundation Models The paper by Chen and Gao presents three methods for tackling monotonicity reasoning using deep learning and large language models. The first approach utilizes a Tree-LSTM with syntactic tree structures and a multi-hop self-attention aggregator to classify natural language inference problems. The second approach is a pipeline of rule-based and neural components: the pipeline first detects the polarities of words based on monotone operators and the sentence structure, and a search engine then attempts to gradually rewrite the premise into the hypothesis. One of the components in the rewriting search is neural-based and detects paraphrases that lie beyond the monotonicity calculus. The third approach exploits LLMs, including GPT-3.5, to classify monotonicity inference problems in zero- and few-shot learning experiments. The overall conclusion of the paper is threefold: LLMs are far from mastering monotonicity reasoning, the underlying tree structures do help in classifying monotonicity inferences, and joint reasoning with symbolic and neural components can set the state of the art on monotonicity reasoning.
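To give a flavor of what polarity-driven monotonicity reasoning amounts to, the toy Python sketch below marks the argument positions of a determiner as upward or downward monotone and then checks whether a single lexical substitution along an is-a hierarchy turns the premise into the hypothesis. The quantifier profiles reflect standard monotonicity facts, but the mini taxonomy and the one-step "search" are invented for the example; Chen and Gao's actual system works over full parses and a learned rewriting search.

```python
# Toy illustration of monotonicity reasoning: mark polarities induced by a
# determiner, then rewrite the premise towards the hypothesis by substituting
# along a lexical hierarchy (hypernyms in upward positions, hyponyms in
# downward positions). Not the system described in the paper.

# Monotonicity profile of some determiners: (restrictor, body) polarity.
QUANTIFIERS = {"every": ("down", "up"), "some": ("up", "up"), "no": ("down", "down")}

# Tiny invented is-a hierarchy: hyponym -> hypernym.
HYPERNYM = {"poodle": "dog", "dog": "animal", "waltzes": "dances", "dances": "moves"}

def allowed_substitutions(word: str, polarity: str) -> set:
    """Upward positions may generalize (hypernyms); downward positions may specialize (hyponyms)."""
    if polarity == "up":
        return {HYPERNYM[word]} if word in HYPERNYM else set()
    return {hypo for hypo, hyper in HYPERNYM.items() if hyper == word}

def entails(premise: str, hypothesis: str) -> bool:
    """Check 'DET NOUN VERB' premises against same-determiner hypotheses
    using at most one monotonic substitution per argument position."""
    det, p_noun, p_verb = premise.lower().split()
    det_h, h_noun, h_verb = hypothesis.lower().split()
    if det != det_h or det not in QUANTIFIERS:
        return False
    noun_pol, verb_pol = QUANTIFIERS[det]
    noun_ok = h_noun == p_noun or h_noun in allowed_substitutions(p_noun, noun_pol)
    verb_ok = h_verb == p_verb or h_verb in allowed_substitutions(p_verb, verb_pol)
    return noun_ok and verb_ok

if __name__ == "__main__":
    print(entails("every dog dances", "every poodle moves"))  # True: downward restrictor, upward body
    print(entails("some dog dances", "some poodle dances"))   # False: 'some' has an upward restrictor
```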

Monotonic Inference with Unscoped Episodic Logical Forms: From Principles to System Kim et al. propose a theoretical framework, and its implementation, for monotonicity inference with Unscoped Episodic Logical Forms (ULFs); a ULF is an Episodic Logic formula with unresolved scope, anaphora, and word senses. The implemented system is mainly a pipeline of rule-based components, and the inference process is a forward search from the premises to the hypothesis. The authors additionally extend the baseline system in three ways: (1) they use lexical information from the hypothesis to better guide the forward inference process, (2) they consider multiple possible scopings for a sentence, and (3) they match an obtained conclusion against the hypothesis on the basis of surface forms, to abstract away from possible parsing errors introduced in the ULFs. Both the baseline and the extended systems are evaluated on the generalized quantifier section of the FraCaS NLI dataset. The results show that each extension separately improves the baseline, but combining them yields only an average improvement, partially because (2) and (3) introduce wrong entailment relations.
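For readers unfamiliar with forward inference search as a control strategy, the short Python sketch below applies rewrite rules to the premise until the hypothesis (or a step limit) is reached. It is a deliberately simplified, invented illustration operating on plain strings; the actual system of Kim et al. rewrites ULFs with monotonicity-licensed rules, and both the example sentences and the two rules are chosen here only to echo the generalized quantifier style of FraCaS problems.

```python
# Simplified illustration of a forward inference search: starting from the
# premise, repeatedly apply rewrite rules until the hypothesis is derived or a
# depth limit is hit. Rules and sentences are invented; the real system
# rewrites Unscoped Episodic Logical Forms rather than surface strings.

from collections import deque

# Illustrative rewrite rules: (pattern, replacement) pairs.
RULES = [
    ("several", "some"),                 # several X -> some X
    ("european swimmers", "swimmers"),   # drop a restrictive modifier in an upward position
]

def forward_search(premise: str, hypothesis: str, max_steps: int = 5) -> bool:
    """Breadth-first forward search from premise to hypothesis."""
    frontier = deque([(premise.lower(), 0)])
    seen = {premise.lower()}
    goal = hypothesis.lower()
    while frontier:
        sentence, depth = frontier.popleft()
        if sentence == goal:
            return True
        if depth >= max_steps:
            continue
        for pattern, replacement in RULES:
            if pattern in sentence:
                new_sentence = sentence.replace(pattern, replacement, 1)
                if new_sentence not in seen:
                    seen.add(new_sentence)
                    frontier.append((new_sentence, depth + 1))
    return False

if __name__ == "__main__":
    print(forward_search("Several european swimmers won medals.",
                         "Some swimmers won medals."))  # True after two rule applications
```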