Can We Automate Diagrammatic Reasoning?

Learning to solve diagrammatic reasoning (DR) problems is a challenging but interesting task for the computer vision research community. It is believed that next-generation pattern recognition applications should be able to simulate the human brain in understanding and analyzing reasoning over images. However, due to the lack of diagrammatic reasoning benchmarks, present research primarily focuses on visual reasoning applied to real-world objects. In this paper, we present a diagrammatic reasoning dataset that provides a large variety of DR problems. In addition, we propose a Knowledge-based Long Short Term Memory (KLSTM) to solve diagrammatic reasoning problems. Our proposed analysis is arguably the first work in this research area. Several state-of-the-art learning frameworks have been compared with the proposed KLSTM framework. Preliminary results indicate that the domain is highly relevant to computer vision and pattern recognition research and offers several challenging avenues.


Introduction
Diagrammatic reasoning involves visual representations of objects or diagrams. It requires understanding concepts and ideas from images consisting of patterns. Solving such diagrammatic reasoning problems using computer vision and artificial intelligence can help us understand complex patterns of objects in images. Typically, a diagrammatic reasoning test consists of questions that require analyzing a sequence of shapes or patterns. This is also known as an abstract or inductive reasoning test []. The task is to identify the rules that govern a sequence and then use them to pick the appropriate answer. The questions are usually multiple choice. They generally consist of a series of pictures, each of which differs from the others or is oriented differently. The task is to choose another picture from a number of options to complete the series. For example, Figure 1 shows a typical diagrammatic reasoning problem, where the first row represents the question and the second row contains the four options, of which only one is correct.
Figure 1: A typical example of a diagrammatic reasoning problem. The first row presents the first three objects of a sequence of four objects in a particular order. The second row presents the multiple choices typically shown to an examinee. Option D is the right answer for the above problem.

Related Work
Solving reasoning problems using artificial intelligence (AI) is a challenging task. For example, solving mathematical word problems [1] using natural language processing (NLP) is a well-known problem in artificial intelligence. Solutions to such problems have enhanced the strategies of supervised learning by introducing newer rules. However, similar tasks in visual reasoning have not received focused attention from the computer vision research community. Two similar domains that have attracted computer vision and pattern recognition researchers are visual question answering [2,3,4] and visual reasoning [5,6,7]. Figures 2(a)

Motivation and Contributions
Diagrammatic reasoning can also be presented as a visual sequence prediction problem [10]. In computer vision, similar approaches are used to predict future video frames [11,12,13]. Following a line of thinking similar to visual reasoning in computer vision, we introduce a new research domain, namely solving diagrammatic reasoning problems using a machine-learning-guided computer vision process. In this context, we make the following contributions:
• Solving diagrammatic reasoning problems with the help of computer vision and pattern recognition techniques.
• Introducing a rich diagrammatic reasoning dataset that can be used by the computer vision research community for solving similar problems through pattern recognition and machine learning.
• Introducing a new learning framework, referred to as Knowledge-based Long Short Term Memory (KLSTM), to solve diagrammatic reasoning problems.
The rest of the paper is organized as follows. In Section 2, we present the datasets and benchmarks. Section 3 presents the proposed DR solving method. Experimental results are presented in Section 4. Conclusions and future work are presented in Section 5.

Datasets and Benchmarks
The ultimate goal of visual reasoning is to learn image understanding and interpretation. Due to the unavailability of datasets and benchmarks, research in this domain is still in its infancy. There is a large variety of DR problems. For example, Figure 3(a) represents a 2 × 2 DR problem and Figure 3(b) represents a 3 × 3 DR problem. These examples are complex, and we have found them hard to solve with common machine learning frameworks. Such problems are left for future research. In this paper, we consider 4 × 1 diagrammatic problems, such as the one shown in Figure 1.
We have collected images of diagrammatic reasoning problems from the web and prepared a dataset of 4 × 1 diagrammatic reasoning problems. The dataset contains 619 problems. We have categorized these problems into four groups, namely (i) Rotation (RT), (ii) Counting (CT), (iii) Shape Scaling (SS), and (iv) Other Type (OT). Figure 4 depicts one sample question with possible answers from each category, and Figure 5 depicts the distribution of the problems across the categories in our dataset.

KLSTM for Solving 4 × 1 DR Problems
The proposed DR solving method is based on a set of features and rules. We have introduced a supervised, rule-based method to extract relational features (RF) from image sequences. The proposed method proceeds in stages. During the first level of processing, the question and options are passed through a knowledge acquisition tool to construct the knowledge base. The knowledge consists of a set of image features extracted from each individual image and a set of relational features extracted from the sequence of images in the question. Next, the problem type is identified using a rule-based method. Finally, the features are passed through a Knowledge-based Long Short Term Memory (KLSTM) to predict the possible output pattern or image. Figure 6 depicts the proposed framework in detail. The KLSTM network consists of (i) a knowledge acquisition module, (ii) a set of LSTMs, and (iii) a problem classifier and LSTM chooser module.
The problem space (P) is defined in (1), where the question contains a set of images Q = {I_1, I_2, I_3} and the given options are grouped in another set O = {I_4, I_5, I_6, I_7}. The diagrammatic reasoning task is to predict the answer image such that I_answer ∈ O. First, we represent the problem using a high-level knowledge structure. This is carried out as follows. The domain knowledge of human experts (rules) is used to understand the relation among a sequence of images. Next, a knowledge base (K) is constructed. These expert opinions (rules) are integrated with the system to solve visual reasoning problems. The method is presented in Algorithm 1.
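As a minimal sketch of the problem representation described above, the following Python class stores the three question images and the four options of a 4 × 1 problem. It assumes that P simply collects Q and O; the class name `ProblemSpace`, its method names, and the use of string identifiers in place of images are illustrative assumptions, not part of the proposed method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProblemSpace:
    """Hypothetical container for a 4 x 1 DR problem (names are assumptions)."""

    question: List[str]  # Q = {I_1, I_2, I_3}, stored here as image identifiers
    options: List[str]   # O = {I_4, I_5, I_6, I_7}

    def __post_init__(self):
        # A 4 x 1 problem has exactly three question images and four options.
        if len(self.question) != 3 or len(self.options) != 4:
            raise ValueError("expected 3 question images and 4 options")

    def all_images(self) -> List[str]:
        """All seven images, assuming P collects Q followed by O."""
        return self.question + self.options

    def is_valid_answer(self, image_id: str) -> bool:
        """The answer must be one of the options: I_answer in O."""
        return image_id in self.options


problem = ProblemSpace(["I1", "I2", "I3"], ["I4", "I5", "I6", "I7"])
print(problem.all_images())
```

Keeping Q and O separate makes the constraint I_answer ∈ O easy to check, while `all_images` gives the seven-image view used when features are extracted from every image in P.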
Figure 6: Architecture of the proposed DR solving framework with its components. The question sequence and the options are taken as input and a knowledge base is constructed. The framework then predicts the best possible option out of the four input options and produces a complete sequence of four patterns/images. The framework consists of a rule-based problem classifier and a set of LSTMs similar to [14]. The inputs to the LSTMs are relational features (RF).

Algorithm 1 Diagrammatic Reasoning
Input: Problem space as defined in (1)
Output: [...]

Knowledge Acquisition: Knowledge acquisition is carried out during training [...] problems used in training. First, the shapes in each image in P and the number of similar shapes are extracted using YOLO [15]. YOLO is fast, and its accuracy is good enough in our context. We then introduce a new feature for solving diagrammatic reasoning problems, referred to as the relational feature. Unlike image-based features such as color, texture, shape, or edge that are typically used in computer vision applications, we extract three relational features, namely rotation (ρ), counts (χ), and scaling (σ), from the set of given images. The feature-set is given in (2). The components of the feature-set (k) are described hereafter.
Shape Detection: Each image in P is passed through a deep learning module to extract the shapes. We consider common geometrical shapes such as circle, triangle, rectangle, square, diamond, star, and hexagon that are usually present in DR problems. All shapes are classified as either empty (only edges) or filled. For this binary classification, YOLO [15] has been found to perform better than ResNet50/101 [16], VGG16 [17], or GoogleNet [18].
Figure 7: We represent the rotation problem as a set of 7 images or patterns. In rotation problems, we consider the first image (red) as the reference image with 0° rotation and extract the rotation relation of the other images.
Rotation: In a typical rotation diagrammatic reasoning problem (Figure 7), the solution lies in rotating the figure correctly to complete the sequence. We take the first image (I_1) as the reference with a rotation of 0°. All other images (I_2, ..., I_7) are expressed by their rotation angle with respect to the reference image. To achieve this, 360 images are generated by incrementally rotating the base image by 1°. A few samples of the rotated images corresponding to the DR problem described in Figure 7 are shown in Figure 8. This set is denoted by R = {I_1, I_2, ..., I_360}. The similarity score (ψ) is defined in (3). This score is estimated between a query image and all images of R using ResNet50, where I_j is the query image and I_k is an image in R.
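The construction of the rotated reference set and the search for the best-matching rotation can be illustrated with a small, self-contained sketch. It makes two loud assumptions: a pattern is represented as an ordered list of 2-D points rather than an image, and a simple distance-based score stands in for the ResNet50 similarity ψ, whose definition in (3) is not reproduced here. All function names are hypothetical.

```python
import math


def rotate(points, degrees):
    """Rotate 2-D points counter-clockwise about the origin."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def similarity(a, b):
    """Stand-in score in (0, 1]; NOT the ResNet50-based score of the paper.

    It equals 1 only when corresponding points coincide.
    """
    mean_dist = sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_dist)


def build_reference_set(base, step=1):
    """Reference set R: the base pattern rotated in 1-degree increments."""
    return {angle: rotate(base, angle) for angle in range(0, 360, step)}


def relative_rotation(query, reference_set):
    """Return the best-matching rotation angle and its score."""
    scores = {a: similarity(query, ref) for a, ref in reference_set.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


base = [(1.0, 0.0), (0.0, 2.0), (-0.5, -0.5)]  # asymmetric toy pattern (I_1)
R = build_reference_set(base)
angle, score = relative_rotation(rotate(base, 90), R)
print(angle, score)
```

With real images, the point lists would be replaced by rendered rotations of the reference image and `similarity` by a feature-based comparison; the search over R stays the same.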
The relative rotation ρ(I_k) of each image of P is then extracted with respect to each image I_j belonging to R. If the images in P are different from each other, we categorize the question as a non-rotation problem and a flag NA is set. A threshold is used to decide whether matching has succeeded. ρ(I_k) is set to the rotation value if the matching score returned by ResNet50 is above the threshold. If multiple images score above the threshold, the image with the highest score is selected and its rotation angle is taken as the final input. If none is found suitable, the problem is categorized as a non-rotation diagrammatic reasoning problem.
Counting: Counting is a reasoning problem where the solution is to extract the correct number of shapes present in the problem sequence. First, the shapes are detected and the number of identical shapes is estimated. For example, Figure 9 depicts a typical filled circle detection and counting using YOLO. Each image of the problem space is expressed using the count of shapes, giving a sequence such as {2, 4, 6, ?}. The predicted missing number is chosen from the set {6, 8, 4, 10}.
Scaling: Relative scaling (σ) is extracted from the bounding boxes of the detected shapes. First, the bounding boxes are extracted from the shapes in P. Next, similar-sized shapes are grouped using unsupervised density-based spatial clustering of applications with noise (DBSCAN) [19]. The groups are then arranged in order of labels such that L_1 < L_2 < ... < L_n. These groups can be labelled as large, medium, small, and tiny for a typical 4 × 1 DR problem. The process of grouping and labelling shapes is described in Algorithm 2.
Representation of Knowledge Base: For a given problem space P, the shapes are detected and the relational features (RF) are extracted as mentioned earlier.
The knowledge base consists of four sets, namely shapes, rotation (ρ), counting (χ), and scaling (σ). The shapes set stores information about the structures, and the other sets represent the components of the relational features. Table 1 shows the knowledge extracted from four different 4 × 1 problems.
Classification and Solving: The final stage is to learn the pattern from the question images and predict the correct answer from the given options. First, the relational features (RF) are extracted from all training samples. Next, four independent LSTMs corresponding to ρ, χ, σ, and unknown problems are trained to build the prediction model. In the testing phase, a similar knowledge base is extracted from the test sample. Next, a rule-based method, described in Algorithm 3, is applied to classify the problem as Category 1 (RT, CT, or SS) or Category 2 (OT). For Category 1, a variation of the LSTM proposed in [14] is used. That method was originally used to generate captions from images. Rather than using conventional image-based features [13,12], we use the relational features (RF) extracted by the knowledge extractor. The method is depicted in Figure 11, where the knowledge extractor (KE) is the process of extracting RFs and the representor (R) is the image with a set of relational features.
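To make the knowledge base and the rule-based classification step more concrete, the sketch below assembles the four sets (shapes, ρ, χ, σ) from toy detections and then assigns a problem category. Several parts are assumptions and not the authors' method: a simple gap-based grouping of box areas stands in for DBSCAN, the classification rules are hypothetical because Algorithm 3 is not reproduced here, and all function names, the `eps` value, and the input format are illustrative.

```python
def size_labels(areas, eps):
    """Group 1-D sizes (stand-in for DBSCAN): sorted values separated by a
    gap larger than eps start a new group; labels grow with size."""
    ordered = sorted(set(areas))
    label_of, label = {}, 0
    for i, area in enumerate(ordered):
        if i > 0 and area - ordered[i - 1] > eps:
            label += 1
        label_of[area] = label
    return [label_of[a] for a in areas]


def build_knowledge(detections, rotations, eps=50.0):
    """detections: per image, a list of (shape_name, box_width, box_height).
    rotations: per image, a rotation angle or None (the NA flag)."""
    areas = [w * h for image in detections for _, w, h in image]
    labels = iter(size_labels(areas, eps))
    return {
        "shapes": [sorted({n for n, _, _ in img}) for img in detections],
        "rho": list(rotations),
        "chi": [len(img) for img in detections],
        "sigma": [sorted({next(labels) for _ in img}) for img in detections],
    }


def classify(kb):
    """Hypothetical rules (Algorithm 3 is not available in this text)."""
    rho, chi = kb["rho"], kb["chi"]
    sigma = [tuple(s) for s in kb["sigma"]]
    if all(r is not None for r in rho) and len(set(rho)) > 1:
        return "RT"  # every image matched a distinct rotation of the reference
    if len(set(chi)) > 1:
        return "CT"  # the number of shapes changes along the sequence
    if len(set(sigma)) > 1:
        return "SS"  # the size group changes along the sequence
    return "OT"


# Question images of a counting problem with 2, 4 and 6 circles.
counting = [[("circle", 10, 10)] * n for n in (2, 4, 6)]
print(classify(build_knowledge(counting, [None, None, None])))
```

In this sketch the category only decides which predictor would be consulted next; the LSTM-based prediction itself is outside its scope.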
Unknown problems (Category 2) are solved by a variation of LSTM called the Flexible Spatio-Temporal Network (FSTN) proposed in [11]. Originally, the method predicts future video frames from a set of observed sequences. In this method, image-based features are sequentially passed through an LSTM in an encoding/decoding manner. Figure 12 depicts the method in detail. The method consists of encoders (E), decoders (D), and a matching network (M). The network is trained using several image sequences. Table 2 shows the reference feature predictions for the problems shown in Table 1.

Experiments using Baselines
We present the experimental results in this section. The first step of the method is to detect shapes in a given image. We have experimented with state-of-the-art convolutional networks including ResNet50, ResNet101 [16], VGG16 [17], GoogleNet [18], and YOLO [15]. YOLO has been found to be the best architecture for the present case. Across all experiments, 70% of the data have been used for training and 30% for testing. We have performed 10-fold cross-validation and reported the average results. Table 3 summarizes the shape detection results.
In the next stage, classification of the problem type has been carried out. The [...] Figure 13. It is observed that the proposed method can successfully classify the problems with reasonably high accuracy.
Table 3 (recovered rows): [...] 71.11; GoogleNet [18] 77.22; YOLO [15] 86.76.
We have carried out several experiments to understand the behavior of the DR problem solver. We have taken image-based features as the baseline and applied state-of-the-art recurrent neural networks (RNN) to solve the reasoning problems. The results are summarized in Table 4. Figure 14 depicts some success and failure cases.

Conclusion
In this paper, we have introduced a new dataset for solving diagrammatic reasoning (DR) problems using machine learning and computer vision. The dataset can open up new challenges for the vision community. We have experimented with several state-of-the-art learning frameworks to solve typical 4 × 1 DR problems. It has been observed that image-based analysis fails to answer correctly in many cases. We have introduced a new feature set called relational features. Rule-based learning with the help of LSTMs has been used to classify and solve the DR questions. The results show that the proposed rule-based method outperforms existing image-based analysis. It has also been observed that the simple rules defined in this work may not be sufficient to solve all types of DR problems. More complicated rules need to be defined, and the feature-set may need to be redefined for solving complex DR problems. In particular, DR problems of the other type (OT) need further attention from the research community.