PROST: Physical Reasoning about Objects through Space and Time

We present a new probing dataset named PROST: Physical Reasoning about Objects Through Space and Time. This dataset contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts. All questions are designed to probe both causal and masked language models in a zero-shot setting. We conduct an extensive analysis which demonstrates that state-of-the-art pretrained models are inadequate at physical reasoning: they are influenced by the order in which answer options are presented to them, they struggle when the superlative in a question is inverted (e.g., most ↔ least), and increasing the amount of pretraining data and parameters only yields minimal improvements. These results provide support for the hypothesis that current pretrained models' ability to reason about physical interactions is inherently limited by a lack of real-world experience. By highlighting these limitations, we hope to motivate the development of models with a human-like understanding of the physical world.


Introduction
In the context of natural language processing (NLP), Bender and Koller (2020) provide a working definition of "understanding" as the ability to recover the communicative intent from an utterance. To achieve this, one must be able to query a set of concepts that is aligned with the speaker's own understanding. An example of such alignment is our interaction with the physical world. This experience, shared by all humans, provides a common set of concepts to rely on in communication. For example, the reader can map the phrase I dropped my pint glass to a set of relevant experiences and generate a mental depiction of the scene. Further, the reader can also use their knowledge of gravity and the properties of a pint glass to reason about potential outcomes: the pint glass will fall toward the ground and will likely break on impact.
Children grab, push, and play with the objects around them to form concepts about the world they live in even before learning to talk (Hespos and Spelke, 2004). These concepts are then linked with words to enable communication, eventually providing the necessary grounds for concepts and language to co-develop (Bloom, 2002; Gelman, 2009). In contrast, current language models (LMs) are not exposed to real-world experiences, making them incapable of grounding language (Bisk et al., 2020a). We hypothesize that this lack of experience impedes both their ability to understand an utterance relating to the physical world and their ability to reason about its implications.
In order to investigate our hypothesis, we create PROST: Physical Reasoning about Objects Through Space and Time, a probing dataset to evaluate the ability of pretrained LMs to understand and reason about the physical world. PROST consists of multiple-choice cloze-style questions covering 10 basic concepts: direction, mass, height, circumference, stackable, rollable, graspable, breakable, slideable, and bounceable. Importantly, PROST is designed to avoid models succeeding in unintended ways. First, PROST provides no training data, so as to probe models in a zero-shot fashion. This prevents models from succeeding through spurious correlations between training and test data and encourages success through a true understanding of and reasoning about the concepts at hand. Second, we manually write templates for all questions in an effort to prevent models from having seen the exact same sentences in their training data. Finally, PROST focuses on a small set of well defined, objective concepts that only require a small vocabulary. This allows researchers to focus more on the quality of training data rather than its size.
Contributions We make two contributions: 1) We introduce PROST, a dataset with 18,736 cloze-style questions created from 14 manually written templates, covering 10 physical reasoning tasks. 2) We conduct an extensive analysis which demonstrates that state-of-the-art pretrained models are inadequate at physical reasoning. More specifically, they are influenced by the order in which answer options are presented to them, they struggle when the superlative in a question is inverted (e.g., most ↔ least), and increasing the amount of pretraining data and parameters only yields minimal improvements. The dataset and code are available at github.com/nala-cub/prost.

Related Work
Evaluation of Reasoning Abilities As pretrained models are excelling on many NLP tasks, more work is being done on understanding their abilities. A subset of this work focuses on physical reasoning. PIQA (Bisk et al., 2020b) tests physical commonsense, with concepts ranging from hard shell tacos to separating egg yolks. In order to succeed on PIQA through reasoning, a model would need to be able to understand thousands of human experiences. In contrast, PROST provides a first step towards grounded understanding and reasoning by focusing on a few simple concepts. Bakhtin et al. (2019) provide a set of 2D puzzles that involve placing a new object in a scene to accomplish a goal. This research also focuses on simple physics; however, there is no language component. Clark et al. (2018) and Kembhavi et al. (2017) both provide a large set of grade school multiple-choice questions, including some that could be solved with reasoning. However, both provide corresponding material where the solution can be found, relying more on information retrieval than a general understanding of and reasoning about the world.
Another set of reasoning-based benchmarks focuses on commonsense reasoning. SWAG and its extension HellaSwag evaluate commonsense natural language inference (Zellers et al., 2018, 2019). Sap et al. (2019) test commonsense reasoning about social situations. However, commonsense reasoning is often subjective and requires understanding of complex human-human interactions involving social and societal norms. In contrast, physical reasoning is based on objective and well defined constructs. Other datasets (Forbes and Choi, 2017; Elazar et al., 2019; Goel et al., 2019) focus on object-attribute comparison. However, they compare concepts at the word level rather than the sentence level and use a large training set to create an engineered object-attribute comparison model. It is difficult to see how these models could generalize to other forms of reasoning.
Moreover, all the above datasets follow a pretraining-agnostic identically distributed (PAID) paradigm (Linzen, 2020), making them susceptible to models that can leverage unintended correlations between the training and test sets.
Zero-Shot LM Probes Similar to PROST, several recent benchmarks have circumvented the concern of identically distributed training and test sets by probing models in a zero-shot manner. Petroni et al. (2019) query masked LMs (MLMs) for factual knowledge using templates in the format of Dante was born in [MASK]. Talmor et al. (2020) use a similar format to probe six concepts ranging from age comparison to taxonomy conjunction. Ettinger (2020) uses this format to show that BERT robustly retrieves hypernyms, but fails to understand negation. Lin et al. (2020) probe numerical commonsense in both MLMs and traditional LMs. Warstadt et al. (2020) measure traditional LMs' sense of grammatical acceptability by comparing sentence probabilities.
Grounded Language Environments PROST investigates whether pretrained models show a lack of understanding of the physical world which could result from learning language without grounding. While not used for pretraining, a number of multi-modal environments have been developed to ground language. Shridhar et al. (2020)'s ALFRED builds on other vision-and-language navigation environments (Gordon et al., 2018; Regneri et al., 2013; Zhu et al., 2017; Anderson et al., 2018), and enables grounding of language instructions to actions, behaviours, and objects. BABYAI (Chevalier-Boisvert et al., 2019) and BABYAI++ (Cao et al., 2020) provide an environment to ground simple language in a gridworld. Additionally, other work has explored grounding language in simulations or the real world (Hill et al., 2020; Lynch and Sermanet, 2020). While they provide important resources to ground language, little emphasis is placed on the language modules themselves. They are often trained tabula rasa, learning language for a singular purpose and missing out on the syntax and coverage learnt during pretraining; language is only ever an input, and no analysis has been done on how language understanding evolves as the agent learns to succeed on different tasks.

PROST
PROST consists of 18,736 cloze-style multiple-choice questions designed to probe a LM's physical reasoning ability. They cover 10 basic concepts: direction, mass, height, circumference, stackable, rollable, graspable, breakable, slideable, and bounceable. We choose these concepts because they are well defined, easily learned by interacting with the world, and useful for any embodied agent. The questions are constructed from 14 manually written templates. Each template follows one of three formats: the first is specific to the set of questions pertaining to directions; the second gauges the relative attributes of objects, specifically mass, height, and circumference; and the third targets the affordances of objects, specifically whether an object is stackable, rollable, graspable, or breakable, and whether a surface is slideable or bounceable. We use CheckList (Ribeiro et al., 2020) to obtain the questions from our templates. We show all templates in Table 1 and explain them in detail below. We end this section by describing the objects featured in PROST.
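As a concrete illustration of how a template expands into questions, the following sketch re-creates the first direction template with plain string formatting. PROST itself generates its questions with CheckList; the code and variable names here are illustrative only, not PROST's actual implementation.

```python
from itertools import product

# Illustrative re-creation of the first direction template (see Table 1).
TEMPLATE = ("A person is walking {start}. They turn {turn}. "
            "They are now walking [MASK].")

# Expand every combination of starting direction and turn.
questions = [TEMPLATE.format(start=start, turn=turn)
             for start, turn in product(("north", "east", "south", "west"),
                                        ("left", "right", "around"))]
print(len(questions))  # 4 starting directions x 3 turns = 12 questions
```

Together with the four manually written gravity questions, this expansion accounts for the 16 direction questions described in the next subsection.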

Direction Templates
We use two templates to generate questions which probe understanding of direction. The first focuses on cardinal directions. The second uses a set of four manually crafted questions to probe understanding of how gravity affects the direction of a ball throughout its trajectory. Due to their similarity, we count these four questions as a single template. The direction templates create a total of 16 questions.

Attribute Templates
The second set of templates probes the models' ability to reason about the relative mass, height, and circumference of common objects. For each of these three concepts we create a set of six objects that are easily ordered by the respective attribute. A context is first presented with up to four of the six objects to prime the models with the range of possible choices. This is followed by a prompt that probes the model to select one of the objects based on its mass, height, or circumference. By inverting the superlative in the prompt (e.g., longest ↔ shortest), we can probe the model's ability to identify both the object with the highest attribute value and the object with the lowest attribute value from the set of choices. We permute through all objects and all orders. Each of the three attributes is tested using two templates that share the same set of objects. Each template produces 6P4 × 2 = 720 questions, meaning each attribute is probed using 1440 questions.
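The per-template count can be verified directly; a minimal sketch using only the standard library:

```python
from math import perm  # perm(n, k) is the number of ordered k-subsets of n

# 6 objects per attribute, 4 ordered context slots, and 2 superlatives
# (e.g., longest vs. shortest) per template.
questions_per_template = perm(6, 4) * 2
questions_per_attribute = questions_per_template * 2  # two templates each
print(questions_per_template, questions_per_attribute)  # 720 1440
```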

Affordance Templates
The remaining templates target an understanding of object affordances. For each affordance (stackable, rollable, graspable, breakable, slideable, and bounceable) we collect a set of five objects with and five objects without that affordance. Again, we first provide a short context that contains each of the four possible objects. We then provide a prompt that requires the model to select the only object either with or without the affordance. We include all permutations of objects where there is exactly one correct answer. These templates produce 5P1 × 5P3 × 4 × 2 = 2400 questions for each of the six affordances.
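The affordance count follows the same logic; in the sketch below, the factor of 4 is the position of the single correct object among the four context slots:

```python
from math import perm

# 5 choices for the single object with (or without) the affordance,
# perm(5, 3) ordered triples of distractors from the other group,
# 4 positions for the correct object in the context, 2 superlatives.
questions_per_affordance = perm(5, 1) * perm(5, 3) * 4 * 2
print(questions_per_affordance)  # 2400
```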
Objects in PROST All possible values for the placeholders in our templates are shown in Table 3. For affordances, we display objects in two groups: those with and those without each affordance. For attributes, objects are sorted in increasing order; e.g., for mass, leaf is the lightest object and microwave is the heaviest. Each object in PROST is selected to be a single token in a wide range of vocabularies, to enable easy probing of MLMs. We validate the order of our attribute objects and the group membership of our affordance objects by collecting judgments from 9 human validators.

Models
Using PROST, we probe three types of transformer-based models (Vaswani et al., 2017): decoder models, encoder models, and encoder-decoder models. Each model type has slightly different formatting requirements, which we show in Table 2. For each model type, we probe a range of different sizes to investigate the effects of scaling. We use Huggingface's (Wolf et al., 2020) pretrained models; see Table 4 for the full set.

Decoder Models
We analyze OpenAI's GPT-1 (Radford et al., 2018) and GPT-2 (Radford et al., 2019). Both are based on a transformer decoder architecture and trained on a traditional language modeling objective. We run these models over each question, scoring each of the four answer options by the probability the model assigns to it as a continuation of the context, and select the highest-scoring option.

Encoder Models
We analyze BERT (uncased) (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020), which are all based on transformer encoders. BERT is trained on MLM and next sentence prediction and uses static masking, RoBERTa is trained on MLM with dynamic masking, and ALBERT uses whole-word n-gram masking. For probing, we filter out all but the four answer choices from the output vocabulary and select the token with the highest probability as the model's decision.
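This selection step can be sketched on a toy logits vector. No real model is loaded here, and the token ids and scores are invented purely for illustration; in practice the logits come from the MLM's output at the [MASK] position.

```python
# Toy [MASK]-position logits over a small vocabulary; in practice these
# come from a masked LM such as BERT, RoBERTa, or ALBERT.
logits = [0.1, 2.3, -0.5, 1.7, 0.0, 3.1, -1.2, 0.4, 2.9, 0.8]

# Invented single-token ids for the four answer options.
option_ids = {"glass": 3, "pillow": 6, "coin": 8, "pen": 1}

# Ignore the rest of the vocabulary: argmax over the four options only.
prediction = max(option_ids, key=lambda word: logits[option_ids[word]])
print(prediction)  # 'coin' has the highest logit (2.9) among the options
```

Restricting the argmax to the answer tokens means the model is never penalized for preferring some unrelated high-frequency word over all four options.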

Encoder-decoder Models
We also include results for T5 (Raffel et al., 2020). T5 is trained using a span corruption objective, in which spans of the input sequence are randomly replaced with a single mask token. During pretraining, span lengths are chosen randomly with an average length of three.
To keep our results consistent with the other models, we restrict the span length to one token. We find that two of the options for sliding surfaces, namely ice and frost, violate our single-token constraint. To avoid any unfair comparison between answers that differ in token length, and following previous work (Goldberg, 2019), we choose to omit the results for T5 on the sliding concept.
Finetuned Conditional LMs To better understand the limitations of text-only training, we additionally evaluate UnifiedQA (Khashabi et al., 2020), a T5 model finetuned on a collection of question-answering datasets.

Results
The per-model and per-concept results are shown in Table 5. For concepts with more than one template (direction, mass, height, and circumference), we average across templates to get the concept's score.
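This macro average is the unweighted mean over a concept's templates; a sketch with invented per-template accuracies (not real results from the paper):

```python
from statistics import mean

# Hypothetical per-template accuracies for one model, keyed by
# (concept, template id). The values are made up for illustration.
template_acc = {
    ("mass", 1): 0.50, ("mass", 2): 0.25,  # two mass templates
    ("rollable", 1): 0.20,                 # single-template concept
}

# Group accuracies by concept, then take the unweighted mean.
concept_acc = {}
for (concept, _template_id), acc in template_acc.items():
    concept_acc.setdefault(concept, []).append(acc)
concept_scores = {c: mean(accs) for c, accs in concept_acc.items()}
print(concept_scores)  # {'mass': 0.375, 'rollable': 0.2}
```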
We can see that, on average, ALBERT-V2-XL performs best, with an accuracy of 31.8%, and GPT-2 performs worst, with an accuracy of 23.6%. We note that random guessing would yield an accuracy of 25%. Furthermore, every model underperforms random guessing on at least one concept. Since PROST is trivially solvable for humans, this supports our hypothesis that pretrained models are unable to perform physical reasoning anywhere close to human performance.
Comparing across all concepts, we see that direction obtains the highest average accuracy with 46.8%. The second best accuracy is observed for the mass attribute with 36.5%. The concepts models struggle most with are the slideable and bounceable affordances, both with an average accuracy of 19.9%.

Analysis
Object Order in Context For the concepts that use objects, all four choices are listed in each question's context. PROST contains all permutations of their ordering. This enables us to directly examine the effect of the correct answer's position within the context on the models' accuracy. These results are shown in Table 6.
We see that models have a strong tendency to select either the first or the last item seen in the context. The largest difference is found for T5, with an accuracy of 52.4% for objects at position 1 and an accuracy of only 1.9% for objects at position 3. We note that a proper understanding of the questions, as most humans would have, would be robust to the order in which the choices are presented. This further underlines that state-of-the-art models do not perform human-like physical reasoning.

Superlative Inverses By inverting the superlative in a question, we are able to probe a mirrored version of the question. For example, for attributes, this would require the model to identify the lightest object instead of the heaviest object, or, for affordances, it would require the model to identify the not stackable object instead of the stackable object. We call these mirrored versions superlative inverses. A true understanding of the questions in PROST should be robust to this kind of inversion. However, Table 7 shows that all models perform better on one of the two versions. Of the probed models, GPT-2 is the most unbalanced, averaging 30.6% higher on one version than the other.
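The superlative-inverse gap reported in Table 7 can be computed as below; the per-question records here are invented purely to show the bookkeeping:

```python
# Hypothetical per-question results: (superlative_kind, answered_correctly).
# Each PROST question has a mirrored twin with the inverted superlative.
results = [
    ("most", True), ("most", True), ("most", False), ("most", True),
    ("least", False), ("least", False), ("least", True), ("least", False),
]

def accuracy(kind):
    """Accuracy over all questions using the given superlative kind."""
    outcomes = [ok for k, ok in results if k == kind]
    return sum(outcomes) / len(outcomes)

# Absolute accuracy difference between a question set and its inverse.
gap = abs(accuracy("most") - accuracy("least"))
print(f"{gap:.1%}")  # 75.0% vs 25.0% -> a 50.0% absolute gap
```

A model that truly understood the questions would score similarly on both halves, driving this gap toward zero.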

Data and Model Scaling
Figure 2 shows each model's accuracy as a function of its number of parameters. Unlike on many modern benchmarks, where increasing the number of parameters or the amount of training data provides significant benefits (Talmor et al., 2020; Wang et al., 2018), PROST does not see much improvement from such scaling. We observe some improvements, with T5-3B outperforming T5-small, but this 6.6% increase requires a 48x increase in parameters, and T5-small still outperforms T5-3B on one task. Moreover, some models break this trend: ALBERT's XL version outperforms its XXL counterpart, and GPT-2 M outperforms GPT-2 L. So, while previous work has revealed the impressive scaling laws of transformer-based architectures (Kaplan et al., 2020), our results suggest that scaling alone is unlikely to close the gap on PROST.

Comparing PROST and PIQA Due to their shared focus on text-based physical reasoning, PROST and PIQA share similarities. To test whether models trained on PIQA are able to carry over any concepts to PROST, we further finetune a UnifiedQA model on PIQA and evaluate it on PROST. The results, shown in Figure 3, indicate that training a model on PIQA is detrimental to its performance on PROST. While PIQA and PROST share a few conceptual similarities, they differ in terms of format, style, and vocabulary. We thus hypothesize that current models learn more about these surface-level differences than the conceptual similarities underpinning the questions. We further highlight two key differences between the two datasets:
• PROST probes models in a zero-shot fashion, whereas PIQA provides training and test sets of identically distributed examples. This makes it possible for models on PIQA to answer successfully using spurious correlations rather than physical reasoning.
• PIQA (Bisk et al., 2020b) covers thousands of varied human experiences with a broad vocabulary, whereas PROST focuses on a small set of simple, well defined concepts.

A number of other reasoning benchmarks have been solved to some extent by a large finetuned model. UnifiedQA (11B parameters), based on T5 (Raffel et al., 2020), achieved 81.4% on ARC (Clark et al., 2018); and UNICORN (11B parameters), also based on T5, achieved 93.9% accuracy on HellaSwag (Zellers et al., 2019). While all these models are larger and trained on more data, our results force us to ask whether they perform well because the additional parameters and data have imbued them with an ability to reason, or because they find subtle unintended correlations in the data. This forces us to look more closely at how models succeed, and not just at the accuracy they achieve. Tools like CheckList (Ribeiro et al., 2020) can aid in this endeavor by demonstrating how robust models are to changes in the distribution of the data.
How to Use this Probe PROST is intended to help analyze any model that can be deployed in a text-only setting. However, we maintain that multi-modal data is necessary to experience the concepts in PROST, and that these experiences are likely a crucial step toward succeeding on this dataset. One way that multi-modal models could prepare for this type of text-only evaluation is through multi-task training, where one of the tasks is conditioned only on text. Such an approach has already been considered: Radford et al. (2021) propose an extension to their CLIP model which is trained on multiple modalities in a multi-task fashion. Because of the templated nature of PROST, its exact format can be adapted to match specific styles of language training, as we do for T5 and UnifiedQA.
PROST's language-only approach is motivated by two reasons. First, we believe that true multi-modal models should be able to function on any subset of their modalities. We note that humans can easily interact with text-only inputs (e.g., a text message) while still learning from and interacting with other modalities. Second, it enables the comparison of models trained using different modalities or domains. For example, we believe comparing how language understanding modules evolve when trained on vision-and-language navigation versus visual question answering would provide invaluable insights.
Limitations We caution that achieving a high accuracy on PROST does not necessarily guarantee that a model is capable of physical reasoning. It would likely be easy to succeed on this benchmark by intentionally training models on similar enough sentences or on a subset of PROST itself. We hope that the community will use this dataset in the intended way: in a zero-shot setting to probe models which have been trained on data not specifically collected to succeed on PROST.

Conclusion
We present a probing dataset called PROST, which is designed to test a model's ability to reason about the physical world. Our experiments show that current state-of-the-art pretrained models lack the ability to reason about physical interactions. Further, all models struggle when the order of options is changed and when questions are inverted, both things that would not confuse humans. Lastly, our analysis shows that these issues are unlikely to be solved by simply scaling up models. Our results highlight the need to look beyond text-based pretraining and to provide models with the necessary experiences for a human-like understanding of the physical world.
C: A person drops a glass, a pillow, a coin, and a pen from a balcony. Q: The [MASK] is most likely to break. O: A) glass B) pillow C) coin D) pen

Figure 1: An example question from PROST.
Directs. 1 (12 questions)
C: A person is walking {north/east/south/west}. They turn {left/right/around}.
Q: They are now walking [MASK].
O: A) north B) east C) south D) west

Directs. 2a (1)
C: A person drops a ball.
Q: Immediately after leaving the person's hand, the ball is moving toward the [MASK].

Directs. 2b (1)
C: A person throws a ball straight into the air.
Q: Immediately after leaving the person's hand, the ball is moving toward the [MASK].

Directs. 2c (1)
C: A person throws a ball straight into the air.
Q: Immediately after reaching the highest point in its trajectory, the ball is moving toward the [MASK].

Directs. 2d (1)
C: A person drops a ball. The ball then bounces off the ground.
Q: Immediately after bouncing off the ground, the ball is moving toward the [MASK].
O (2a-2d): A) ground B) sky C) left D) right

Mass 1 (720)
C: A(n) {mass obj1}, a(n) {mass obj2}, a(n) {mass obj3}, and a(n) {mass obj4} moving at identical speeds each collide with a static hockey puck.
Q: The puck hit by the [MASK] slides the {shortest/longest} distance.

Mass 2 (720)
C: A(n) {mass obj1} and a(n) {mass obj2} are placed on either end of a perfectly balanced seesaw.
Q: The side of the seesaw with the [MASK] moves {up/down}.
O (Mass 1-2): A) {mass obj1} B) {mass obj2} C) {mass obj3} D) {mass obj4}

Height 1 (720)
C: Four balls are dropped. The first is dropped from the height equivalent of a {height obj1}, the second is dropped from the height equivalent of a {height obj2}, the third is dropped from the height equivalent of a {height obj3}, and the fourth is dropped from the height equivalent of a {height obj4}.
Q: The ball dropped from the height of the [MASK] takes the {longest/shortest} amount of time to fall.

Height 2 (720)
C: There are four staircases. The first staircase leads to the top of a {height obj1}, the second staircase leads to the top of a {height obj2}, the third staircase leads to the top of a {height obj3}, and the fourth staircase leads to the top of a {height obj4}.
Q: The staircase leading to the top of the [MASK] is the {easiest/hardest} to walk up.
O (Height 1-2): A) {height obj1} B) {height obj2} C) {height obj3} D) {height obj4}

Circumf. 1 (720)
C: Four people are walking at identical speeds. The first walks around a {circ obj1}, the second walks around a {circ obj2}, the third walks around a {circ obj3}, and the fourth walks around a {circ obj4}.
Q: The [MASK] takes the {longest/shortest} amount of time to walk around.

Circumf. 2 (720)
C: A person paints a circle around a {circ obj1}, a {circ obj2}, a {circ obj3}, and a {circ obj4}.
Q: The circle around the [MASK] takes the {most/least} amount of paint.
O (Circumf. 1-2): A) {circ obj1} B) {circ obj2} C) {circ obj3} D) {circ obj4}

Stackable (2400)
C: A person is trying to stack {stack}, {no stack1}, {no stack2}, and {no stack3}.
Q: The [MASK] are the {easiest/hardest} to stack.
O: A) {stack} B) {no stack1} C) {no stack2} D) {no stack3}

Rollable (2400)
C: A person is trying to roll a(n) {roll}, a(n) {no roll1}, a(n) {no roll2}, and a(n) {no roll3}.
Q: The [MASK] is the {easiest/hardest} to roll.
O: A) {roll} B) {no roll1} C) {no roll2} D) {no roll3}

Graspable (2400)
C: A person is trying to move a pile of {grasp}, a pile of {no grasp1}, a pile of {no grasp2}, and a pile of {no grasp3} from one side of a room to the other using only one hand.
Q: The [MASK] is the {easiest/hardest} to move.
O: A) {grasp} B) {no grasp1} C) {no grasp2} D) {no grasp3}

Breakable (2400)
C: A person drops a {break}, a {no break1}, a {no break2}, and a {no break3} from a balcony.
Q: The [MASK] is the {most/least} likely to break.
O: A) {break} B) {no break1} C) {no break2} D) {no break3}

Slideable (2400)
C: A person is sliding four bricks across four hard surfaces. The first surface is covered with {slide}, the second surface is covered with {no slide1}, the third surface is covered with {no slide2}, and the fourth surface is covered with {no slide3}.
Q: The surface covered with [MASK] is the {hardest/easiest} for the brick to slide across.
O: A) {slide} B) {no slide1} C) {no slide2} D) {no slide3}

Bounceable (2400)
C: A person is trying to bounce a rubber ball. They drop a first ball onto {bounce}, a second ball onto {no bounce1}, a third ball onto {no bounce2}, and a fourth ball onto {no bounce3}.
Q: The ball dropped onto [MASK] bounces the {most/fewest} times.
O: A) {bounce} B) {no bounce1} C) {no bounce2} D) {no bounce3}

Table 1: All templates in PROST. C: = Context, Q: = Question, O: = Options. {} indicates a placeholder; the objects can be found in Table 3, and the remaining placeholders list their possible values in the braces. [MASK] indicates the position of the blank that the models need to fill. Note that the numbers of objects with and without each affordance are swapped when the superlative is inverted. See Section 3 for more information.

Figure 2: Scaling effect of models on accuracy. Circle size represents the number of parameters.

Table 2: Overview of the task preprocessing for the different architectures evaluated. In all methods, the context remains unchanged and is "A person is walking west. They turn left."

The validators obtained 100% agreement on the object ordering, and 94.6% agreement on the object group membership.

Table 3: Objects used in the templates.

Table 5: Macro average for each concept, and overall, for each model on PROST. The best accuracy among general pretrained-only models is displayed in bold. Note that the task average does not include UnifiedQA.

Table 6: Accuracy across the correct answer's position in the context.

Table 7: Absolute difference in accuracy between a question and its superlative inverse.