AROS: Affordance Recognition with One-Shot Human Stances

We present Affordance Recognition with One-Shot Human Stances (AROS), a one-shot learning approach that uses an explicit representation of interactions between highly articulated human poses and 3D scenes. The approach is one-shot since it does not require iterative training or retraining to add new affordance instances. Furthermore, only one or a small handful of examples of the target pose are needed to describe the interactions. Given a 3D mesh of a previously unseen scene, we can predict affordance locations that support the interactions and generate corresponding articulated 3D human bodies around them. We evaluate the performance of our approach on three public datasets of scanned real environments with varied degrees of noise. Through rigorous statistical analysis of crowdsourced evaluations, our results show that our one-shot approach is preferred up to 80% of the time over data-intensive baselines.


Introduction
Vision evolved to make inferences in a 3D world, and one of the most important assessments we can make is what can be done where. Detecting such environmental affordances allows the identification of locations that support actions, such as stand-able, walk-able, place-able, and sit-able. Human affordance detection is not only important in scene analysis and scene understanding but also potentially beneficial in object detection and labeling (via how objects can be used) and can eventually be useful for scene generation as well.
Recent approaches have worked toward providing this key competency to artificial systems via iterative methods, such as deep learning (Zhang et al., 2020a; Bochkovskiy et al., 2020; Carion et al., 2020; Du et al., 2020; Nekrasov et al., 2021). The effectiveness of these data-driven efforts is highly dependent on the number of classes, the number of examples per class, and their diversity. Usually, a dataset consists of thousands of examples, and the training process requires a significant amount of hand tuning and computing resources. When a new category needs to be added, sufficient new samples must be provided and training repeated. The appeal of one-shot training methods is clear.
Often, human pose-in-scene detection is conflated with object detection or other semantic scene recognition, for example, training to detect sit-able locations through chair recognition. This is a flawed approach for general action-scene understanding. First, people can recognize numerous non-chair locations where they can sit, e.g., on tables or cabinets (Figure 1). Second, an object-driven approach may fail to consider that affordance detection depends on the object pose and its surroundings: it should not detect a chair as sit-able if it is upside-down or if an object is on top of it. Finally, object detectors alone may struggle to perceive a potentially sit-able place if a particular object example was not covered during training.

FIGURE 1
AROS is capable of detecting human-scene interactions with one-shot learning. Given a scene, our approach can detect locations that support interactions and generate the interacting human body in a natural and plausible way. Images show examples of detected sit-able, reach-able, lie-able, and stand-able locations.
To address these limitations, Affordance Recognition with One-shot Human Stances (AROS) uses a direct representation of human-scene affordances. It extracts an explainable geometrical description by analyzing proximity zones and clearance space between interacting entities. The approach allows training from one or very few data samples per affordance and is capable of handling noisy scene data as provided by real visual sensors, such as RGBD and stereo cameras.
In summary, our contributions are as follows: 1) We propose a one-shot learning, geometry-driven affordance descriptor that captures both proximity zones and clearance space around human-pose interactions. 2) We set up a statistical framework that relies on both central tendency statistics and statistical inference to evaluate the performance of the compared approaches. The tests show that our approach generates natural and physically plausible human-scene interactions with better performance than intensively trained state-of-the-art methods. 3) Our approach demonstrates control over the kind of human-scene interaction sought, which permits exploring scenes with a concatenation of affordances.

Related work
Following Gibson's suggestion that affordances are what we perceive when looking at scenes or objects (Gibson, 1977), the perception of human affordances with computational approaches has been extensively explored over the years. Before the popularity of data-intensive approaches, Gupta et al. (2011) employed an environment geometric estimation and a voxelized discretization of four human poses to measure the environment affordance capabilities. This human pose method was employed by Fouhey et al. (2015) to automatically generate thousands of labeled RGB frames from the NYUv2 dataset (Silberman et al., 2012) for training a neural network and a set of local discriminative templates that permits the detection of four human affordances. A related approach was explored by Roy and Todorovic (2016), where detection was performed for five different human affordances through a pipeline of CNNs that includes the extraction of mid-level cues trained on the NYUv2 dataset (Silberman et al., 2012). Luddecke and Worgotter (2017) implemented a residual neural network for detecting 15 human affordances and trained using a look-up table that assigns affordances to object parts on the ADE20K dataset (Zhou et al., 2017).
Another research line has been the creation of action maps. Savva et al. (2014) generated affordance maps by learning relations between human poses and geometries in recorded human actions. Piyathilaka and Kodagoda (2015) used human skeleton models positioned in different locations in an environment to measure geometrical features and determine the support required. In Rhinehart and Kitani (2016), egocentric videos as well as scene, object, and action classifiers were used to build up the action maps.
There have been efforts to use functional reasoning for describing the purpose of elements in the environment that helped define them. Grabner et al. (2011) designed a geometric detector for sit-able objects, such as chairs, while further explorations performed by Zhu et al. (2016) and Wu et al. (2020) included physics engines to consider constraints, such as collision, inertia, friction, and gravity.
An important line of research is focused on generating human-environment interactions, representative of affordances detected in the environment. Wang et al. (2017) proposed an affordance predictor and a 2D human interaction generator trained on more than 20K images extracted from sitcoms with and without humans interacting with the environment. Li et al. (2019) extended this work by developing a 3D human pose synthesizer that learns on the same dataset of images but generates human interactions into input scenes that are represented as RGB, RGBD, or depth images. Jiang et al. (2016) exploited the spatial correlation between elements and human interactions on RGBD images to generate human interactions and improve object labeling. These methods use human skeletons for representing body-environment configurations, which reduces their representativeness since contacts, collisions, and naturalness of the interactions cannot be evaluated in a reliable manner.
Frontiers in Robotics and AI 02 frontiersin.org
In further studies, Ruiz and Mayol-Cuevas (2020) developed a geometric interaction descriptor for non-articulated, rigid object shapes. Given a 3D environment, the method demonstrated good generalization in detecting physically feasible object-environment configurations. Using the SMPL-X human body representation (Pavlakos et al., 2019), Zhang et al. (2020c) presented a context-aware human body generator that learned the distribution of 3D human poses conditioned on the scene depth and semantics via recordings from the PROX dataset (Hassan et al., 2019). In a follow-up effort, Zhang et al. (2020b) developed a purely geometrical approach to model human-scene interactions by explicitly encoding the proximity between the body and the environment, thus only using a mesh as input. Training CNNs and related data-driven methods requires the use of most, if not all, of the labeled dataset; e.g., in PROX (Hassan et al., 2019), there are 100K image frames.

AROS
Detecting human affordances in an environment means finding locations capable of supporting a given interaction between a human body and the environment. For example, finding "suitable to sit" locations identifies all those places where a human can sit, which can include a range of object "classes" (sofa, bed, chair, table, etc.). Our aim is a descriptor that characterizes such general interactions without requiring object classes, that is lightweight in terms of data requirements, and that outperforms alternative baselines.
The descriptor has two key components, which weigh the extraction of characteristics from areas with high (contact) and low (clearance) physical proximity between the interacting entities (Figure 2).
Importantly, the representation allows one-shot training per affordance, which is desirable to improve training scalability. Furthermore, our approach is capable of describing and detecting interactions between noisy data representations as obtained from visual depth sensors and highly articulated human poses.

A spatial descriptor for spatial interactions
We are inspired by recent methods that have revisited geometric features, such as the bisector surface for scene-object indexing (Zhao et al., 2014) and affordance detection (Ruiz and Mayol-Cuevas, 2020). Starting from a spatial representation makes sense if it helps reduce training data needs and simplify explanations, as long as it can outperform data-intensive approaches. Our affordance descriptor expands on the Interaction Bisector Surface (IBS) (Zhao et al., 2014), an approximation of the well-known Bisector Surface (BS) (Peternell, 2000). Given two surfaces $S_1, S_2 \subset \mathbb{R}^3$, the BS is the set of centers of spheres that touch both surfaces at one point each. Due to its stability and geometrical characteristics, the IBS has been used in context retrieval, interaction classification, and functionality analysis (Zhao et al., 2014; Hu et al., 2015; Hu et al., 2016; Zhao et al., 2016; Zhao et al., 2017; Ruiz and Mayol-Cuevas, 2020). Our approach expands on these ideas and is geometrically intuitive and straightforward. It explicitly captures areas that are important to be in scene-contact and those that are not. Importantly, we show how this approach can be generalized from just one or a small number of samples to a large number of unseen scenes.
Our one-shot training process represents interactions by 3-tuples $(M_h, M_e, p_{train})$, where $M_h$ is a posed human-body mesh, $M_e$ is an environment mesh, and $p_{train}$ is the reference point on $M_e$ that supports the interaction. Let $P_h$ and $P_e$ be the sets of samples on $M_h$ and $M_e$, respectively; their IBS $I$ is defined as

$$I = \{\, p \mid \min_{p_h \in P_h} \|p - p_h\| = \min_{p_e \in P_e} \|p - p_e\| \,\}.$$

We use the Voronoi diagram $D$ generated with $P_h$ and $P_e$ to produce $I$. By construction, every ridge in $D$ is equidistant to the pair of points that defined it. Then, $I$ is composed of the ridges in $D$ generated by one point from $P_h$ and one point from $P_e$. An IBS can reach infinity, but we limit $I$ by clipping it with the bounding sphere of $M_h$ with tolerance $ibs_{rf}$.
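As a rough illustration of the Voronoi construction described above, the following Python sketch collects the Voronoi vertices of ridges that separate a body sample from an environment sample and clips them to a sphere. It assumes SciPy's `scipy.spatial.Voronoi`; the helper name `discrete_ibs`, the random point clouds, and the clipping center and radius are hypothetical stand-ins rather than the implementation used in this work, and the sketch returns ridge vertices only instead of a full surface.

```python
import numpy as np
from scipy.spatial import Voronoi


def discrete_ibs(body_pts, env_pts, center, radius):
    """Vertices of Voronoi ridges shared by one body and one environment sample.

    Hypothetical helper: returns ridge vertices only (no faces), clipped to
    the sphere (center, radius).
    """
    pts = np.vstack([body_pts, env_pts])
    is_body = np.arange(len(pts)) < len(body_pts)
    vor = Voronoi(pts)
    keep = set()
    for (i, j), ridge in zip(vor.ridge_points, vor.ridge_vertices):
        if is_body[i] == is_body[j]:
            continue  # both generators lie on the same surface
        keep.update(v for v in ridge if v >= 0)  # -1 marks a vertex at infinity
    verts = vor.vertices[np.array(sorted(keep), dtype=int)]
    inside = np.linalg.norm(verts - center, axis=1) <= radius
    return verts[inside]


# Toy data (assumed): a "body" cloud floating above an "environment" slab.
rng = np.random.default_rng(0)
body = rng.uniform([-1, -1, 1.0], [1, 1, 1.5], size=(60, 3))
env = rng.uniform([-2, -2, -0.05], [2, 2, 0.05], size=(60, 3))
ibs = discrete_ibs(body, env, center=np.array([0.0, 0.0, 0.6]), radius=3.0)
```

Every returned vertex is equidistant to its nearest body sample and its nearest environment sample, which is the defining property of the discrete IBS.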
The number and distribution of samples in $P_h$ and $P_e$ are crucial for a well-constructed discrete IBS. A low sampling rate degenerates into an IBS that pierces the boundaries of $M_h$ or $M_e$. A higher density is critical in those zones where the proximity is high. To populate $P_h$ and $P_e$, we first use a Poisson-disc sampling strategy (Yuksel, 2015) to generate $ibs_{ini}$ evenly distributed samples on each mesh surface. Then, we perform a counterpart sampling that increases the number of samples in $P_e$ by including the closest points on $M_e$ to the elements in $P_h$; similarly, we incorporate in $P_h$ the closest points on $M_h$ to the samples in $P_e$. We perform the counterpart sampling strategy $ibs_{cs}$ times to generate a new $I$. However, we observed that for intricate human-scene poses, convergence to an IBS without mesh piercing is challenging. If the IBS penetrates the scene, we perform a collision-point sampling strategy, which adds as sampling points a subsample of the points where collisions happen and their counterpart points (on the body or the environment). We then recompute the IBS and repeat the counterpart and collision-point sampling strategies until we find a candidate $I$ that does not collide with $M_h$ or $M_e$. This is a straightforward process that can be implemented efficiently.
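The counterpart sampling step can be sketched as follows. This is a simplified illustration under stated assumptions: each mesh is approximated by a dense point set (`dense_h`, `dense_e`), so "closest point on the mesh" becomes "closest dense point", and the function names are hypothetical. The collision-point sampling and the Poisson-disc initialization are not shown.

```python
import numpy as np


def nearest(points, candidates):
    """For each point, return the closest candidate (brute force)."""
    d = np.linalg.norm(points[:, None, :] - candidates[None, :, :], axis=2)
    return candidates[d.argmin(axis=1)]


def counterpart_sampling(P_h, P_e, dense_h, dense_e, rounds=4):
    """Grow both sample sets with each other's closest counterpart points."""
    for _ in range(rounds):
        new_e = nearest(P_h, dense_e)  # closest environment points to body samples
        new_h = nearest(P_e, dense_h)  # closest body points to environment samples
        P_e = np.unique(np.vstack([P_e, new_e]), axis=0)  # drop duplicates
        P_h = np.unique(np.vstack([P_h, new_h]), axis=0)
    return P_h, P_e


# Toy data (assumed): dense stand-ins for the body and environment meshes.
rng = np.random.default_rng(1)
dense_h = rng.uniform(-1, 1, (300, 3)) + np.array([0.0, 0.0, 2.0])
dense_e = rng.uniform(-1, 1, (300, 3))
P_h, P_e = counterpart_sampling(dense_h[:20], dense_e[:20], dense_h, dense_e, rounds=2)
```

The sample sets only grow, and the added points concentrate where the two surfaces are closest, which is where a denser sampling matters most.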
To capture the regions of interaction proximity on our enhanced IBS as mentioned above, we use the notion of provenance vectors (Ruiz and Mayol-Cuevas, 2020). The provenance vectors of an interaction start from any point on $I$ and finish on $M_e$. Formally,

$$V_p = \{\, (a, \vec{v}) \mid a \in I,\ \vec{v} = e^{*} - a,\ e^{*} = \arg\min_{e \in M_e} \|e - a\| \,\},$$

where $a$ is the starting point of the delta vector $\vec{v}$ to the nearest point on $M_e$.

FIGURE 2
(Center) Only during training, we calculate the Voronoi diagram with sample points from both the environment and body surfaces to generate an IBS. (Right) We use the IBS to characterize the proximity zones and the surrounding space with provenance and clearance vectors. A weighted sample of these provenance and clearance vectors, $V_{train}$ and $C_{train}$, respectively, results in good generalization of the interaction.
Provenance vectors inform about the direction and distance of the interaction; the smaller the $|\vec{v}|$, the more important it is in the description. Let $V'_p \subset V_p$ be the subset of provenance vectors that finish on any point in $P_e$. We perform a weighted randomized selection of elements from $V'_p$ with the allocation of weights as follows:

$$w_i = 1 - \frac{|\vec{v}_i| - |\vec{v}_{min}|}{|\vec{v}_{max}| - |\vec{v}_{min}|}, \quad i = 1, 2, \ldots, |V'_p|,$$

where $|\vec{v}_{max}|$ and $|\vec{v}_{min}|$ are the norms of the biggest and smallest vectors in $V'_p$, respectively. The selected provenance vectors $V_{train}$ integrate into our affordance descriptor with an adjustment to normalize their positions with respect to the defined reference point $p_{train}$ as follows:

$$V_{train} = \{\, (a'_i, \vec{v}_i) \mid a'_i = a_i - p_{train},\ i = 1, 2, \ldots, num_{pv} \,\},$$

where $num_{pv}$ is the number of samples from $V'_p$ to integrate. The provenance vectors alone, however, are insufficient to work successfully on highly articulated objects, such as human poses. They are unable to capture the whole nature of the interaction. We expand this concept by taking a more comprehensive description that considers both areas of the IBS, those that are proximal to surfaces and those that are not.
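A minimal sketch of the weighted selection of provenance vectors is given below. It assumes a linear weight that maps the shortest vector to 1 and the longest to 0, sampling without replacement, and a NumPy random generator; the function name `select_provenance` and the toy data are hypothetical, and the exact sampling procedure used in this work may differ.

```python
import numpy as np


def select_provenance(starts, vectors, p_train, num_pv, rng):
    """Pick num_pv provenance vectors, favoring short ones (assumed linear weights)."""
    norms = np.linalg.norm(vectors, axis=1)
    span = norms.max() - norms.min()
    # Shortest vector gets weight 1, longest gets weight 0.
    weights = np.ones_like(norms) if span == 0 else 1.0 - (norms - norms.min()) / span
    idx = rng.choice(len(norms), size=num_pv, replace=False, p=weights / weights.sum())
    # Express starting points relative to the reference point p_train.
    return starts[idx] - p_train, vectors[idx], idx


# Toy data (assumed): random starting points on the IBS and their vectors.
rng = np.random.default_rng(0)
starts = rng.normal(size=(200, 3))
vectors = rng.normal(size=(200, 3))
p_train = np.array([0.5, 0.0, 0.0])
a_prime, v_sel, idx = select_provenance(starts, vectors, p_train, 64, rng)
```

Under these weights the longest vector has zero probability of being selected, so `num_pv` must not exceed the number of vectors with non-zero weight.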
We include a set of vectors into our descriptor to define the clearance space necessary for performing the given interaction. Given $S_h$, an evenly sampled set of $num_{cv}$ points on $M_h$, the clearance vectors $C_{train}$ that integrate into our descriptor of the interaction are defined as follows:

$$C_{train} = \{\, (s'_j, \vec{c}_j) \mid s'_j = s_j - p_{train},\ \vec{c}_j = \min\big(\varphi(s_j, \hat{n}_j, I),\, d_{max}\big)\, \hat{n}_j,\ j = 1, 2, \ldots, num_{cv} \,\},$$

where $p_{train}$ is the defined reference point, $\hat{n}_j$ is the unit surface normal vector on sample $s_j$, $d_{max}$ is the maximum norm of any $\vec{c}_j$, and $\varphi(s_j, \hat{n}_j, I)$ is the distance traveled by a ray with origin $s_j$ and direction $\hat{n}_j$ until collision with $I$.
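The construction of clearance vectors can be illustrated with the short sketch below. It assumes that the ray-to-IBS distances have already been computed by some ray-casting routine (not shown) and are passed in as `ray_dist`, with `np.inf` for rays that never reach the IBS; the function name and toy values are hypothetical.

```python
import numpy as np


def clearance_vectors(samples, normals, ray_dist, p_train, d_max):
    """Clearance vectors along unit normals, capped at length d_max."""
    lengths = np.minimum(ray_dist, d_max)        # inf becomes d_max
    origins = samples - p_train                  # relative to the reference point
    return origins, normals * lengths[:, None]


# Toy data (assumed): three body samples with upward normals.
samples = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.2, 0.0, 1.0]])
normals = np.tile(np.array([0.0, 0.0, 1.0]), (3, 1))
ray_dist = np.array([0.02, 0.08, np.inf])        # distance to the IBS along each normal
origins, C = clearance_vectors(samples, normals, ray_dist,
                               p_train=np.array([0.0, 0.0, 0.5]), d_max=0.05)
```

With `d_max = 0.05`, the three vectors have lengths 0.02, 0.05, and 0.05, so no clearance vector exceeds the cap.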
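Formally, our affordance descriptor, AROS, is defined as

$$f: (M_h, M_e, p_{train}) \longrightarrow (V_{train}, C_{train}, \hat{n}_{train}),$$

where $\hat{n}_{train}$ is the unit normal vector on $M_e$ at $p_{train}$. We calculate $\hat{n}_{train}$ to speed up the detection process.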

Human affordance detection
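Let $A = (V_{train}, C_{train}, \hat{n}_{train})$ be an affordance descriptor; we define its rigid transformation as the 3-tuple $(V^A_{\phi\tau}, C^A_{\phi\tau}, \hat{n}_{train})$, where

$$V^A_{\phi\tau} = \{\, (a''_i, \vec{v}''_i) \mid a''_i = R_\phi a'_i + \tau,\ \vec{v}''_i = R_\phi \vec{v}_i,\ (a'_i, \vec{v}_i) \in V_{train} \,\},$$

$$C^A_{\phi\tau} = \{\, (s''_j, \vec{c}''_j) \mid s''_j = R_\phi s'_j + \tau,\ \vec{c}''_j = R_\phi \vec{c}_j,\ (s'_j, \vec{c}_j) \in C_{train} \,\},$$

with $\tau \in \mathbb{R}^3$ being a translation vector and $\phi$ being the rotation around $z$ defined by $R_\phi$.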
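The rigid transformation of a descriptor (rotation about the z axis followed by a translation) can be sketched as below. The function names are hypothetical; starting points are rotated and translated, while vectors are only rotated, since they encode directions and lengths.

```python
import numpy as np


def rot_z(phi):
    """Rotation matrix for an angle phi (radians) about the z axis."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def transform_descriptor(starts, vectors, phi, tau):
    """Rotate starts and vectors by phi about z, then translate starts by tau."""
    R = rot_z(phi)
    return starts @ R.T + tau, vectors @ R.T


# Toy example (assumed values): one starting point and one vector.
a2, v2 = transform_descriptor(np.array([[1.0, 0.0, 0.0]]),
                              np.array([[0.0, 2.0, 0.0]]),
                              phi=np.pi / 2, tau=np.array([0.0, 0.0, 1.0]))
```

Here the starting point (1, 0, 0) is mapped to (0, 1, 1), and the vector (0, 2, 0) to (-2, 0, 0) with its norm preserved.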
Given a point $p_{test}$ on an environment mesh $M_{test}$ and its unit surface normal vector $\hat{n}_{test}$, we determine that such a location supports a trained interaction $A$ if (1) the angle difference between $\hat{n}_{test}$ and $\hat{n}_{train}$ is small, (2) once $A$ is translated to $p_{test}$ and oriented with $\phi_{test}$, there is a correct alignment of $V^A_{\phi\tau}$, and (3) only a gated number of the vectors in $C^A_{\phi\tau}$ is in collision with $M_{test}$. A significant angle difference between $\hat{n}_{test}$ and $\hat{n}_{train}$ allows us to short-cut the test and reject $p_{test}$ with reference to $A$. We establish $\rho_{\vec{n}}$ as the decision threshold for the angle difference. $\rho_{\vec{n}}$ is adjustable based on the level of mesh noise.

FIGURE 3
Approach for detecting human affordances. To mitigate 3D scan noise, the scene is augmented with spherical fillers for detecting collisions and SDF values. Our method detects if a test point in the environment can support an interaction by translating the descriptor to the test position over different orientations and measuring its alignment and collision rate. Then, the best-scored configuration is optimized to generate a more natural and physically plausible interaction with the environment.
If we observe a match between the normals $\hat{n}_{train}$ and $\hat{n}_{test}$, we transform the interaction descriptor $A$ with $\tau = p_{test}$ and $n_\phi$ different $\phi = \phi_{test}$ values within $[0, 2\pi]$. Hence, for each 3-tuple $(V^A_{\phi\tau}, C^A_{\phi\tau}, \hat{n}_{train})$ calculated, we generate a set of rays $R_{pv}$ defined as follows:

$$R_{pv} = \{\, (a''_i, \hat{\nu}_i) \mid \hat{\nu}_i = \vec{v}''_i / \|\vec{v}''_i\|,\ (a''_i, \vec{v}''_i) \in V^A_{\phi\tau} \,\},$$

where $a''_i$ is the starting point and $\hat{\nu}_i \in \mathbb{R}^3$ is the direction of each ray. We extend each ray in $R_{pv}$ by $\epsilon^{pv}_i$ until collision with $M_{test}$ as

$$(a''_i + \epsilon^{pv}_i\, \hat{\nu}_i) \in M_{test},$$

and compare $\epsilon^{pv}_i$ with the magnitude of the corresponding provenance vector in $V^A_{\phi\tau}$. When any element in $R_{pv}$ extends further than a predetermined limit $max_{long}$, the ray is classified as non-colliding. We calculate the alignment score $\kappa$ as the sum of differences between extended rays and provenance vectors with

$$\kappa = \sum_{i \,:\, \epsilon^{pv}_i \le max_{long}} \left|\, \epsilon^{pv}_i - \|\vec{v}''_i\| \,\right|.$$

The bigger the $\kappa$ value, the less the support for the interaction on $p_{test}$. We experimentally determine interaction-wise thresholds for the sum of differences, $max_\kappa$, and the number of missing ray collisions, $max_{missings}$, that permit us to score the affordance capabilities on $p_{test}$.
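The alignment test can be sketched as follows, assuming the ray extensions have already been obtained from a ray-mesh intersection routine (not shown) and that rays without a hit are reported as `np.inf`. The function names and the numeric values are hypothetical; the thresholds in this work are set experimentally per interaction.

```python
import numpy as np


def alignment_score(extensions, vectors, max_long):
    """Return (kappa, missing): summed length mismatch and count of missed rays."""
    norms = np.linalg.norm(vectors, axis=1)
    hit = extensions <= max_long                 # rays beyond max_long count as missing
    kappa = float(np.abs(extensions[hit] - norms[hit]).sum())
    missing = int((~hit).sum())
    return kappa, missing


def supports_interaction(kappa, missing, max_kappa, max_missings):
    """Accept the test point only if both thresholds are respected."""
    return kappa <= max_kappa and missing <= max_missings


# Toy example (assumed values): three provenance vectors and their ray extensions.
extensions = np.array([1.0, 0.5, np.inf])
vectors = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, -0.7], [0.0, 0.0, -0.3]])
kappa, missing = alignment_score(extensions, vectors, max_long=2.0)
accepted = supports_interaction(kappa, missing, max_kappa=0.5, max_missings=1)
```

In this toy case the first ray matches its vector exactly, the second is short by 0.2, and the third never collides, giving a score of about 0.2 with one missing ray.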
Clearance vectors are meant to fast-detect collision configurations by ray-mesh intersection calculation. Similar to provenance vectors, we generate a set of rays $R_{cv}$, whose origins and directions are determined by $C^A_{\phi\tau}$. We extend the rays in $R_{cv}$ until collision with the environment and calculate their extensions $\epsilon^{cv}_j$. Extended rays with $\epsilon^{cv}_j \le \|\vec{c}_j\|$ are considered possible collisions. In practice, we also track an interaction-wise threshold, $max_{collisions}$, to reject an affordance due to collisions.
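A self-contained sketch of the clearance test is shown below. It uses the standard Möller-Trumbore ray-triangle intersection as a stand-in for the ray-mesh intersection (the routine used in this work is not specified here), loops over triangles by brute force, and counts the clearance vectors whose ray reaches the scene within the vector's own length. All names and the toy scene are hypothetical.

```python
import numpy as np


def ray_triangle_distance(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore: distance along a unit ray to a triangle, or inf if no hit."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    h = np.cross(direction, e2)
    a = e1 @ h
    if abs(a) < eps:
        return np.inf                    # ray parallel to the triangle plane
    f = 1.0 / a
    s = origin - v0
    u = f * (s @ h)
    if u < 0.0 or u > 1.0:
        return np.inf
    q = np.cross(s, e1)
    v = f * (direction @ q)
    if v < 0.0 or u + v > 1.0:
        return np.inf
    t = f * (e2 @ q)
    return t if t > eps else np.inf      # ignore hits behind the origin


def count_clearance_collisions(origins, clearance, triangles):
    """Count clearance vectors whose ray hits the scene within the vector length."""
    hits = 0
    for s, c in zip(origins, clearance):
        length = np.linalg.norm(c)
        d = c / length
        eps_cv = min(ray_triangle_distance(s, d, t) for t in triangles)
        hits += eps_cv <= length
    return int(hits)


# Toy scene (assumed): one large floor triangle in the plane z = 0.
floor = np.array([[[-5.0, -5.0, 0.0], [5.0, -5.0, 0.0], [0.0, 5.0, 0.0]]])
origins = np.tile(np.array([0.2, 0.2, 1.0]), (3, 1))
clearance = np.array([[0.0, 0.0, -2.0], [0.0, 0.0, -0.5], [0.0, 0.0, 0.5]])
n_collisions = count_clearance_collisions(origins, clearance, floor)
```

Only the first vector is long enough to reach the floor from a height of 1, so a single possible collision is counted; a location would then be rejected when this count exceeds the interaction-wise threshold.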
A sparse distribution of clearance vectors on bi-dimensional noisy meshes in a 3D space results in collisions that are not detected by clearance vectors. To improve detection, we enhance scenes with a set of spherical fillers that pad the scene (see Figure 3). More details are provided in Supplementary Material.

Pose optimization
After a positive detection, we generate the body mesh representation used in training at the testing location. This generally has low levels of contact with the unseen environment. These gaps arise because our descriptor is built on the bisector surface between the interacting entities. We can eliminate the gap by translating the body until it touches the environment. However, this naïve method generates configurations that visually lack naturalness (Figure 3, pose with best score).
Every human-environment configuration trained has an associated 3D human SMPL-X characterization that we keep and use to optimize the human pose, as in the work of Zhang et al. (2020b), with the AdvOptim loss function. The optimization uses SDF values that have been pre-calculated in each scene on a grid of 256 × 256 × 256 positions.
Overall, we train a human interaction by generating its AROS descriptor from a single example, keeping the associated SMPL-X parameters of the body pose and defining the contact regions that the body has with the environment. After a positive detection with AROS, we use the associated SMPL-X body parameters and its contact regions to close the environment-body gap and generate a more natural body pose, as shown in Figure 3 (output). Our approach generalizes well on the description of interaction and generates natural and physically plausible body-environment configurations over novel environments with just one example for training (see Figure 4).

Experiments
We conduct experiments in various environment configurations to examine the effectiveness and usefulness of the affordance recognition performed by AROS. Our experiments include several perceptual studies, as well as a physical plausibility evaluation of the body-environment configurations generated.
Datasets: The PROX dataset (Hassan et al., 2019) includes data from 20 recordings of subjects interacting within 12 scanned indoor environments. An SMPL-X body model (Pavlakos et al., 2019) is used to characterize the shape and pose of humans within each frame in recordings. Following the setup in the work of Zhang et al. (2020b), we use the rooms MPH16, MPH1Library, N0SittingBooth, and N3OpenArea for testing purposes and train on data from other PROX scenes. We also perform evaluations on seven scanned scenes from the MP3D dataset (Chang et al., 2017) and five scenes from the Replica dataset (Straub et al., 2019). We calculate the spherical fillers and SDF values of all 3D scanned environments.

FIGURE 4
Our one-shot learning approach generalizes well on affordance detection. Only one example of an interaction is used to generate an AROS descriptor that generalizes well for the detection of affordances over previously unseen environments.
Training: We manually select 23 frames in which subjects interact in one of the following ways: sitting, standing, lying down, walking, or reaching. From these selected human-scene interactions, we generate the AROS descriptors and retain the SMPL-X parameters associated with human poses.
To generate the IBS associated with each trained interaction, we use an initial sampling set of $ibs_{ini} = 400$ on each surface, execute the counterpart sampling strategy $ibs_{cs} = 4$ times, and crop the generated IBS $I$ with $ibs_{rf} = 1.2$. The AROS descriptors are composed of $num_{pv} = 512$ provenance vectors and $num_{cv} = 256$ clearance vectors that extend up to $d_{max} = 5$ cm each.
The interaction-wise thresholds $max_\kappa$, $max_{missings}$, and $max_{collisions}$ are established experimentally, and $max_{long}$ is 1.2 times the radius of the sphere used to crop $I$. We use a moderate angle difference threshold of $\rho_{\vec{n}} = \pi/3$ and test $n_\phi = 8$ different orientations. With 512 provenance vectors $V_{train}$ and 256 clearance vectors $C_{train}$, the AROS descriptor characterizes an interaction with less than 40 KB, including the SMPL-X parameters.
Baselines: We compare our approach with the state-of-the-art PLACE (Zhang et al., 2020b) and POSA (contact only) (Hassan et al., 2021). PLACE is a pure scene-centric method that only requires a reference point on a scanned environment to generate a human body interacting around it. However, PLACE does not have control over the type of interaction detected/generated. We used naive and optimized versions of this approach in experiments (PLACE, PLACE SimOptim, and PLACE AdvOptim). POSA is a human-centric approach that, given a posed human body mesh, calculates the zones on the body where contact with the scene may occur and uses this feature map to place the body in the environment. We encourage a fair comparison by evaluating the naive and optimized POSA versions that consider only contact information and exclude semantic information (POSA and POSA optimized). In our studies, POSA was executed with the same human shapes and poses used to train AROS.

Physical plausibility
We evaluate the physical plausibility of the compared approaches mainly by following the work of Zhang et al. (2020b) and Zhang et al. (2020c). Given the SDF values of a scene and a generated body mesh, 1) the contact score is 1 if any mesh vertex has a negative SDF value and 0 otherwise; 2) the non-collision score is the ratio of vertices with a positive SDF value; and 3) to measure the severity of the body-environment collision on positive contact, we include the collision-depth score, which averages the depth of the collisions between the scene and the generated body mesh.
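The three scores can be computed from per-vertex SDF values as in the sketch below. One detail is an assumption made here for illustration: the collision-depth score is averaged over the penetrating vertices only (it could alternatively be averaged over all vertices). The function name and sample values are hypothetical.

```python
import numpy as np


def plausibility_scores(sdf):
    """Return (contact, non_collision, collision_depth) from per-vertex SDF values."""
    sdf = np.asarray(sdf, dtype=float)
    inside = sdf < 0                                   # vertices penetrating the scene
    contact = 1.0 if inside.any() else 0.0
    non_collision = float((sdf > 0).mean())
    # Assumption: mean penetration depth over penetrating vertices only.
    collision_depth = float(-sdf[inside].mean()) if inside.any() else 0.0
    return contact, non_collision, collision_depth


# Toy example (assumed values): four body vertices, two of them inside the scene.
contact, non_collision, depth = plausibility_scores([0.10, 0.20, -0.05, -0.15])
```

For these four vertices the contact score is 1, the non-collision score is 0.5, and the collision depth is about 0.1 in the units of the SDF grid.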

Ablation study
We evaluate the influence of clearance vectors, spherical fillers, and different optimizers on the PROX dataset. Three different optimization procedures are evaluated. The downward optimizer translates the generated body downward (-Z direction) until it comes in contact with the environment. The ICP optimizer uses the well-known Iterative Closest Point algorithm to align the body vertices with the environment mesh. The AdvOptim optimizer is described in Section 3.2.1. Table 1 shows that models without clearance vectors have the highest collision-depth scores among models with the same optimizer. AROS models present a reduction in contact and collision-depth scores in all cases that consider clearance vectors in their descriptors to avoid collision with the environment. Spherical fillers have a significant influence on avoiding collisions, producing the best scores in all metrics per optimizer. The ICP optimizer closes the body-environment gaps but drastically reduces the performance on both collision scores, while the AdvOptim and downward optimizers keep a trade-off between collision and contact. The best performance is achieved with affordance descriptors composed of provenance and clearance vectors, tested in scanned environments enhanced with spherical fillers.

Comparison with the state of the art
We generated 1300 interacting bodies per model in each of the 16 scenes and reported the averages of calculated non-collision, contact, and collision-depth scores. The results are shown in Table 2. In all datasets, interacting bodies generated using our approach provided a good trade-off with high non-collision but low contact and collision-depth scores.

Perception of naturalness
We use Amazon Mechanical Turk to compare and evaluate the naturalness of body-environment configurations generated by our approach and baselines. We used only the best version of the compared methods (with optimizer). Each scene in our test set was used equally to select 162 locations around which the compared approaches generate human interactions. MTurk judges observed all human-environment pairs generated through dynamic views, allowing us to showcase them from different perspectives. Each judge performed 11 randomly selected assessments, without repetition, that included two control questions to detect and exclude untrustworthy evaluators. Three different judges completed each of the evaluations. Our perceptual experiments include individual and side-by-side studies for each comparison carried out.
In our side-by-side comparison studies, interactions detected/generated by two approaches are shown simultaneously. Then, MTurkers were asked to respond to the question "Which example is more natural?" by direct selection.
We used the same set of interactions for individual evaluation studies, where judges rated every individual human-scene interaction by responding to "The human is interacting very naturally with the scene. What is your opinion?" on a 5-point Likert scale according to their agreement level: 1) strongly disagree, 2) disagree, 3) neither disagree nor agree, 4) agree, and 5) strongly agree.

FIGURE 5
Selected by a golden annotator, green spots correspond to examples of meaningful, challenging locations for affordance detection.

Randomly selected test locations
The first group of studies compares human-scene configurations generated at randomly selected locations. On the side-by-side comparison study that contrasts AROS with PLACE, our approach was selected as more natural in 60.7% of all assessments. Compared to POSA, ours is selected in 72.6% of all tests performed. The results per dataset are shown in Table 4 (% preferences in random locations).
Individual evaluation studies also suggest that AROS produced more natural interactions (see Table 3). In the first study, the mean and standard deviation of the scores given by the judges are 3.23 ± 1.35 for PLACE and 3.39 ± 1.25 for AROS, while in the second study, they are 2.79 ± 1.18 for POSA and 3.20 ± 1.18 for AROS. AROS scores have a larger mean in both studies, with a standard deviation narrower than that of PLACE and equal to that of POSA. However, these descriptive statistics must be used cautiously as evidence of a performance difference because they assume that the distribution of scores approximately resembles a normal distribution and that the ordinal variable was perceived as numerically equidistant by judges. Shapiro-Wilk tests (Shapiro and Wilk, 1965) performed on the data show that the score distributions depart from normality in both evaluation studies, PLACE/AROS and POSA/AROS, with p < 0.01.
Based on this, we performed a chi-square test of homogeneity (Franke et al., 2012) with a significance level α = 0.05 to determine whether the distributions of evaluation scores are statistically similar. When we observe significance, the level of association between the approach and the distribution of the scores is determined by calculating Cramer's V value (V) (Cramer, 1946).
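The test statistic and the effect size can be illustrated with the sketch below. The Likert counts are hypothetical and are not the data collected in this study. The closed-form tail probability is valid only for 4 degrees of freedom, which is the case for a table of 2 methods by 5 Likert levels. The final lines are a consistency check against statistics quoted in the text, under the assumption that each study retains 162 locations × 3 judges × 2 methods = 972 ratings.

```python
import math
import numpy as np


def chi2_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    obs = np.asarray(table, dtype=float)
    expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    stat = float(((obs - expected) ** 2 / expected).sum())
    dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return stat, dof


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with exactly 4 dof."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def cramers_v(stat, n, n_rows, n_cols):
    """Cramer's V effect size for an n_rows x n_cols table with n observations."""
    return math.sqrt(stat / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical counts (NOT the paper's data): rows = two methods,
# columns = Likert levels 1..5.
counts = [[60, 90, 110, 150, 76],
          [40, 70, 105, 170, 101]]
stat, dof = chi2_homogeneity(counts)
p_value = chi2_sf_df4(stat)
v = cramers_v(stat, int(np.sum(counts)), 2, 5)

# Consistency check with values quoted in the text, assuming n = 972 ratings.
v_posa_random = cramers_v(32.33, 972, 2, 5)
p_place_random = chi2_sf_df4(9.34)
```

Under the assumed sample size, the quoted chi-square values reproduce the reported association levels to about three decimals, and the statistic 9.34 corresponds to a tail probability of roughly 0.053.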
In this first set of randomly selected locations, data from the PLACE/AROS evaluation suggest that there is no statistically significant difference between score distributions ($\chi^2(4) = 9.34$, p = 0.053). A larger sample size may be necessary to observe statistical significance; however, the effect size would be negligible. Nevertheless, data from the POSA/AROS evaluation study showed that our approach performs better than POSA ($\chi^2(4) = 32.33$, p < 0.001) with a medium level of association (V = 0.1823).

FIGURE
AROS shows good performance on a variety of novel scenes.

Challenging test locations
A random sampling strategy is insufficient to fully evaluate the performance of pose affordance methods, since what matters for such methods is how they perform at realistic, albeit challenging, specific scene locations. For example, a test can be oversimplified and inadequate for evaluations if the sampled scene has relatively large empty spaces where only the floor or a big plane surface surrounds the test locations. Therefore, we crowdsource the evaluations in a new set of more realistic locations provided by a golden annotator (none of the authors) tasked with identifying areas of interest for human interactions (Figure 5). These locations are available for comparison as part of our dataset (https://abelpaor.github.io/AROS/).
The results of the side-by-side comparison studies confirm that in 60.6% of the comparisons with PLACE, AROS was considered more natural overall. Compared to POSA, AROS was marked with better performance in 76.1% of all evaluations, with a notable difference in MP3D locations, where AROS was evaluated to be more natural in 80.2% of the assessments. The results per dataset are shown in Table 4 (% preferences in challenging locations).
As in the randomly selected test locations, a descriptive analysis of the data from individual evaluation studies on these new locations suggests that AROS performs better than the other approaches, with larger mean values and standard deviations no wider than those of the baselines. The mean and standard deviation of the scores given by the judges are 2.97 ± 1.33 for PLACE and 3.44 ± 1.19 for AROS, while in the second study, they are 2.79 ± 1.25 for POSA and 3.5 ± 1.25 for AROS. However, a Shapiro-Wilk test performed on these data shows that the score distributions also depart from normality with p < 0.01 in both studies, PLACE/AROS and POSA/AROS.
A chi-square test of homogeneity, with α = 0.05, was used to determine whether both score distributions were statistically similar on the data from the PLACE/AROS evaluation study, providing evidence that there is a difference in score distributions (χ 2 (4) = 35.92, p < 0.001) with a medium level of association (V = 0.192).
However, an omnibus χ² statistic does not provide information about the source of the difference between the score distributions. To this end, we performed a post hoc analysis following the standardized residuals method described in the work of Agresti (2018). As suggested by Beasley and Schumacker (1995), we corrected our significance level (α = 0.05) with the Sidak method (Šidák, 1967) to its adjusted version α_adj = 0.005, with critical value z = 2.81. The analysis revealed a significant difference in the ratings of the interactions generated by PLACE and AROS, with ours being rated as natural more frequently.
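The adjusted significance level and critical value can be reproduced with a short sketch. It assumes that the number of comparisons is m = 10, that is, one test per cell of the 2 × 5 contingency table (two methods by five Likert levels), and that the critical value is two-sided; both assumptions are consistent with the reported values but are stated here only for illustration, as are the function names.

```python
from statistics import NormalDist

def sidak_alpha(alpha, m):
    """Sidak-adjusted per-comparison significance level for m comparisons."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def two_sided_critical_z(alpha):
    """Critical value of the standard normal for a two-sided test."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

# Assumption: one comparison per cell of a 2 x 5 table
n_cells = 2 * 5
alpha_adj = sidak_alpha(0.05, n_cells)

# Critical value from the unrounded adjusted level
z_exact = two_sided_critical_z(alpha_adj)
# Critical value when the adjusted level is first rounded to 0.005
z_rounded = two_sided_critical_z(0.005)

print(f"alpha_adj = {alpha_adj:.6f}")
print(f"z (unrounded alpha_adj) = {z_exact:.3f}")
print(f"z (alpha_adj = 0.005)   = {z_rounded:.3f}")
```

With α_adj rounded to 0.005, the two-sided critical value rounds to 2.81; using the unrounded adjusted level gives a marginally smaller threshold.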
The residuals associated with AROS indicate, with a significant difference, that the interactions generated by our approach were marked as "not natural" less frequently than expected: strongly disagree (z = −4.4, p < 0.001) and disagree (z = −2.98, p = 0.002). The data also show a significant difference in favorable evaluations, where PLACE received positive evaluations less frequently than predicted by the hypothesis of independence in agree (z = −3.04, p < 0.001). We also observed a marginal effect, still in favor of AROS, in the frequency of strongly agree evaluations (z = −2.3, p = 0.015), although it does not exceed the adjusted critical value.
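The post hoc step can be sketched as follows, using the common adjusted (standardized) Pearson residual, (O − E) / sqrt(E (1 − row share)(1 − column share)), and flagging cells whose absolute residual exceeds the critical value z = 2.81. The counts are hypothetical placeholders rather than the study data, and the helper name is illustrative.

```python
import math

CRITICAL_Z = 2.81  # critical value for the adjusted significance level

def adjusted_residuals(table):
    """Adjusted standardized residual for every cell of an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Variance term accounts for the row and column marginal shares
            variance = (expected
                        * (1.0 - row_totals[i] / n)
                        * (1.0 - col_totals[j] / n))
            res_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(res_row)
    return residuals

# Hypothetical counts (NOT the study data).
# Rows: baseline, AROS. Columns: strongly disagree ... strongly agree.
labels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
counts = [
    [80, 100, 90, 130, 87],
    [40, 70, 85, 170, 122],
]
residuals = adjusted_residuals(counts)
for method, res_row in zip(["baseline", "AROS"], residuals):
    for label, z in zip(labels, res_row):
        flag = "significant" if abs(z) > CRITICAL_Z else "n.s."
        print(f"{method:8s} {label:17s} z = {z:+.2f} ({flag})")
```

In a table with only two rows, the residuals of the two methods in each column have equal magnitude and opposite sign, so a negative residual for one method in a given score category corresponds to a positive residual of the same size for the other.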
Not surprisingly, the chi-square test of homogeneity (α = 0.05) on the data from the POSA/AROS evaluation study revealed strong evidence of a difference in score distributions (χ²(4) = 75.13, p < 0.001) with a larger level of association (Cramér's V = 0.278). The post hoc analysis with standardized residuals indicates that the naturalness of human-scene interactions generated by AROS is, overall, better than that of POSA. Table 5 shows the cross-tabulated data of the scores given by MTurkers and their standardized residuals (critical value z = 2.81 for α_adj = 0.005).

Qualitative results
Experiments verify that our approach can realistically generate human bodies that interact within a given environment in a natural and physically plausible manner. AROS allows us not only to determine the location in the environment at which we want the interaction to happen (the where) but also to select the specific type of interaction to be performed (the what).
The number and variety of interactions detected by AROS can easily be increased as a result of its one-shot training capacity. The more interactions are trained, the more human-scene configurations AROS can detect and generate. Figure 6 shows examples of different affordance detections around single locations.
AROS showed better performance in more realistic environment configurations where elements such as chairs, sofas, tables, and walls are present and must be considered during the generation of body interactions. Figure 7 shows some examples of interactions generated by AROS and the baselines at challenging locations.
In addition, AROS can be used to concatenate affordances over several positions to generate useful affordance maps for action planners (see Figure 8). This can be used as a way to generate visualizations of action scripts or to plan the ergonomics and usability of spaces beyond individual objects.

Conclusion
In this work, we present AROS, a one-shot geometry-driven affordance descriptor that is built on the bisector surface and combines proximity zones and clearance space to improve the affordance characterization of human poses. We introduced a generative framework that poses 3D human bodies interacting within a 3D environment in a natural and physically plausible manner. AROS generalizes well to previously unseen scenes. Furthermore, adding a new interaction to AROS is straightforward, since it requires only one example. Through rigorous statistical analysis, our results show that our one-shot approach outperforms data-intensive baselines, with human judges preferring AROS proposals up to 80% of the time. AROS can also be used to concatenate affordances over several positions, which provides a way to generate visualizations of action scripts in 3D scenes or to plan the ergonomics and usability of spaces beyond individual object affordances. We believe that an explicit and interpretable description is valuable for complementing data-driven methods and opens avenues for further work, including combining the strengths of both approaches.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://abelpaor.github.io/AROS/.