Dual edge classifier for robust cloth unfolding

Compared with more rigid objects, clothing items are inherently difficult for robots to recognize and manipulate. We propose a method for detecting how cloth is folded, to facilitate choosing a manipulative action that corresponds to a garment's shape and position. The proposed method involves classifying the edges and corners of a garment by distinguishing between edges formed by folds and the hem or ragged edge of the cloth. Identifying the types of edges in a corner helps to determine how the object is folded. This bottom-up approach, together with an active perception system, allows us to select strategies for robotic manipulation. We corroborate the method using a two-armed robot to manipulate towels of different shapes, textures, and sizes.


Introduction
In recent years, robots have contributed to a significant increase in the automation of industrial tasks. However, the automation of household tasks has yet to become commonplace. The demand for robots capable of assisting with household tasks is likely to increase in parallel with aging global populations, a demographic phenomenon caused by improved life expectancies and dropping birth rates. One operation that is central to many household tasks, including laundry, assisted dressing, and bed making, is the manipulation of cloth items. This skill, which is simple for most humans, is actually very difficult for robots to perform. The difficulty of cloth manipulation lies in the deformability, nonlinearity, and low predictability of the behavior of the materials. Because of their deformable nature, cloth objects are also inherently more difficult for robots to recognize than rigid objects. This is why it is often necessary to completely unfold cloth items prior to starting a task. An unfolded garment is easier to recognize and manipulate because a robot can then fit the shape to a model or locate interest points such as corners.
A common method of cloth unfolding is to lay the garment flat on a surface and unfold it, as in a pick-and-place problem [1][2][3][4]. In [3], similar to our method, the authors present an analysis of the types of corners in order to find strategies for unfolding. By contrast, our approach does not require a table or any flat surface, and involves simply grasping one point of the garment, lifting it into the air, and letting it hang from that point by the effect of gravity.
In this paper, we deal with a rectangular piece of cloth as a basic problem to investigate. Typical methods used to open such garments while hanging require locating predefined points and grasping them [5,6]. However, because there are often hidden folds, we analyze the depth of the garment's edges instead of searching for specific points, which allows us to extract information for forming a manipulation strategy. We distinguish between two types of edges: those that belong to the hem of the garment, which we call physical edges, and the remaining nonphysical edges, often formed by folds. Figure 1 shows an example of this edge classification. In the image on the right, the physical edges are marked in green and the nonphysical edges are marked in red. Locating physical edges is very useful for finding grasping points and for better understanding the shape of the garment. Opening the garment requires locating two corners formed by physical edges, which we call physical corners. These two corners should be consecutive, i.e., connected by the same physical edge. Once they are located, grasping each corner with one hand unfolds the garment.
Correspondence: a.gabas@aist.go.jp. 1 CNRS-AIST JRL (Joint Robotics Laboratory), IRL3218, Tsukuba, Japan
The configuration of edge types in the whole garment reveals some patterns. However, the high dimensionality of clothing items makes it very difficult to find global features that could identify an edge as physical or not. On the other hand, local features around edges tend to show slight differences between physical and nonphysical edges.
Therefore, to classify the edges, we propose a system that combines the results from two classifiers: a local one that selects a small patch around a pixel as an input and a global one whose input is the whole image. Finally, we present a categorization of the types of corners found in the image of the garment and use this categorization in an algorithm to actively choose the best robot action for opening the garment.
The main contributions of this work are the following:
• A combined local and global classifier capable of determining edge types.
• An algorithm that chooses the best course of action towards unfolding a garment according to its state, which is inferred from the types of edges. The algorithm is capable of locating physical corners even when they are occluded.
We apply this algorithm to the case of unfolding different towels and show how this skill can be applied to other garments.
In the "Related work" section, we present several related approaches to cloth manipulation. The "System overview" section describes the categorization of folding patterns for a cloth held in the air. In the "Cloth edge classification" section, we provide a detailed description of the edge type classifier. The "Action planning" section describes the algorithm that chooses the best action according to each folding pattern. Finally, in the "Experiments" section, we validate our system using different examples of rectangular cloth items.

Semantic edge detection
The use of learning techniques to detect edge information makes it possible to perform edge segmentation with respect to more subjective criteria than classical methods such as Canny [7]. In [8], the authors use random forests to learn a mid-level representation based on object contours called sketch tokens. Similarly, in [9], the authors use boosted decision trees to extract depth maps.
Semantic edge detection goes one step further by turning this binary classification into a multiclass problem. CASEnet [10] proposes a network that assigns one or more semantic labels to each edge pixel. The authors demonstrate the results using the Semantic Boundaries Dataset and the Cityscapes dataset. The work in [11] improves the results of CASEnet by applying full deep supervision.
This paper expands on previous work [12], which was, to our knowledge, the first attempt to teach machines semantic edge segmentation for the perception of deformable objects. In [12], edge detection is successful in finding the corner to be grasped to unfold a towel. In cases where the corner is hidden, however, the unfolding of the garment cannot be completed. In this work, we present a detection and manipulation technique that allows us to identify and grasp a corner that is hidden behind curled up cloth, and then bring the garment to a complete unfolded state.

Cloth manipulation
Feature detection is the approach most commonly used to locate a point to be grasped for unfolding. In [2], the authors detect the hem and propose grasping points that are later manually selected. If the garment lies on a surface and only presents some wrinkles, as in [13], topology analysis can be used to generate a strategy for flattening. Yuba et al. [14] use a "pinch and slide" action that involves locating a corner, grasping it, and then pinching the edge close to it before finally sliding toward the next corner.
With the advent of deep learning, several studies have tried to solve the cloth manipulation problem. Triantafyllou et al. [15] use horizontal edges and junctions found in the depth images as grasping points. This approach considers all of the depth edges without distinguishing whether they really belong to a physical edge or are produced by folds or noise, which can lead to selecting incorrect grasping points. Doumanoglou et al. [5] use random decision forests to learn to find specific points of garments (e.g., the shoulders of a t-shirt or the corners of a cloth). To handle cases where the points are not visible, the authors use a probabilistic action planner to acquire new views of the object by rotating it. However, soft garments tend to wrinkle in a way that can hide large parts of the object, including these specific points (see Fig. 2). In those cases, such points cannot be found, even by rotating the garment 360 degrees.
Similarly to Doumanoglou's method, Corona et al. [16,17] detect specific points for each garment. A neural network first identifies the garment type, and deep convolutional neural networks then locate the grasping points on the garment.
In the work by Hu et al. [18], the authors hold the unknown garment so that it forms one shape from a small set of limited shapes and match it with shapes in a database prepared in advance. To bring the item to such a limited shape, they first grasp the garment by the lowest hanging point and then by the farthest point from the vertical axis through the holding position, considering that the farthest point should be a characteristic point such as a shoulder. This second grasping strategy may not be applicable to all kinds of garments, especially soft garments.

Cloth shape observations
The main problem with working with deformable objects is that the number of configurations they can take is infinite. In order to limit the possible configurations of the garment, we leverage a simple observation to grasp the garment by one of its corners. If the garment is grasped by any random point, then the lowest point of the garment from a frontal view corresponds to one of the corners (see Fig. 3). The same observation was used in [5,17].
Regrasping by that lowest point ensures that the garment is grasped by one of its corners. We thereafter assumed this to be the initial position for all of the experiments. After grasping one corner, we gained insight by looking at how humans manipulate cloth before unfolding it. We found that the first action is often to look for any other contiguous corner and grab it. If the corner is not visible, humans tend to grasp one of the edges and slide the hand towards the corner.
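The lowest-point observation can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: it assumes the garment has already been segmented into a binary mask (a list of rows, 1 for cloth) whose row index grows downward, and the function name `lowest_cloth_pixel` is hypothetical.

```python
def lowest_cloth_pixel(mask):
    """Return (x, y) of the lowest cloth pixel in a binary mask, or None.

    mask is a list of rows; row indices grow downward, so the lowest
    point of the hanging garment has the largest row index.
    """
    for y in range(len(mask) - 1, -1, -1):      # scan from the bottom row up
        xs = [x for x, value in enumerate(mask[y]) if value]
        if xs:
            # Several pixels may share the bottom row; take the middle one.
            return xs[len(xs) // 2], y
    return None                                 # no cloth pixel in the mask


# Toy mask of a garment hanging from a random point (assumed data).
mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
regrasp_point = lowest_cloth_pixel(mask)
```

In a real setup the pixel would still have to be converted to a 3D grasp pose using the depth value and the camera-to-robot calibration.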

Analysis and categorization of cloth folding patterns
We present a categorization of the possible configurations of a cloth item. Next, we use the result to reveal and grasp the hidden corner. To understand how the garment is folded, expanding on the work in [12], we focus on distinguishing between physical and nonphysical edges, as mentioned earlier in the "Introduction" section.
Based on the edge types, the type of corner made by the edges can be classified. In the method proposed in this paper, we focus on the lateral (leftmost and rightmost) corners of a cloth held in the air and identify their types, as shown in Fig. 4.
With one physical corner being held, the bottom point always corresponds to the opposite corner. For the other two corners, there are three possible states: visible, curled forward, and curled backward. To evaluate the state of the corners of a garment, we observe the leftmost and rightmost corners of the perimeter, as shown in Fig. 5. If two physical edges are coming out of that corner, it is a real corner (e.g., the right corner in Fig. 5a). If one or more edges are nonphysical, then it is a pseudo corner. If two edges are coming out of a pseudo corner, the real corner is folded backward (e.g., left corner in Fig. 5a-c).

Fig. 2 Some methods try to locate specific points such as corners; however, cloth tends to curl over, which can hide these points. The green lines are the painted physical edges. Physical edges are detected in the RGB image using color segmentation and are used to generate labels. During training, we only use the depth image, and the neural network never receives this color information
If three edges are coming out of the corner, the real corner is folded forward (e.g., right corner in Fig. 5b-e). In the case of a corner folding forward, the actual physical corner is either visible (e.g., right corner in Fig. 5b) or hidden (e.g., right corner in Fig. 5c-e). In cases where it is hidden, further manipulation is needed to reveal it before grasping. Figure 6 shows the process that needs to be followed to identify the pattern in the leftmost and rightmost corners.
From this observation, we can see that it is possible to obtain crucial information about how the garment is folded simply by identifying the types of edges leading to the corners in these two points. Figure 7 shows the whole pipeline of the system. First, the robot takes the cloth to the initial position and then, from the depth image, the edges are extracted. Next, the leftmost and rightmost points are located and their folding pattern is classified according to the type and number of edges at each point. Finally, the robot executes an action according to the observation.
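The categorization rules above can be written down as a small decision function. The following is a sketch under stated assumptions: each edge meeting at a lateral corner has already been labeled physical (True) or nonphysical (False), the category names are hypothetical, and a three-edge corner is treated as a pseudo corner only if at least one of its edges is nonphysical.

```python
def classify_corner(edge_is_physical):
    """Classify a lateral corner from the edges that meet at it.

    edge_is_physical: list of booleans, True for a physical edge (hem),
    False for a nonphysical edge (typically a fold).
    """
    count = len(edge_is_physical)
    if count == 2 and all(edge_is_physical):
        return "real"              # two physical edges: real corner
    if count == 2:
        return "folded_backward"   # pseudo corner with two edges
    if count == 3 and not all(edge_is_physical):
        return "folded_forward"    # pseudo corner with three edges
    return "unknown"               # none of the defined categories


right_corner = classify_corner([True, True])
left_corner = classify_corner([True, False])
```

Whether a forward-folded physical corner is visible or hidden is a further distinction that this sketch does not attempt to make.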

Pipeline
In the next sections, we explain the details of each stage.

Cloth edge classification
The vision system takes a depth image of a garment as input data and classifies its edges as physical or nonphysical. It consists of two detectors: a local one and a global one. The local one only considers small patches in the image, around the point that it classifies. This is useful for generalization to other garments, but it lacks the ability to consider the global structure of the current item. For this purpose, we introduce a global detector that takes into account the whole image as it classifies the pixels.
Training a neural network requires large quantities of labeled data. Manually labeling the physical edges in thousands of images is not feasible owing to time constraints. To overcome this, we use a semi-automatic dataset generation method. We paint the physical edges of a cloth item (as seen in Fig. 2) and then, with an RGB camera, we detect and automatically label these edges. It should be noted that we use the RGB images only to generate the labels; this color information is never seen by the neural network, as it only uses depth information. Using this method, we are able to obtain hundreds of labeled images with minimal human intervention. The garment is hung from the robot end effector and rotated while the images are captured. After a full rotation of the garment, the shape of the garment is modified and another round of images is captured.

Image acquisition and preprocessing
We use a Kinect One sensor placed as shown in Fig. 8. The sensor provides an RGB image matrix I(p) and a depth image matrix D(p). Both cameras are calibrated so that each pixel p = (x, y) in the images corresponds to the same location in the real scenario. The camera is also calibrated with the robot so that its position relative to the robot is known.
Fig. 6 Categorization of folding states
Fig. 7 Sequence of processes for unfolding. First, the edges are extracted from a depth image. Then leftmost and rightmost points are located and classified. According to the type of folding pattern, an action is selected and executed by the robot
Fig. 8 The robot holds the cloth by one of the corners, placing it between the camera and the robot itself
To remove the pixels that do not correspond to the cloth, we filter by depth, keeping only the pixels that are at a distance $Z_{EE} \pm \gamma$, near the end effector (as shown in Fig. 8). Next, we extract the edges from the filtered image using the Canny algorithm [7]. We denote by $V_d$ the set of pixels in the resulting binary image.
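The depth filter can be illustrated with a short sketch. It assumes the depth image is a list of rows of distances in meters and that removed pixels are set to zero; the function name `filter_by_depth` and the numeric values are hypothetical. Edge extraction on the filtered image would then be done with any Canny implementation, for example OpenCV's `cv2.Canny`.

```python
def filter_by_depth(depth, z_ee, gamma):
    """Keep pixels whose depth lies within z_ee +/- gamma; zero the rest.

    depth: list of rows of distances (assumed to be in meters).
    z_ee:  distance from the camera to the end effector.
    gamma: half-width of the accepted depth band.
    """
    return [
        [d if abs(d - z_ee) <= gamma else 0.0 for d in row]
        for row in depth
    ]


# Toy depth image: background at 2.0 m, cloth near 1.0 m (assumed values).
depth = [
    [0.5, 1.0, 1.1],
    [1.3, 0.95, 2.0],
]
cloth_only = filter_by_depth(depth, z_ee=1.0, gamma=0.15)
```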
The RGB image is only used during training to generate label images $\{\hat{Y}_0(p), ..., \hat{Y}_N(p)\}$. When we train using a cloth with painted edges, we segment each image by color to extract a binary image label in which $\hat{Y}(p) = 1$ if the pixel p corresponds to a physical edge and $\hat{Y}(p) = 0$ otherwise.
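As an illustration of the label generation step, the sketch below thresholds an RGB image for the painted color. It assumes the paint is green and that a pixel counts as painted when its green channel is both bright and clearly dominant; the function name and both thresholds (`min_green`, `margin`) are hypothetical values chosen for the example, not the ones used to build the dataset.

```python
def physical_edge_labels(rgb, min_green=120, margin=40):
    """Return a binary label image: 1 where the pixel looks painted green.

    rgb: list of rows of (r, g, b) tuples with channel values in 0..255.
    A pixel is labeled 1 if its green channel is at least min_green and
    exceeds both other channels by at least margin (assumed criteria).
    """
    return [
        [1 if g >= min_green and g - max(r, b) >= margin else 0
         for (r, g, b) in row]
        for row in rgb
    ]


rgb = [
    [(200, 200, 200), (30, 180, 40)],   # white cloth, painted edge
    [(10, 90, 10), (60, 200, 170)],     # dark green, cyan-ish pixel
]
labels = physical_edge_labels(rgb)
```

Because the RGB and depth images are registered pixel to pixel, the resulting label image can be paired directly with the depth image of the same frame.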

Local detector
As a local detector, we use the same structure as in our previous work [12]. Figure 9 shows the way the inputs and outputs to the network are arranged. For each pixel in $V_d$, a patch h(p) of size 50 × 50 is extracted around that point from D(p). The patch size was determined empirically by visually analyzing the images. It corresponds to a size that is big enough to contain some context surrounding the point and small enough to avoid capturing other nearby edges that could affect the classification. Batches of patches are fed into the neural network. After the input layer, we set a convolutional layer (Fig. 9a) with 32 convolutional kernels of size 3 × 3 and stride 1. The next layer (Fig. 9b) is a batch normalization layer followed by a max pool layer of size 2 and a rectified linear unit (ReLU). This structure is repeated in the subsequent layers (see Fig. 9c-d), with a 64-kernel convolution of the same size. The last set of convolution layers (Fig. 9e-f) consists of 128 kernels of the same size as the previous ones. The output of (f) is flattened into a vector of length 2048 (g), which is then passed to a fully connected layer of 500 neurons (h). Finally, the output layer (i) has two neurons whose activations indicate the probability of the pixel belonging to a physical or nonphysical edge.
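The layer sizes above can be checked with a short calculation. The sketch assumes that the convolutions are unpadded and that pooling rounds down; neither is stated explicitly, but together they reproduce the flattened length of 2048. The function name `flattened_length` is hypothetical.

```python
def flattened_length(patch=50, channels=(32, 64, 128), kernel=3, pool=2):
    """Length of the flattened feature vector after the last stage.

    Assumes unpadded, stride-1 convolutions and floor-rounded pooling.
    """
    size = patch
    for _ in channels:
        size = size - kernel + 1   # unpadded 3 x 3 convolution, stride 1
        size = size // pool        # max pooling of size 2
    return channels[-1] * size * size


# Spatial size per stage: 50 -> 24 -> 11 -> 4, so 128 * 4 * 4 = 2048.
vector_length = flattened_length()
```

With padded convolutions the same three stages would instead give 128 × 6 × 6 = 4608, which is why the unpadded reading is assumed here.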
For each batch of N samples $X = \{h(p_0), ..., h(p_N)\}$ (with p from the set $V_d$), the neural network returns $\{y(p_0), ..., y(p_N)\}$, with y(p) being the probability of pixel p belonging to a physical edge. We then evaluate the binary cross entropy loss between these predictions and the labels $\hat{Y}(p)$.
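For reference, a plain-Python version of the binary cross entropy is sketched below. It uses the standard definition and assumes the loss is averaged over the batch; the clamping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero. In practice a framework routine such as PyTorch's `torch.nn.BCELoss` would be used.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy between predictions and 0/1 labels.

    probs:  predicted probabilities y(p) of being a physical edge.
    labels: ground-truth labels (1 = physical edge, 0 = otherwise).
    """
    total = 0.0
    for y, t in zip(probs, labels):
        y = min(max(y, eps), 1.0 - eps)   # clamp to keep log() finite
        total += -(t * math.log(y) + (1 - t) * math.log(1.0 - y))
    return total / len(probs)


# An uninformative prediction of 0.5 costs log(2) per sample.
loss = binary_cross_entropy([0.5, 0.5], [1, 0])
```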

Global detector
Since the local detector classifies pixels individually without taking into account the full cloth, it is prone to producing discontinuities in an edge. To compensate for this effect, we use a global detector that takes into account the whole image and classifies every pixel in the image by using a fully convolutional neural network. Figure 10 shows the structure of the network. The orange boxes represent the feature maps at each convolution layer. The yellow boxes are the feature maps at each deconvolution layer merged with the features from early stages of the neural network (represented by the gray arrows). Each box follows the ResNet architecture [19] and is followed by a batch normalization layer and ReLU activation.
In this case, we formulate the problem as a multi-label problem. Each of the N label images contains K binary images $\bar{Y}^{(k)}$, one for each of the K categories, and we use a multi-label loss. To compensate for the skewness in the dataset, the loss is weighted by $\epsilon$ and $(1 - \epsilon)$, which represent the percentage of non-edge and edge pixels, respectively. Similar to other works [10,11], we perform supervision at each stage. Supervision layers (represented by blue lines in Fig. 10) extract feature layers at each stage. We denote the weights as $W = \{w_0, ..., w_n\}$ for each of the n = 9 layers. The supervised loss is evaluated as the sum of the multi-label loss of each of the individual layers. The final loss L consists of the loss at the output layer and the supervision loss, where a parameter between 0 and 1 defines the weight of the supervision in the final loss.
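One plausible way to write these losses is sketched below. The form is an assumption modeled on the class-balanced multi-label loss of [10], with $\epsilon$ weighting the edge term; the symbols $Y^{(k)}(p; w)$ for the predicted probability, $\mathcal{L}_{ml}$, $\mathcal{L}_{sup}$, $\mathcal{L}_{out}$, and the supervision weight $\lambda$ are hypothetical names introduced only for this illustration.

```latex
% Class-balanced multi-label loss for one set of weights w (assumed form)
\mathcal{L}_{ml}(w) = \sum_{k=1}^{K} \sum_{p}
  \Big[ -\epsilon \, \bar{Y}^{(k)}(p) \log Y^{(k)}(p; w)
        - (1 - \epsilon) \big(1 - \bar{Y}^{(k)}(p)\big)
          \log\big(1 - Y^{(k)}(p; w)\big) \Big]

% Supervision loss: sum over the supervised layers
\mathcal{L}_{sup} = \sum_{w_i \in W} \mathcal{L}_{ml}(w_i)

% Final loss: output loss plus weighted supervision loss
% (\lambda is a hypothetical name for the weight between 0 and 1)
\mathcal{L} = \mathcal{L}_{out} + \lambda \, \mathcal{L}_{sup}
```

Because edge pixels are rare, weighting the edge term by the fraction of non-edge pixels keeps the two classes from contributing very unequally to the gradient.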

Output
For each pixel, we have two classification results, one coming from the local detector and the other from the global one. We can weight the two outputs to give more importance to generalization or to global structure by tuning β.
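A simple way to realize this weighting is a convex combination of the two probabilities. The sketch below is an assumption about the form of the combination: it lets β weight the local detector and (1 − β) the global one, which may be the reverse of the convention actually used. The default β = 0.6 is the value reported in the experiments; the function name is hypothetical.

```python
def combine_outputs(y_local, y_global, beta=0.6):
    """Blend local and global probabilities for one pixel.

    Assumed form: beta weights the local detector (generalization) and
    (1 - beta) weights the global detector (global structure).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * y_local + (1.0 - beta) * y_global


# A pixel the local detector is confident about and the global one is not.
y = combine_outputs(0.9, 0.4)
is_physical = y > 0.5
```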

Action planning
We introduce five actions the robot can take to accomplish the goal of unfolding the garment: Grasp, Rotate, Shake, Follow-Edge, and Unfold. The Unfold action is the last action (as shown in Fig. 11g), and after it the garment should be in an unfolded state. Otherwise, the process starts again from the beginning. The Rotate action performs a rotation of the garment around the vertical axis by rotating the end effector of the robot arm that holds the cloth. The Grasp action is performed with the free hand by grasping a point on the garment, usually a corner. In the Shake action, the arm that is holding the garment shakes it, allowing it to spread vertically under gravity. Finally, the Follow-Edge action moves the free hand's end effector along one of the physical edges.
The algorithm starts from the initial position, i.e., the robot holding one of the garment's corners (Fig. 11a). We assume this position can be reached following the observation in the "Cloth shape observations" section. In other words, the robot first grabs the garment by any point and then, with the other arm, grasps the lowest point, which corresponds to a corner.
Fig. 10 The global detector is a fully convolutional neural network with deep supervision. The orange boxes represent the feature maps at each convolution layer. The yellow boxes are the feature maps at each deconvolution layer merged with the transferred features from early stages (grey arrows). The blue arrows represent the feature extraction for deep supervision at each layer
Next, the farthest horizontal point is examined (Fig. 11b). A hanging garment will typically take the shape of a rough triangle, with its hypotenuse along the vertical axis. We showed in the "Analysis and categorization of cloth folding patterns" subsection that this outer corner is crucial to understanding how the garment is shaped.
We then analyze the edges that are connected to the farthest corner. If there are two physical edges (Fig. 11c), the corner in question is a real corner and we can proceed to grasp and then unfold it by extending it.
If it is a pseudo corner, we look more closely at the edge types and determine the type of folding, as shown in the "Analysis and categorization of cloth folding patterns" subsection. If the edge folds backward (Fig. 11d), the corner is probably behind the garment and the appropriate action is to rotate the garment to reveal the corner.
If it folds forward (Fig. 11e), we will move the end effector of the free arm along the trajectory defined by the physical edge to reveal the corner, grasp it, and then unfold the garment.
If the detected edges do not correspond to any of the defined categories, we will perform an action to shake the garment to loosen any folds and extend it by the effect of gravity. Then, we start the process again.
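The decision logic of this section can be summarized as a lookup from corner category to a sequence of actions. The sketch below is a simplification under stated assumptions: the category strings are hypothetical labels for the folding patterns, and perception and motion execution are left out entirely.

```python
def select_actions(corner_category):
    """Map the category of the farthest corner to the robot actions.

    After Rotate or Shake the garment is observed again and the
    selection is repeated; Unfold ends the process.
    """
    plans = {
        "real": ["Grasp", "Unfold"],                         # Fig. 11c
        "folded_backward": ["Rotate"],                       # Fig. 11d
        "folded_forward": ["Follow-Edge", "Grasp", "Unfold"],  # Fig. 11e
    }
    # Any pattern outside the defined categories triggers a shake.
    return plans.get(corner_category, ["Shake"])


plan = select_actions("folded_forward")
```

Keeping the mapping in a table makes it easy to add categories for other garment types without touching the control flow.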

Experimental setup
In all of the experiments, we use a Baxter robot and a Kinect One camera facing each other, as seen in Fig. 8. The neural networks are implemented using the open-source software PyTorch [20]. The GPU is an NVIDIA GTX 1080 with 8 GB of memory, and the CUDA version is 10.0. In all of the experiments, unless stated otherwise, the supervision weight in Eq. 4 is set to 1 and β = 0.6 in Eq. 5. Training was done with a garment with painted edges (see Figs. 2 and 3), from which we extracted more than 1600 images. This amounted to more than 3.2 million patches.
Each experiment begins with the robot holding a cloth with its right arm as an initial state, then taking actions to unfold it with the left arm (Fig. 13). The camera is calibrated and its position with respect to the robot is known. We conducted three types of experiments. First, we analyzed the robot's performance in edge classification and grasping for 20 attempts using the same garment (see Figs. 2 and 3). Then, we validated the results of our method by having the robot unfold several previously unseen garments. Finally, to demonstrate the effectiveness of the global detector, we show an ablation study comparing the local + global detector with the local-only detector from previous work [12].

Training
The training progress is shown in Fig. 12. The top row shows the loss and accuracy during training and validation. We train for 15 epochs, stopping before any signs of overfitting. To demonstrate that the amount of data gathered is enough, we trained the system with increasing amounts of samples in the dataset. The left graph in the bottom row of Fig. 12 shows that the loss quickly decreases as we increase the size of the dataset. After around 750 samples, the loss decreases more slowly as the dataset grows, indicating diminishing returns from adding more data. The graph on the bottom right shows the accuracy, which increases correspondingly at a similar rate.

Classification and grasping
In order to show the effectiveness of the method and determine the stage at which possible errors might occur, we performed 20 attempts to unfold a single garment (seen in Figs. 2 and 3). We studied these attempts to ascertain whether the edge classification had been produced correctly and whether the grasping and unfolding were successful. The results are summarized in Table 1. Figure 13 shows an example of unfolding when the inner corner is hidden. The robot followed the trajectory of the physical edge to reveal the corner, grasp it, and successfully unfold the cloth. The video in Additional file 1 contains examples of robot cloth unfolding. Table 1 shows the four possible outcome cases depending on the success or failure in corner classification and unfolding. A circle in the corner classification column indicates that the corners were correctly classified in all the steps. The first row indicates that in 75% of the attempts the unfolding was successful with correct corner classification in every step. The second row indicates that in 10% of the attempts the unfolding was successful even though the corner classification was not correct in every step. This happens when a classification error occurs somewhere in the process, but after some action (like Rotate) the next step leads to a correct classification and grasping. In this case, the corner classification was correct in 85.72% of the steps. Note that we do not differentiate between success in grasping and success in unfolding because a successful grasp led to successful unfolding in every attempt.

Generalization
We tested the results of the system through experiments using four cloths of different sizes and textures that were not seen during training (shown in Fig. 14). The robot attempts to grasp each garment 20 times, and the success rates of edge classification and unfolding are shown in Table 2. For each attempt, we consider the corner classification successful if it was correct in all the steps, and the unfolding is considered successful if the physical edges form a square. The success rate represents the percentage of successes in the 20 attempts. Cloth A (seen in Fig. 14) reaches 100% in corner-type classification. This garment is, in fact, the most similar to the one used during training. Cloths B and D are smaller and have different folding patterns. Cloth B is the most different in terms of color texture and it has the lowest classification accuracy. That said, the cause of the lower corner classification success ratio is not the color texture, but the tendency of this cloth to curl up and hide its edges more often than the others. Images with a physical edge present are correctly classified most of the time regardless of the cloth's texture. Errors tend to appear in cases with hidden edges. These are more difficult to classify, and an increased tendency of a cloth to curl is the main factor affecting the classification success ratios.
Fig. 13 Sequence of the robot unfolding a hidden corner folded forwards. By moving the robot's gripper through a trajectory defined by the physical edge (the points from A to B) we can reveal the hidden corner
Table 1 Outcomes of 20 grasping attempts. An "X" in a cell means the operation in that column was not successful, whereas a circle indicates success. The ratio indicates the percentage of attempts with that particular corner classification and unfolding outcome. (1) Corner classification was correct in 85.72% of the steps

Ablation study
To show the benefits of using both global and local classifiers, we compare the percentage of correctly classified edge pixels when taking into account both local and global classifiers with the ablated version using only the local classifier, as in [12]. For this experiment, we consider not only the classification of the edges in the corner but the classification of all the edges in the image. The success ratio represents the ratio of pixels in the image's edges that are correctly classified to the total number of pixels in the image's edges. Table 3 shows the results for each cloth. Again, cloths A and C, which are the largest (their length is similar and they are rectangular) and the most similar to the one used during training in terms of cloth texture, are the ones with the best accuracy. For these cloths, the global detector does not add a big increase in accuracy, since the results of the local detector are already high. Cloths B and D, which are shorter and more square, benefit from knowing the whole structure of the garment and significantly improve their results when the global classifier is taken into account.
Figure 15 shows the most common example of failure in edge detection. Because we are exclusively using the depth image to detect the edges, having the edges very close to another layer of cloth can lead to failure in their detection. This kind of error, however, only tends to happen around the center of the cloth. The proposed method for analyzing the leftmost and rightmost corners generally avoids this kind of error because there is usually a background behind those points and not another layer of the cloth.
There are two main possibilities for increasing the accuracy in pixel detection. The first is to improve the inputs, that is, to improve the resolution of the sensor or to add more channels (RGB color). The other is to focus on the design and optimization of a new neural network model.
The most common cause of failure in manipulation, and also the main cause of failure overall, is the inability to find an inverse kinematics solution for the grasping point. This is more common when trying to reveal a hidden corner. To address this, we relaxed the tolerance of the goal position and orientation, and tried to find other configurations within a short distance and angle from the desired configuration. We also use an L-shaped gripper to grasp the edge at an angle, making it easier to find a solution for the inverse kinematics. In order to further improve the results, a more task-specific robot could be designed. However, we chose to use a two-arm robot with 7 degrees of freedom in each arm, which is a common design.
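The success ratio used in this ablation can be computed directly from the predicted and labeled images. The following is a minimal sketch, assuming both images are lists of rows of class ids and that the edge pixels are given as (x, y) coordinates; the function name and the toy data are hypothetical.

```python
def edge_pixel_success_ratio(predicted, truth, edge_pixels):
    """Fraction of edge pixels whose predicted class matches the label.

    predicted, truth: images as lists of rows of class ids
                      (e.g., 1 = physical, 0 = nonphysical).
    edge_pixels:      iterable of (x, y) coordinates of edge pixels.
    """
    edge_pixels = list(edge_pixels)
    if not edge_pixels:
        raise ValueError("no edge pixels to evaluate")
    correct = sum(
        1 for (x, y) in edge_pixels if predicted[y][x] == truth[y][x]
    )
    return correct / len(edge_pixels)


truth = [[1, 1, 0], [0, 1, 0]]
predicted = [[1, 0, 0], [0, 1, 1]]
edges = [(0, 0), (1, 0), (1, 1), (2, 1)]   # only these pixels are scored
ratio = edge_pixel_success_ratio(predicted, truth, edges)
```

Restricting the count to edge pixels matters because non-edge pixels dominate the image and would otherwise inflate the ratio.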

Discussion
We presented a comparison with our previous method that shows an increase in the accuracy of pixel classification, and most importantly solves a previously unsolved problem: revealing a hidden corner. The method in [5] is similar to ours in that it unfolds the garment while hanging. However, when the corner or feature to grasp is hidden and not found after a full rotation, they restart the whole process by regrasping the garment. We chose to take a different approach to find a strategy to reveal the hidden corner. Other methods like [15,21] use a table to assist in the unfolding operation and are not directly comparable to ours.

Conclusions
We have presented a method for manipulating cloth items that is based on reliable identification of the types of edges in a depth image. Using only depth information makes the algorithm robust to changes in color and texture. This also makes it possible to use the color information to generate a large number of labeled examples for further network training.
Our method recognizes how a cloth is folded by analyzing the types of edges connecting to the leftmost and rightmost corners in the depth image, which facilitates choosing the next appropriate action.
We employed both local and global classifiers to benefit from generalization of the former and the ability of the latter to take into account the whole structure.
The experiments demonstrated that, with a high success ratio (85%), the robot was able to grasp a corner of the cloth in order to unfold it even when the corner was not visible in the image. We also showed how the method can be expanded to include other types of cloth not seen during the training.
Contrary to methods that try to model the whole cloth item in order to manipulate it, we showed that finding and analyzing the edges is a promising way to understand how to manipulate an object with a robot. With further research, this method could be extended to other types of garments.
The main limitation is the restriction to rectangular cloths. Future work can address this limitation by studying other common patterns found in the edges of other types of folded cloth. For example, a t-shirt often presents similar patterns, but more analysis and strategies are needed to deal with the sleeves.