Tactile Model O: Fabrication and testing of a 3d-printed, three-fingered tactile robot hand

Bringing tactile sensation to robotic hands will allow for more effective grasping, along with the wide range of benefits of human-like touch. Here we present a 3d-printed, three-fingered tactile robot hand comprising an OpenHand Model O customized to house a TacTip optical biomimetic tactile sensor in the distal phalanx of each finger. We expect that the grasping capabilities of the Model O combined with the benefits of sophisticated tactile sensing will result in an effective platform -- the tactile Model O (T-MO). Our current T-MO design uses three JeVois machine vision systems, each comprising a miniature camera in the tactile fingertip with a vision processing module in the base of the hand. To evaluate the capabilities of the T-MO, we benchmark its grasping performance using the Gripper Assessment Benchmark on the YCB object set. We then test its tactile sensing capabilities with two experiments: firstly, tactile object classification on a subset of objects that can be reliably grasped, and secondly, predicting whether a grasp will successfully lift one of these objects under randomly perturbed grasps that sometimes fail. In all cases, the results are consistent with the state-of-the-art, taking advantage of advances in deep learning and convolutional neural networks from computer vision that apply to the tactile image outputs. Overall, this work demonstrates that the T-MO is an effective platform for robot hand research and we expect it to open up a range of applications in autonomous object handling. Video: https://youtu.be/oZ41U5pyK6Y


I. INTRODUCTION
Tactile afferents in our hands provide information about the state of a grasp and, crucially, whether the grasp is failing [1]. Therefore, bringing tactile sensation to robotic hands will allow for more effective grasping, along with the wide range of benefits of human-like touch to perform tasks we take for granted yet robots are presently incapable of achieving.
Although robotic hands with integrated tactile sensors are becoming more common [2], [3], the vast majority of research on robotic hands still focuses on vision-guided grasp planning. In part, this is because there is a limited selection of commercial tactile-enabled hands, which are expensive and behind the development curve of the most advanced tactile sensors. Thankfully, the advent of fast, precise 3d-printing technologies now means it is inexpensive and relatively simple to adapt tactile sensors and robotic hands into new integrated platforms. Scaling to multiple sensors on a robotic hand presents several challenges, such as how to control the hand with integrated sensor feedback. With those challenges in mind, the presented platform offers a balance of dexterous capability against relative simplicity in its underactuation and fabrication.
The authors are with the Dept of Engineering Mathematics, University of Bristol, UK and Bristol Robotics Laboratory, University of the West of England, UK. Email: {jj16883@bristol.ac.uk. The two lead authors contributed equally to the research in this paper.
The aim of this study is to modify a three-fingered robotic hand, the OpenHand Model O [4], to house the TacTip, an optical biomimetic tactile sensor [5], [6]. Prior work on integrating this sensor with 3d-printed hands has involved two 2-DoF, two-fingered robotic hands: the OpenHand Model M2 gripper with a tactile thumb [7] and the GR2 gripper with two tactile fingertips [8]. The current work represents a major advance in the functionality of 3d-printed tactile dexterous hands by integrating three tactile fingertips within a 4-DoF hand with state-of-the-art grasping capabilities. We expect that the grasping capabilities of the Model O combined with the success of the TacTip as a tactile sensor will result in an effective platform -- the tactile Model O (T-MO) -- for a wide range of future tactile robot hand applications and research.
One of the advantages of using an optical tactile sensor is that rapid advancements in the field of computer vision can be leveraged for tactile tasks. The leading methods in computer vision and related areas involve deep neural networks, in particular convolutional neural networks (CNNs). Here we apply CNNs to tactile data, using techniques such as data augmentation and regularisation with an appropriate choice of network architecture. These methods have only just entered the field of tactile sensing, achieving impressive results for another optical tactile sensor, the GelSight, that uses a different optical tactile principle than the TacTip [9], [10]. That said, recent application of CNNs to a single TacTip has found highly robust performance for edge perception and contour following tasks [11], indicating the promise of deep learning for robot hands integrated with TacTip tactile sensors.
This study makes several contributions. We further prior work on integrating optical tactile sensors with 3d-printed robot hands by increasing the number of tactile fingertips from two to three and the degrees of freedom from two to four, miniaturising the tactile sensor to a size suitable for this hand, and improving sensor performance with a better camera (e.g. raising the frame rate to up to 120 frames per second; each sensor also has its own GPU that can pre-process tactile data and even run TensorFlow Lite models). We demonstrate effective performance with an autonomous grasping platform comprising an arm-mounted tactile hand with depth camera for initial pose estimation, which is used to test the hand on a standard (YCB) object set. Finally, we apply deep learning to the tactile data gathered from the autonomous platform, focussing on two representative tasks: object identification and grasp success prediction using only the tactile data. Overall, the T-MO's combined capability at grasping and perception from tactile sensing sets a new level of capability in the field of robot hands.
[Figure: fingers labelled (1), (2) and (3); fingers (2) and (3) are mechanically coupled and thus constitute a single degree of freedom.]

II. BACKGROUND AND RELATED WORK
Underactuated robot hands greatly simplify the control systems required to grasp objects. For example the IIT-Pisa SoftHand contains five fingers and nineteen joints yet uses only a single actuator to close the hand into a grasping configuration [12]. The hand was able to grasp over 100 objects because its morphology could adapt to each object. Further work has involved development of a tactile human-SoftHand interface for teleoperation [13] and performing safe human-to-robot handover tasks [14] which demonstrates the versatility of the underactuated system.
Another 'highly biomimetic and anthropomorphic' underactuated soft hand more closely followed human hand anatomy [15], but with ten motors to control the 27 degrees of freedom. Teleoperation was used to test the grasping capability on 31 different objects, with an additional test of in-hand manipulation with a whiteboard eraser demonstrated.
Another type of soft hand uses pneumatics, such as the RBO Hand 2 [16] which uses seven inflatable PneuFlex actuators [17]. This design gives a highly dexterous hand, performing all but one of the poses in the Kapandji test [18] for evaluating human hand dexterity.
The underactuated hand modified in this paper is the OpenHand Model O, an open-source version of the i-HY hand [4] that won the ARM-H track of the DARPA manipulation challenge [19]. The Model O is a 3d-printed, three-fingered, underactuated hand that is one of several hands developed by the GRAB Lab's OpenHand project. Related hands include the Model M2 gripper [20] and Model T hand [21], which are both underactuated, 3d-printed hands with one and four fingers respectively and a single degree of freedom.
Several studies have integrated optical tactile sensors onto two-fingered robotic grippers. Two GelSight optical tactile sensors were integrated onto a two finger parallel gripper and used for slip detection [22]. A slimmer version of the GelSight, the GelSlim, reflects the light signal down the finger to a camera module at its base [23]. More recent papers have continued using the GelSight for tactile object recognition [24] and surface-normal estimation [25].
The majority of work with tactile robot hands has involved solid-state tactile sensors, and has been reviewed extensively elsewhere [3], [2]. The most widely-used platforms include the anthropomorphic iCub hand with integrated capacitive sensors [26], the Shadow hand with BioTac sensors placed at the fingertips [27] and with fabric and capacitive tactile sensors [28], the 4-fingered Allegro hand with PPS capacitive sensors [29] and with BioTac sensors [30] and the TWENDY-ONE hand with embedded force-torque sensors and a tactile skin [31]; this last hand was the first to use deep learning for object recognition. In addition, the i-HY hand used in this study has been integrated with MEMS barometric TakkTile sensors, to give a basic array of 4 pressure-sensitive taxels on each fingertip [4].
Testing of tactile enabled hands has largely taken the form of collecting a dataset of grasps on various objects and using various machine learning methods to try to distinguish between them. Spiers et al. (2016) use 11 objects from the YCB object set using data from an array of barometric pressure sensors attached to a two-fingered hand and classify them using random forests [32]. They obtained a validation accuracy of 94% when the objects' orientations were unconstrained. Flintoff et al. (2018) also use random forests on a two-fingered hand containing barometric sensors and the Google Soli radar sensor to classify 26 objects, obtaining 99% validation accuracy [33]. Schmitz et al. (2014) identify a set of 20 objects with an accuracy of 88% using a deep neural network on the four-fingered TWENDY-ONE hand [31], [34].
Prior work with the TacTip has included integration onto existing robotic hands. Ward-Cherrier et al. (2016) used the M2 gripper, a two-fingered, two-DoF hand, also from the OpenHand project [20], and rolled cylinders across the surface using only tactile feedback to test the dexterity of the M2 [7]. Subsequently, Ward-Cherrier et al. integrated the TacTip onto a GR2, two-fingered gripper [35], also rolling cylinders along a trajectory in the hand's workspace over the TacTip's surface [8].

III. TACTILE SENSOR
This study involves the integration of a soft optical biomimetic tactile sensor in the TacTip family [6] onto a OpenHand Model O, an underactuated three fingered hand [4]. The TacTip was first presented by Chorley et al. (2009) as a flexible moulded urethane hemisphere filled with a clear, compliant, polymer blend sealed with a clear acrylic lens [5]. This design allows the sensor to deform around objects and regain its shape post-contact. Later work on the TacTip family of sensors progressed the design to a 3d-printed skin and base [6], enabling integration into robotic hands.
The principal design aspect of the TacTip family is an array of protruding pins arranged inside the sensor surface, which mimic the dermal papillae and Merkel cell complexes present in the glabrous (non-hairy) skin of primate fingertips. The epidermis protrudes downwards into the dermis, resulting in a series of ridges called the intermediate ridges, and the Merkel cells sit at the base of these protrusions. The Merkel cells are densely concentrated in human glabrous skin and are sensitive to static forces [1].
Recent iterations of the TacTip retain the fundamental principles of the sensor but utilise 3d-printing to facilitate fast manufacture and rapid prototyping of new designs. The TacTip surface now consists of a rubber-like 3d-printed surface made from Tango Black+ and supported at its base by a hollow cylinder made from Verro White plastic. The sensor tip is filled with silicone (RTV27905) and sealed with a clear acrylic lens as before.
The pins are printed on the inside of the hemisphere using Tango Black+ and a small tip of Verro White is printed on top of the Tango Black+ 'rods' to make pin detection easier to visualize from a camera. The entire TacTip exterior is printed as a single unit, using both materials, which has allowed complicated pin structures to be tested that would be very difficult to manufacture with other fabrication methods [36]. The sensor used here has 30 pins laid out in 3 columns of 10. The pins have a diameter of 1.2 mm and the pin centres are 3 mm apart.
White LEDs are used to give uniform illumination inside the sensor and reduce the effects of any external light that bleeds through the Tango Black+ skin or Verro White base (both of which are not perfectly opaque). A camera is mounted on the base of the sensor to view the pins as the sensor deforms, with each captured frame first processed in Python using OpenCV. Either the pin positions or the raw camera images can be used as the tactile information, for example as tactile inputs into supervised learning methods for regression or classification over labelled data.

IV. ROBOTIC HAND
The hand chosen for this work was the Model O, developed by the OpenHand project at Yale [37]. The Model O is based on the i-HY hand first presented by Odhner et al. (2014) [4]. The Model O is a three-fingered under-actuated hand with four DoF which, along with other OpenHand designs, has demonstrated great success in grasping by leveraging the morphology of its design [4], [21]. The Model O is mostly 3D-printed (using ABS plastic) and completely open source. This makes  it ideal for this study as manufacture and modification of the design for integration of the tactile sensor is straightforward. The three fingers are almost identical in that they contain two joints and a single degree of freedom, making each underactuated in the same way. A braided polyethylene wire 'tendon' runs the length of the finger and is connected to a Dynamixel MX-28T motor in the base of the hand. This allows the fingers to deform around objects for grasping without the object shape being known by the controller. Springs in both joints cause the finger to passively release when the active force from the motor is removed.
The only difference between the fingers is that one, the 'thumb', is fixed to the palm of the hand, whereas the other two fingers can rotate through 90 degrees from facing the thumb to facing each other (Fig. 1). The rotation of these two fingers is mechanically coupled and therefore constitutes only a single degree of freedom.

V. MODIFICATION OF FINGER DESIGN FOR SENSOR INTEGRATION
Each finger of the Model O has two joints and two phalanges with pads constructed from Vytaflex 30 to give a high friction surface suitable for grasping. We have replaced the distal phalanx of each finger with a tactile sensor such that three sensors are used in total (Fig. 2). This keeps the complexity of the modification low but still involves collecting data from more TacTips than has been tried before. Previously, the TacTip had a hemispherical shape with 127 pins [8], [36], [38]. However, for integration onto the Model O the sensor was modified to closer reflect the shape of the distal finger phalanx of the Model O (Fig. 3). This reduction in size and change of shape (40 mm diameter hemisphere to 20 × 10 mm rectangle) presents two challenges.
The first challenge is that the pin layout needs modifying to be consistent with the shape of the distal phalanx. The number of pins in the sensor is reduced to 30, arranged in three rows of 10, to ensure good coverage of the interior surface whilst also having sufficient separation of the pin tips to be easily distinguishable (Fig. 4). Four LEDs are mounted above the sensor lens in two strips along the major axis of the finger (previously, six were used in a ring).
The second alteration is to the camera system: the small size of the distal phalanx means that, to allow the camera to be attached directly above the sensor, the form factor of the camera board must be small and the lens must have both a wide field-of-view and a small focal length. These requirements led to us using a 90° non-distortion lens connected to a JeVois machine vision camera system [39]. The camera module is connected to the finger (Fig. 5) and a ribbon cable connects it to the JeVois board which is housed in the 'base' of the hand (Fig. 6). The JeVois is able to capture frames and process them using OpenCV at a variety of frame rates and resolutions ranging from 1280 × 1024 (15 FPS) to 176 × 144 (120 FPS). TensorFlow Lite models can even be loaded directly onto the JeVois, enabling processing by pre-trained deep networks within the hand itself.
This combination is low cost and allows some of the processing load to be transferred from the control PC to within the hand itself. We use a single JeVois module for each of the three TacTips, which necessitated enlarging the base of the Model O. For this study, the base of the hand has a connector unit, for mounting on a six degree-of-freedom robotic arm (UR5, Universal Robots), described in Section VI on the Autonomous Grasping Platform, below.
The camera is held in place by screwing a cap over the top of the board. The ribbon cable emerges from the rear of this cap and feeds directly into the base of the hand where the JeVois processors are housed. The TacTip slots into the phalanx and is held by three screws. This makes replacing the sensor skin very simple when breakages inevitably happen. Overall, the integration of the tactile sensors makes the fingertips heavier, the effect of which is compensated by using stronger springs to hold the fingers in a fully relaxed position when the hand is held with the palm facing down.
Whilst the new fingers are roughly the same width and length as those of the original Model O, they are deeper because the use of a camera with a 90° field-of-view means that the lens must be at least 20 mm away from the TacTip to view the entire 40 mm surface. This makes the fingertips significantly thicker than those of the Model O, which has the effect of removing the ability to slide the fingertips under objects to initiate grasps. For this reason, we expect that the larger size of the fingers presented here will have an impact on the ability of the hand to pick up very small flat objects, but otherwise the functionality of the hand should be similar.
As mentioned previously, either the pin positions or raw camera images can be used as tactile data. The images from each camera are captured and processed on its JeVois processor and sent to the control PC. For pin positions, a script on the JeVois is launched remotely that detects the pins (via the Python OpenCV function 'SimpleBlobDetector') and sends them over a serial connection. For raw images, Python OpenCV on the control PC can detect each JeVois as a video source and capture frames directly.

VI. AUTONOMOUS GRASPING PLATFORM
To facilitate testing and data collection, we have built an automated grasping platform for the hand (Fig. 7). The T-MO is mounted as the end effector on a UR5 robot arm (Universal Robots, Denmark), controlled by the PC used to capture and store the tactile data. In addition, a Kinect 2 RGBD camera (Microsoft) is mounted above the workspace to provide a depth image of objects, which we process to give putative pose estimates of use for grasping. A tray is located in front of the base of the arm into which a user can place an object and give the control PC an object label (currently the only human input into this system). This tray has steep sides to prevent objects falling off the platform and a soft (styrofoam) cladding to protect dropped objects.
This platform uses the Kinect depth images to estimate the pose of objects, which guides a simple planner for the arm-hand system to attempt to grasp and lift test objects. As the object covers only a small part of the tray, we obtain the approximate depth of the tray by averaging the depth across an entire image. The 2d extent of the object is then estimated from the region of the depth image below a threshold 10 mm closer than the surface distance. This region is used to find the centre of mass (x, y) and pose angle from the image moments, corresponding to the major and minor axes of the best-fit ellipse. A z coordinate is also estimated from the minimum object depth to guide the initial vertical hand placement before grasping at 20 mm above the object.
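The pose estimate described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used on the platform: the function name `estimate_object_pose`, the dictionary keys, and the use of pixel units for the centroid are all hypothetical, and depth values are assumed to be in millimetres with the camera looking straight down.

```python
import numpy as np

def estimate_object_pose(depth_mm, threshold_mm=10.0, hover_mm=20.0):
    """Estimate centroid, orientation and hover depth from a top-down depth image."""
    # Approximate the tray depth by averaging over the entire image
    tray_depth = float(depth_mm.mean())
    # Object pixels are at least threshold_mm closer to the camera than the tray
    mask = depth_mm < (tray_depth - threshold_mm)
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    # Second-order central moments of the object region
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    # Orientation of the major axis of the best-fit ellipse
    angle = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    # Eigenvalues of the moment matrix give the squared axis lengths (up to scale)
    spread = np.sqrt(4.0 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam_major = 0.5 * (mu20 + mu02 + spread)
    lam_minor = 0.5 * (mu20 + mu02 - spread)
    axis_ratio = float(np.sqrt(lam_major / max(lam_minor, 1e-9)))
    # Highest point of the object is the minimum depth; hover 20 mm above it
    z_top = float(depth_mm[mask].min())
    return {"cx": float(cx), "cy": float(cy), "angle": float(angle),
            "axis_ratio": axis_ratio, "z_hover": z_top - hover_mm}

# Synthetic example: a 40 x 10 pixel box, 50 mm tall, on a tray 900 mm from the camera
depth = np.full((100, 100), 900.0)
depth[45:55, 30:70] = 850.0
pose = estimate_object_pose(depth)
```

Because the tray depth is averaged over the whole image, the object itself biases the estimate slightly, which is acceptable only while the object covers a small part of the tray.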
The transformation from the camera frame to the robot frame uses a constant translation $(a_x, a_y, a_z)$ found from an initial calibration (here $a_x = 643$ mm, $a_y = 0$ and $a_z = 907$ mm) in which the hand is guided onto a known position. For this transformation to hold, the Kinect has been placed pointing vertically downwards with orientation aligned to the x and y axes of the UR5 robot arm.
Further control of the hand is available in the rotation of the finger joints. Fingers 2 and 3 can be jointly rotated from 0° (cylindrical grasp), through 45° (spherical grasp), to 90° (opposed grasp). In Section VII, the grasp most likely to succeed is manually selected from these distinct cases depending on the geometry of the object to be grasped (e.g. Fig. 8). In Sections X and XI we define an automatic procedure to choose the rotation of the finger joints, using the ratio of major to minor axis of the object outline detected in the Kinect depth image. If an object has a 1 : 1 axis ratio it is assumed that a spherical grasp (45°) will be best. Similarly, if an object has an aspect ratio of 1 : 3 or higher it is assumed a cylindrical grasp (0°) will be best. Overall, we found that this mapping of aspect ratio to [0°, 45°] joint angle worked best for the subset of objects used in this project.
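The automatic choice of finger rotation can be illustrated with a short sketch. The text fixes only the two end points (1 : 1 gives 45°, 1 : 3 or higher gives 0°); the linear interpolation between them used here is an assumption, as is the function name `finger_rotation_deg`.

```python
def finger_rotation_deg(major_axis, minor_axis, max_ratio=3.0):
    """Map an object's axis ratio to a finger rotation in [0, 45] degrees."""
    if major_axis <= 0 or minor_axis <= 0:
        raise ValueError("axis lengths must be positive")
    # Ratio of the longer to the shorter axis, so the result is always >= 1
    ratio = max(major_axis, minor_axis) / min(major_axis, minor_axis)
    # Ratios of 1:3 or higher all map to the cylindrical grasp
    ratio = min(ratio, max_ratio)
    # Assumed linear interpolation: ratio 1 -> 45 deg (spherical), ratio 3 -> 0 deg (cylindrical)
    return 45.0 * (max_ratio - ratio) / (max_ratio - 1.0)
```

For example, a round object with equal axes returns 45.0, while an elongated object such as a marker pen returns 0.0.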
The tactile sensors have an immediate use within this autonomous platform, as they indicate both grasp success and also which fingers contact a held object. After the hand has grasped and attempted to lift an object, the tactile images from the three fingers at the peak of the raising movement are compared to initial non-deformed reference images prior to grasping. For a measure of contact, we use the Structural Similarity Index Metric (SSIM) between each tactile image and its non-deformed reference [40]:
$$\mathrm{SSIM}(u, v) = \frac{(2\mu_u \mu_v + c_1)(2\sigma_{uv} + c_2)}{(\mu_u^2 + \mu_v^2 + c_1)(\sigma_u^2 + \sigma_v^2 + c_2)},$$
where $u$ and $v$ represent a window of $N \times N$ pixels (here $N = 7$) within the two images to be compared, $\mu$ and $\sigma$ represent the mean and covariance of the windows given, and $c_1$ and $c_2$ are regularizing constants defined to stabilize the division. The regularizing constants are calculated with $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ where $k_1 = 0.01$, $k_2 = 0.03$ and $L$ is the dynamic range of the pixel values, in this case 255. A single numeric similarity value for each sensor is calculated by averaging this metric over a sliding window taken across the entire image. If these values are below a predefined threshold (here 0.96) for two or more of the sensors then the grasp is considered a success. For more robust detection, this process is repeated over 20 frames and the mode SSIM is taken as a final prediction. The SSIM metric is chosen because we expect it to be robust to lighting changes likely to occur during movement of the hand (which can affect the tactile image). Testing showed that this was a reliable method and produced minimal false positive or negative results. The full method for data collection is given in Algorithm 1.
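The SSIM-based contact check can be sketched with NumPy as below. This is an illustration under assumptions rather than the code run on the platform: it uses a uniform (unweighted) window with population statistics, the function names are hypothetical, and the "mode over 20 frames" is interpreted here as a majority vote over per-frame success decisions.

```python
from collections import Counter

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mean_ssim(u, v, n=7, k1=0.01, k2=0.03, dynamic_range=255.0):
    """Mean SSIM between two greyscale images using n x n sliding windows."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    wu = sliding_window_view(u, (n, n))
    wv = sliding_window_view(v, (n, n))
    # Per-window means, variances and covariance
    mu_u = wu.mean(axis=(-2, -1))
    mu_v = wv.mean(axis=(-2, -1))
    var_u = wu.var(axis=(-2, -1))
    var_v = wv.var(axis=(-2, -1))
    cov_uv = (wu * wv).mean(axis=(-2, -1)) - mu_u * mu_v
    ssim_map = ((2 * mu_u * mu_v + c1) * (2 * cov_uv + c2)) / (
        (mu_u ** 2 + mu_v ** 2 + c1) * (var_u + var_v + c2))
    return float(ssim_map.mean())

def grasp_succeeded(references, images, threshold=0.96, min_sensors=2):
    """A grasp counts as successful if enough sensors differ from their reference."""
    in_contact = [mean_ssim(r, i) < threshold for r, i in zip(references, images)]
    return sum(in_contact) >= min_sensors

def modal_prediction(per_frame_predictions):
    """Assumed interpretation of the 20-frame mode: majority vote over frames."""
    return Counter(per_frame_predictions).most_common(1)[0][0]
```

In use, `grasp_succeeded` would be called once per frame with the three reference images and the three current tactile images, and `modal_prediction` applied to the resulting list of 20 booleans.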

VII. TEST I: GRASPING VALIDATION
The Yale-CMU-Berkeley (YCB) object set was created to provide a standardised benchmark for robotic manipulation research. The set consists of 77 objects which can be divided into 5 categories including food items and tools [41]. Benchmarking protocols are provided to give a comparison between different platforms to quantify and encourage progress in the field of robot grasping and manipulation research.
Prior to using a scoring metric, which allows for direct comparison with prior research, we tested which objects in the YCB object set the T-MO was able to grasp. The T-MO successfully grasped the majority of the objects in the set, struggling only on the larger, heavier objects and the smaller flat ones (successful/unsuccessful objects shown in Fig. 9).
The i-HY hand, upon which this hand is originally based, demonstrated its grasping capabilities with a benchmark involving 10 unique objects where all objects were grasped 20 times [4]. Each grasp was performed in an automated manner by detecting the approximate centre of mass of an object using a depth sensor (a Microsoft Kinect, as used here) and an alignment performed to the major axis of the detected object.  For a direct comparison to the i-HY hand, the same experiment has been performed here with our newly modified hand, using the method and platform described above in Section VI. As the set of objects used to validate the i-HY was not standardised, it is difficult to recreate the experiment exactly, so instead a set of similar objects has been used. We found that due to the thickness of the redesigned distal phalanges the technique of driving the hand into the table in order to perform a power grasp was no longer possible. Similarly the compliance of the tactile sensors results in pinch grasps being more subject to failure when given high torsional forces. In combination, this resulted in reduced performance for the Hammer, File and Pen, where the hand was unable to successfully perform a grasp. However, the remaining 7 objects had similar results to the i-HY tests, with at least 18 out of 20 trialled grasps being successful.
To further validate the grasping capabilities of the T-MO hand, a standardised benchmark, the Gripper Assessment Benchmark, provided by Calli et al. (2015) is employed [41]. This test is performed using the platform and setup described in Section VI. However, we forgo use of the depth sensor and instead use a fixed position as described in the benchmarking guidelines. Whilst, to the best of our knowledge, there is no available score on this benchmark for the original Model O hand, this will allow for a direct comparison to other available grippers. Overall, the total score achieved for this benchmark is 202.5 out of 404 and the complete scores for this benchmark are available in Table I.
An extension to the standard Gripper Assessment Benchmark was proposed by Jamone et al. (2016) to include three more categories of objects: Cubic, Cylindrical and Complex [42]. The T-MO scores for this extended benchmark are available in Table II, with total score 176 out of 208.
Overall we found that performance of this T-MO hand is competitive with other hand designs, surpassing the iCub (173/404) and Model T (122/404) Basic Gripper Assessment scores. The T-MO also surpassed the iCub score (164/208) in the Extended Benchmark. Flat objects prove a significant challenge for this hand and result in a drop in overall score. Whilst it would be possible to modify this hand with a nail design to help grasp flat objects, this would require significant modification of the tactile sensing part of the design for the sensors to remain effective, which we discuss further in a section on limitations and future work.
Other GRAB Lab robotic hands do achieve better scores on the standard benchmark. These include the Model B, Model T42 and Model S which score 270, 379 and 402.5 out of 404 respectively. None of these hands have yet been equipped with tactile sensing. To the best of our knowledge, the T-MO is the highest scoring hand with tactile capabilities.

VIII. TACTILE DATA
There are two main methods of processing tactile data captured with the TacTip sensor. The first is to use image processing techniques to track the movement of pins while the hand performs an action and use the pin positions as the tactile output [6], [43], as in previous work integrating the TacTip into 2-fingered grippers [7], [8]. The second technique is to capture the raw camera image and feed this into a neural network with minimal pre-processing.
Recent research has shown that this second technique gives improved results in a contour following task, particularly when concerned with robustness in an online setting [11]. Thus, in the present paper, the main approach will be to use raw images and neural networks; the hand has, however, been designed to accommodate both approaches.
Tactile image data is collected as a video (20 FPS) while the hand performs a grasping motion. Whilst the platform allows for autonomous data collection, it still requires a significant investment of human time and effort for large datasets. Efficient use of the collected data is thus a priority.
One of the methods used to achieve efficiency is to sample multiple sequences of frames from each grasping video. To estimate the appropriate frame, we calculate the absolute pixel difference for each frame in a tactile video when compared with the first frame in the same video [11]. This gives a basic measure of sensor deformation. From this method, it is possible to find the frame that corresponds with approximately 25% deformation of the sensor, which we use to estimate when contact has been made with an object. We then select a sequence of 8 frames by taking every 10th frame after the initial contact frame. Repeating this with a single positive offset in the original frame number provides additional sets from each tactile video. In total, we take 10 sets of 8-frame sequences for each grasp to increase the size of the dataset.
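The sampling scheme above can be sketched as follows. Two points are assumptions rather than details from the text: "25% deformation" is taken to mean 25% of the peak summed absolute pixel difference within the video, and sequences that would run past the end of the video are simply dropped. The function names are hypothetical.

```python
import numpy as np

def contact_frame(video, fraction=0.25):
    """Index of the first frame whose deformation reaches `fraction` of the peak.

    video: array of shape (T, H, W) holding greyscale tactile frames.
    """
    frames = np.asarray(video, dtype=np.float64)
    # Summed absolute pixel difference of every frame against the first frame
    deformation = np.abs(frames - frames[0]).sum(axis=(1, 2))
    peak = deformation.max()
    if peak == 0:
        return 0
    return int(np.argmax(deformation >= fraction * peak))

def sample_sequences(n_frames, start, seq_len=8, stride=10, n_sets=10):
    """Frame indices for n_sets sequences, each shifted by one frame from the last."""
    sequences = []
    for offset in range(n_sets):
        indices = [start + offset + k * stride for k in range(seq_len)]
        # Keep only sequences that fit inside the video
        if indices[-1] < n_frames:
            sequences.append(indices)
    return sequences

# Synthetic video whose deformation grows linearly with the frame number
video = np.stack([np.full((4, 4), t) for t in range(10)])
start = contact_frame(video)
```

With a contact frame at index 50 in a 200-frame video, `sample_sequences(200, 50)` yields ten sequences, the first being frames 50, 60, ..., 120 and the last 59, 69, ..., 129.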
Before being passed to the network, some pre-processing is applied to the data, as follows:
1) Image cropping from the captured resolution of 320 × 240 down to 160 × 220 resolution to remove outer pixels that contain little or no tactile data.
2) Image downsampling to 40 × 60 resolution to retain enough of the information for the specific task while significantly reducing network size.
3) Image concatenation horizontally over the three tactile sensors to give a 120 × 60 pixel image.
Some example images are shown in Figure 10, taken from the objects used to test item classification (Section X). These images are the final frame taken from the extracted tactile sequences, so the sensor should be near its maximal deformation for that grasp.
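The three pre-processing steps listed above can be sketched in NumPy as below. The crop position (centred) and the downsampling method (nearest-neighbour index sampling) are assumptions made to keep the example dependency-free; the text does not state either, and an interpolating resize may well have been used in practice. Function names are hypothetical.

```python
import numpy as np

def preprocess_frame(frame, crop_w=160, crop_h=220, out_w=40, out_h=60):
    """Crop a 320 x 240 (width x height) frame to 160 x 220, then downsample to 40 x 60."""
    h, w = frame.shape[:2]
    # Assumed centred crop to discard outer pixels
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    cropped = frame[top:top + crop_h, left:left + crop_w]
    # Assumed nearest-neighbour downsampling by picking evenly spaced rows and columns
    rows = np.arange(out_h) * crop_h // out_h
    cols = np.arange(out_w) * crop_w // out_w
    return cropped[np.ix_(rows, cols)]

def tactile_input(frames):
    """Concatenate the three processed sensor images side by side (120 x 60)."""
    return np.hstack([preprocess_frame(f) for f in frames])

# Three dummy sensor frames, stored as (height, width) = (240, 320) arrays
frames = [np.zeros((240, 320), dtype=np.uint8) for _ in range(3)]
combined = tactile_input(frames)
```

Note that NumPy stores images as (height, width), so the 120 × 60 pixel result has array shape (60, 120).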

IX. COMMON NEURAL NETWORK APPROACH FOR CLASSIFYING TACTILE DATA
As we are considering a sequence of tactile images, a 3D Convolutional Neural Network (CNN) is appropriate to capture both spatial and temporal information. Then the network may learn not only the geometric properties of objects being grasped but also physical properties such as compliance during the grasping process. The network architecture chosen for this task ( Figure 11) is a standard combination of convolutional layers passing forward to a fully-connected output stage.
Several methods of image augmentation are employed during training to reduce overfitting. These include random cropping between 6% and 2% of width and height of an image respectively, random zooming up to an increase of 2% and additive Gaussian noise with variance $\sigma^2_{\mathrm{noise}} = 10^{-4}$. Random brightness and contrast adjustments are applied using clip(α × pixel + β, 0, 255) where the limits for α and β are randomly selected from the ranges [0.3, 1] and [−50, 50] respectively. These are all individually applied per frame and per sensor to best match the possible environments experienced during testing.
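The brightness, contrast and noise augmentations can be sketched as follows; random cropping and zooming are omitted for brevity. Several details are assumptions: α and β are drawn uniformly from the stated ranges, the noise variance of 10^-4 is taken to apply to pixel values rescaled to [0, 1], and the order of operations is a guess. The function name is hypothetical.

```python
import numpy as np

def augment_frame(frame, rng, noise_var=1e-4):
    """Apply random brightness/contrast and additive Gaussian noise to one frame."""
    # Contrast gain and brightness offset, drawn per frame
    alpha = rng.uniform(0.3, 1.0)
    beta = rng.uniform(-50.0, 50.0)
    adjusted = np.clip(alpha * frame.astype(np.float64) + beta, 0.0, 255.0)
    # Assumed: rescale to [0, 1] before adding noise with variance noise_var
    scaled = adjusted / 255.0
    noisy = scaled + rng.normal(0.0, np.sqrt(noise_var), size=scaled.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
frame = np.full((60, 40), 128, dtype=np.uint8)
augmented = augment_frame(frame, rng)
```

To match the per-frame, per-sensor application described above, this function would be called separately on each sensor image of each frame before the three images are concatenated.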
As a regularization technique, we use a dropout of 0.5 on the final fully connected layers. Batch normalization, early stopping and learning rate decay are all used to improve performance. Patience values of 5 and 15 epochs are used for decaying the learning rate on a plateau and early stopping respectively. A factor of 0.25 is the value used for decaying the learning rate. The loss function used is soft-max cross entropy and the optimizer used is ADAM [44]; initialisation parameters such as learning rate are given in the subsequent sections.
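The interaction between the two patience values can be illustrated with a small framework-independent simulation. This is a sketch of the general reduce-on-plateau and early-stopping logic, not the training code itself; the function name and the initial learning rate used in the example are hypothetical.

```python
def simulate_schedule(val_losses, initial_lr, factor=0.25,
                      lr_patience=5, stop_patience=15):
    """Return the learning rate used at each epoch until early stopping triggers."""
    best = float("inf")
    epochs_since_best = 0       # drives early stopping
    epochs_since_lr_event = 0   # drives learning rate decay
    lr = initial_lr
    lr_history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            epochs_since_best = 0
            epochs_since_lr_event = 0
        else:
            epochs_since_best += 1
            epochs_since_lr_event += 1
        # Decay the learning rate after lr_patience epochs without improvement
        if epochs_since_lr_event >= lr_patience:
            lr *= factor
            epochs_since_lr_event = 0
        lr_history.append(lr)
        # Stop after stop_patience epochs without improvement
        if epochs_since_best >= stop_patience:
            break
    return lr_history

# A validation loss that never improves after the first epoch
history = simulate_schedule([1.0] * 30, initial_lr=1e-3)
```

With these settings the learning rate is decayed three times (by a total factor of 0.25^3) before training halts, so a stalled run gets several chances to recover at lower learning rates.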
The collected data is separated into training and validation sets with a 70% to 30% split. It should be noted that this separation is performed on a per-video basis and not a per-sequence basis. This is done to avoid training and testing on sequences of frames from the same grasp video. A final verification is performed by testing the full system in an online setting by grasping the objects and making predictions on-the-fly, as described below.
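A per-video split can be sketched as follows: the video identifiers are shuffled and partitioned first, and each extracted sequence then follows its parent video. The function name, the fixed seed and the figure of four sequences per video are assumptions made for illustration.

```python
import random

def split_by_video(video_ids, train_fraction=0.7, seed=0):
    """Partition unique video ids so no video contributes to both sets."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return set(ids[:n_train]), set(ids[n_train:])

# Hypothetical dataset: each sequence is tagged as (video id, sequence index).
sequences = [(video, k) for video in range(520) for k in range(4)]
train_videos, val_videos = split_by_video(v for v, _ in sequences)

train_sequences = [s for s in sequences if s[0] in train_videos]
val_sequences = [s for s in sequences if s[0] in val_videos]
```

Splitting at the sequence level instead would let near-identical frames from one grasp appear on both sides of the split, which would inflate the validation accuracy.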

X. TEST II: TACTILE ITEM CLASSIFICATION
The purpose of our modification of the Model O design was to use tactile sensing for improving the hand's functionality. With this in mind, we first validate the tactile sensing quality by classifying objects using the tactile images alone.
For this test, we chose a subset of 26 items from the YCB object set (shown in Figure 12) that satisfy two properties: first, they must be consistently graspable with the hand and clearly detectable with the Microsoft Kinect depth sensor (which struggles with small or clear objects); second, for added difficulty, some of the objects have been chosen to share either global or local geometric similarities.
To grasp the objects, we rotate the fingers of the T-MO to [0°, 45°] using only the spherical and cylindrical grasps, as described in Section VI. The opposed grasp at 90° rotation is omitted because it works best with very small objects, which are not present in the objects chosen for this tactile task.
For each of the 26 objects, data was collected for 20 grasps, giving 520 unique grasp videos. The object position was varied for each grasp by randomly replacing the object after a grasp and dropping from a small height. An example of the tactile images obtained for each object used in this task is shown in Figure 10.
Using the methods described in the previous sections the data is processed and fed into a network that is trained to classify between these objects. The architecture of the network follows that previously described with the learning

A. Results for item classification
A confusion matrix showing results over the entire validation set is shown in Figure 13. The overall accuracy found over this set was 93%. The confusion matrix also highlights some misclassifications, which tend to be on objects that share some geometric similarities. For example, the pudding box and sugar box are both rectangular in shape and have depths of 32 mm and 37 mm respectively, giving an edge in the sensor images at about the same location.
The performance on round objects seems reasonable, with most misclassifications not unexpected; for example, depending on the grasp direction, the peach and pear objects are of similar size and curvature. The worst performance is for the golf ball, which we attribute to having fewer frames containing informative contact data because of its small size.
An online test was then performed to validate that this performance extended beyond the collected data. During this test, 4 grasps were performed on each of the 26 objects. As described in Section VIII, multiple tactile image sequences are extracted from the grasp video, which are fed into the trained network as a batch. The final prediction is the arg max of the mean prediction over this batch, in effect the most confident prediction over all sequences extracted from a single grasp.
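The arg max of the mean prediction can be illustrated with a toy batch. The probabilities below are invented for illustration (three sequences, three classes rather than 26), and the function name is hypothetical.

```python
import numpy as np

def predict_from_batch(batch_probs):
    """batch_probs: (n_sequences, n_classes) softmax outputs for a single grasp."""
    mean_probs = np.mean(batch_probs, axis=0)   # average over sequences
    return int(np.argmax(mean_probs)), mean_probs

# Toy softmax outputs for three sequences extracted from one grasp video.
batch_probs = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.5, 0.4, 0.1]])
label, mean_probs = predict_from_batch(batch_probs)
print(label)  # 1
```

In this example two of the three sequences individually favour class 0, yet the averaged prediction selects class 1 because one sequence is far more confident. This is the sense in which the mean acts as a confidence-weighted vote rather than a simple majority vote.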
The overall accuracy for this online test was 77%, with similar misclassifications to the offline validation described above. Considering there are 26 objects that can be inherently confused with each other, this is good performance but clearly poorer than the 93% accuracy obtained on the validation set. This drop in performance indicates that the training has not generalized fully to off-sample data, most likely because there is a large number of possible positions and orientations that objects can be in. Our expectation is that a larger training set would enable better determination of features for the objects that are able to generalise across different positions and orientations. The present data set is sufficient, however, to show reasonable online performance and indicate directions for future improvement.

XI. TEST III: GRASP SUCCESS PREDICTION
An important use of tactile sensing within a robot hand is to predict the future state of a grasp. Here we use supervised learning to predict from touch whether an established grasp will be successful when lifting an object. The data is labelled (success/failure) using the SSIM-based measure (Section VI) of whether the fingers are still in contact after a lift attempt.
It has been demonstrated that the T-MO retains the grasping performance of the original Model O for most of the test objects (Section VII). Out of the objects used for tactile item classification (Section X; Figure 12), the grasping success was 97.5% over 520 grasps, which does not give many examples of failed grasps. Thus, to collect a more balanced dataset with both failed and successful grasps on the same object, the tactile data was recollected while introducing random perturbations to the hand pose and grasp.
The perturbations to the hand pose comprised uniformly distributed random perturbations of [−20, 20] mm to the (x, y)-coordinates, [0, 20] mm to the z-coordinate and [−30°, 30°] axially. The rotation of the finger joint was also varied randomly within [0°, 45°], and the maximum torque applied to the finger joints was varied between 20% and 35% of the maximum instantaneous and static motor torque (compared with 30% as previously used), which also affected the speed of finger movement during a grasp. Overall, these perturbations reduced the grasp success to 80% of the 520 collected grasps, on the same object set as Test II (Figure 12).
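For concreteness, one way to draw a set of perturbations with the stated ranges is sketched below. The data class and its field names are hypothetical, and sampling the finger rotation and torque uniformly is an assumption; the text states only that they were varied randomly within these ranges.

```python
import random
from dataclasses import dataclass

@dataclass
class GraspPerturbation:
    dx_mm: float                 # offset along x, in [-20, 20] mm
    dy_mm: float                 # offset along y, in [-20, 20] mm
    dz_mm: float                 # offset along z, in [0, 20] mm
    axial_deg: float             # axial rotation, in [-30, 30] degrees
    finger_rotation_deg: float   # finger joint rotation, in [0, 45] degrees
    torque_fraction: float       # fraction of maximum motor torque, in [0.20, 0.35]

def sample_perturbation(rng):
    """Draw one randomly perturbed grasp configuration."""
    return GraspPerturbation(
        dx_mm=rng.uniform(-20.0, 20.0),
        dy_mm=rng.uniform(-20.0, 20.0),
        dz_mm=rng.uniform(0.0, 20.0),
        axial_deg=rng.uniform(-30.0, 30.0),
        finger_rotation_deg=rng.uniform(0.0, 45.0),
        torque_fraction=rng.uniform(0.20, 0.35),
    )

rng = random.Random(0)
perturbations = [sample_perturbation(rng) for _ in range(520)]  # one per grasp
```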
The network architecture for item classification (Figure 11) is used again, with only small hyperparameter changes: an increase in dropout from 0.5 to 0.75 and a decrease in learning rate from $10^{-4}$ to $10^{-6}$. These changes were made because the increased dropout rate helped reduce overfitting, which became more prevalent in this task, and the reduced learning rate helped to stabilise learning. The output layer is changed to a two-label classification problem, with each grasp predicted as a success or failure according to the mean output of the final softmax layer of the network; a grasp is predicted to succeed when the mean output for the successful grasp category is greater than 0.5.

A. Results for grasp success prediction
A confusion matrix for grasp success prediction is shown in Table III. On successful grasps, the classifier was almost always correct, with 98% true positives and only 2% false negatives, where the grasp was predicted to fail but actually succeeded. On unsuccessful grasps, 82% were correctly predicted as failures (true negatives), with 18% false positives, where the grasp was predicted to succeed but actually failed.
However, the baseline accuracy for this task is 80%, because 80% of all grasps were successful (20% unsuccessful). Overall, the predictor was correct (true positive or true negative) on 95% of the validation data, since the 18% error rate on unsuccessful grasps applies to only 20% of the data. The bias in the data, which contains fewer unsuccessful grasps, is also evident in the trained network: the majority of misclassifications are failed grasps that were predicted to succeed.
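The overall accuracy follows from weighting the per-class rates by the class proportions. The short calculation below reproduces that arithmetic from the rounded figures quoted in the text (the variable names are illustrative); because the inputs are rounded, the result is approximate.

```python
p_success = 0.80         # fraction of grasps that actually succeeded
correct_success = 0.98   # successful grasps predicted correctly
correct_failure = 0.82   # failed grasps predicted correctly

# Overall accuracy is the class-weighted sum of the per-class rates.
accuracy = p_success * correct_success + (1 - p_success) * correct_failure

# Share of all grasps falling into each type of error.
success_predicted_to_fail = p_success * (1 - correct_success)
failure_predicted_to_succeed = (1 - p_success) * (1 - correct_failure)

print(round(accuracy, 3))                      # 0.948, i.e. about 95%
print(round(success_predicted_to_fail, 3))     # 0.016
print(round(failure_predicted_to_succeed, 3))  # 0.036
```

Failed grasps that were predicted to succeed therefore account for roughly two thirds of all errors, even though failed grasps make up only a fifth of the data.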
Based on our observations of failure cases, we hypothesised that the motor torque variation was having the most effect on grasp success prediction, which makes sense intuitively because less firmly held objects will provide poorer tactile data. To test this hypothesis, we performed an additional experiment in which the maximum motor torque is varied over grasps from barely touching to firmly holding the object (from 11% to 35% of the maximum instantaneous and static motor torque). Each grasp has an increment of 2% in this torque. This test was performed over four objects having a uniform grasping surface (Baseball, Orange, Bleach, GumPot), to control for factors such as positions of edges and textures on more complex objects.
The results of this test indicate a strong trend between applied force and successful grasp prediction (Figure 14). Once the grasp has been established and the prediction has been made, the arm attempts to raise the object and a ground truth of whether the grasp was successful or not is taken. If the prediction matches the ground truth then that point is coloured blue in Figure 14; otherwise the point is coloured red. The overall accuracy achieved in this test was close to that on the validation data, with 92% of grasps being classified correctly. This separates into 75% accuracy for predicting unsuccessful grasps and 100% for successful grasps.
Overall, the network has learnt that a higher grasping force is more likely to be successful. For each object there is a sharp drop where the network switches from predicting successful to unsuccessful grasps (near normalized torque 0.19). Torques close to this boundary are where incorrect predictions are most common (red markers, Figure 14). High and low grasping forces tend to result in correct predictions, which we interpret as arising either from the tactile information being sufficiently good for an accurate prediction (high torque) or from the classifier recognizing that there was too little torque to make a successful grasp (low torque).

XII. SENSITIVITY ANALYSIS
For a last analysis, we investigated how the T-MO behaves when a reduced number of sensors are used, to give an improved understanding of the tactile information underlying the item classification (Test II, Section X) and grasp success prediction results (Test III, Section XI).
As expected, the best performance for both tests was when using all three tactile sensors (Figure 15). Also as expected, the performance dropped for all 3 combinations of 2 sensors (Figure 15, sensors 1,2 and 1,3 and 2,3) and dropped once again when using only 1 of the 3 tactile sensors (Figure 15, sensors 1, 2 or 3). However, the drop in performance from 3 to 2 sensors was relatively small, with a mean drop of 7% from 93% for Test II and 2% from 95% for Test III, revealing redundancy in the sensor readings. The performance drop was larger from three sensors to a single sensor (mean 18% for Test II and 5% for Test III), but perhaps not as large as anticipated.
In Test III, the highest single sensor accuracy for grasp success prediction is found using sensor 1. This is expected because this finger is needed to apply a force that opposes the other two fingers during all successful grasps, and therefore should carry the most relevant information to indicate a failing grasp. However, in Test II of item classification, the opposite trend is visible: sensor 1 scores the lowest accuracy for a single sensor. To explain this we observe that sensor 1 is aligned to the major axis of the object during grasping (Section VI), whereas the other two fingers are rotated depending on the object; therefore, there may be less variation for sensor 1 on the object surface than the other two sensors, and hence less information about object identity.
Whilst there is a significant reduction in performance when using only a single sensor, particularly on the item classification task, an interesting area for further study is how to combine the predictions from all three single-sensor networks. As described in the discussion, the use of single-sensor networks has significant hardware benefits, because it would enable individual predictions to be made using the embedded GPU functionality for each tactile sensor.
Therefore, for a final sensitivity test, we make a prediction voted by networks trained for each individual sensor (rather than having all three images fed into the same network). A prediction is made by taking the mean of all values predicted by the three networks, in effect giving the most confident prediction. When applying this technique to the validation data, the performance improves to 91% in the item classification task (from mean 75%), close to the 93% performance of a network trained directly on all three sensors at once. Similarly, the performance on the grasp success prediction task is 92% (from mean 90%) compared with 95% when combining all three tactile images at once, indicating the utility of individual tactile sensor predictions.
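The combination of single-sensor networks described above amounts to averaging their outputs before deciding. The sketch below shows this for both tasks using invented probabilities; the function names are hypothetical, and the 0.5 threshold for the two-class case follows the decision rule given for grasp success prediction in Section XI.

```python
import numpy as np

def combine_item_predictions(per_sensor_probs):
    """per_sensor_probs: (n_sensors, n_classes) softmax outputs, one row per network."""
    return int(np.argmax(np.mean(per_sensor_probs, axis=0)))

def combine_success_predictions(per_sensor_success_probs, threshold=0.5):
    """Predict success when the mean success probability exceeds the threshold."""
    return float(np.mean(per_sensor_success_probs)) > threshold

# Toy outputs from three single-sensor networks (three classes for brevity).
item_probs = np.array([[0.50, 0.40, 0.10],
                       [0.10, 0.80, 0.10],
                       [0.45, 0.35, 0.20]])
item_label = combine_item_predictions(item_probs)

success_probs = np.array([0.90, 0.40, 0.45])
grasp_ok = combine_success_predictions(success_probs)
```

In both toy cases the single confident network outweighs two less certain ones: the item label is class 1, and the grasp is predicted to succeed although two sensors individually fall below 0.5.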

Fig. 15: Performance in both the item classification and success prediction tasks whilst using tactile information from a subset of the available sensors. In both cases, best performance is achieved when using all three sensors; however, strong performance can also be achieved with only two active sensors.

A. Tactile fingertip morphology
Modifying the finger morphology to integrate tactile sensors within an already very capable grasping system (the Model O) was expected to result in impeded performance, as we have seen with a modest drop in grasping capability compared with the original i-HY hand (Section VII). Due to the thickness of the redesigned distal phalanges, the technique of driving the hand into the table in order to perform a power grasp was no longer possible. Even so, the tactile hand still managed to successfully grasp most objects it was presented with, as we summarize later in the discussion.
Future design improvements to the tactile fingertips will focus on a more compact design that would help grasp small objects with a 'fingernail' and larger objects by sliding under them to establish a power grasp. At present, to avoid distortion of the tactile image we use a 90° field-of-view lens, which constrains the camera to be mounted quite high above the sensing surface. One redesign strategy would be to adopt mirrors that enable the camera to be mounted nearer the joint, as in the GelSlim [23]. That said, camera technology is going through a stage of rapid miniaturization, and it may be that a combination of a smaller camera with a wide-angle lens would be a more effective solution. These remain open design questions that are subjects for future work.

B. Tactile sensor design
Here we use the same 3d-printed skin as in other tactile sensors within the TacTip family, having an array of pins spaced approximately 3 mm apart developed originally for a 40 mm diameter, domed tactile sensor [6]. This design choice was for consistency in the absence of a good reason to customize those aspects, unlike for example the overall shape of the sensing surface which required customization to fit onto the Model O fingertip. However, it does result in relatively few pins (a 10 × 3 array) compared with others in the TacTip family (127 pins in the domed sensor). Since these pins are used as sensing elements, this may limit the tactile sensor performance.
There are many design options that may improve tactile sensor performance: the pins could be made smaller, the skin thinner, the pin spacing could be non-uniform, the length or shape of the pins could also be changed, or raised 'fingerprints' used [45]. It is hard to intuit the effect of these choices, so the most obvious approach would be to make several versions and compare on a prescribed task, such as Test II (tactile item classification) or Test III (grasp success prediction). However, even though the platform is autonomous, it does require human supervision/intervention, and so this approach while feasible would be laborious. We anticipate the best tactile sensor design will be task dependent, and so likely different for the two tactile benchmarks considered here.

C. Neural network architecture
In this work, we use a fairly simple 3d convolutional neural network with 3 convolutional layers (Figure 11). This network was chosen because the patterns required for distinguishing between tactile images are relatively simple, particularly when compared to 1000-class image classification problems that warrant the use of much deeper architectures [46]. Increasing the complexity of the network would also increase the likelihood of over-fitting during training, which would be a problem here due to the expense of collecting large amounts of data. Collecting more training and validation data would allow exploration of more complicated networks, for example to improve the accuracy of the tactile predictions, but this was not the priority in this initial study. A particular area of interest would be to introduce recurrent neural networks for improved use of temporal information, rather than the 3d (2 space, 1 time) convolutions currently used.

D. Validation protocol
Throughout this paper, a standard validation protocol is used to generate results, typically with a 70%/30% training/validation data split. In addition to reporting results on the validation set, we use an online dataset to assess performance for both item classification (Test II) and grasp success prediction (Test III). For item classification, there was an appreciable (16%) drop in performance from the offline validation (93%) to the online test (77%). This performance drop is partly due to the validation being overly optimistic about the online scenario, where uncontrolled changes may occur over time that are not captured in training, such as trial-to-trial variation in tension of the springs and cables, movement of the cameras or skin within the tactile sensors, and other unanticipated changes. We also cannot rule out some over-fitting to the training set, although measures were taken to reduce this (Section IX).
We expect that a larger training set in which the objects are repeatedly presented to the hand in a random order will bring the online and offline results closer together. As mentioned in the previous section on the network architecture, there are other reasons to collect a larger training set, but this was not the priority in this initial study which is aimed at establishing the T-MO's credibility as a tactile grasping system.

E. Utilization of onboard GPU processing
Here we used a JeVois camera system that offers benefits not just in the small size of the camera module but also in having machine vision processing modules that can be embedded within the body of the hand. In principle, onboard processing could be performed via neural networks loaded on the modules using TensorFlow Lite compatibility. This was not utilized in this study, but would give a route to process and react to captured images without the need for a high-performance control PC. Alternatively, a dimensionally-reduced output, such as that from the convolutional layers, could be sent to reduce the processing requirements external to the hand. This would give a semi-autonomous robot hand the ability to interpret its tactile sense.
For an initial exploration of the potential of this onboard processing, in Section XII we examined whether item classification could be accurately performed on the hand. The camera processing modules are not directly connected to each other, so the tactile images from each sensor must be processed independently. We showed that when the mean prediction from three single-sensor networks is used (rather than one network processing data from all three sensors), the performance drops by just 2% to 91% for item classification (Test II) and 3% to 92% for grasp success prediction (Test III). This shows that deployment of the classifier directly on the JeVois would likely lead to only a small reduction in performance. That said, further work is needed on the model complexity to find whether the methods are viable to run on the JeVois without overloading its processors. Nevertheless, using all the available onboard processing would allow the T-MO to be deployed as an autonomous system.

XIV. DISCUSSION & CONCLUSION
We have presented a three-fingered tactile hand that comprises a GRAB Lab Model O modified to include three TacTip optical tactile sensors in its fingertips. Small camera modules mounted in the distal phalanx of each finger, coupled with the JeVois vision system housed in the 'palm' of the hand and lightweight TacTip sensors, provide an integrated and sophisticated tactile sense that complements the grasping functionality of the hand.
To evaluate the capabilities of the tactile Model O (T-MO), we benchmarked its grasping performance using the Gripper Assessment Benchmark on the YCB object set [41]. We then tested the tactile sensing capabilities with two experiments: firstly, tactile object classification on a subset of 26 objects that can be reliably grasped, and secondly, predicting whether a grasp will successfully lift one of these objects under randomly perturbed grasps that sometimes fail.
In the Gripper Assessment Benchmark we scored similarly (202.5/404) to the Model T and iCub hands (122/404 and 173/404) but were surpassed by other hands produced by the GRAB Lab [47]. This included a test where the i-HY hand (on which the Model O is based) could grasp all 20 objects and the T-MO could reliably grasp 17 items. The reason for this performance drop is the change to the finger morphology needed to integrate our tactile sensors within the distal phalanx, which can be improved in future design iterations of the hand (Section XIII-A).
It is unsurprising that altering the fingers to include tactile sensing would affect grasping performance. The question is rather: given the T-MO is a capable grasping system, what are the advantages of tactile sensing in this hand? To examine this question, we performed several experiments where the tactile data was essential to successful task performance.
When attempting an item classification task (Test II) to recognize objects by touch alone, we obtain 93% validation accuracy on the test dataset of 26 objects, when the object is randomly placed in the tray to cause variations in the grasp. This level of performance is comparable to other studies, such as [32] (94%) and Schmitz (2014) [31] (88%), but is surpassed by Flintoff (2018) [33] (99%) using a combination of radar and barometric sensors. That said, a direct comparison between studies is not possible because the hands and/or tactile sensors differed, along with the objects and experiment. In particular, we had the T-MO pick objects off a table, whereas the other studies used statically mounted hands to which objects were passed.
Our other main test of the T-MO was to predict whether a grasp would be able to successfully lift an object after grasping (Test III). From the tactile data obtained from the initial grasp, we obtain 95% accuracy of grasp success on the same 26 objects, but with their grasp poses perturbed randomly so that some grasps fail upon lifting. Calandra et al. (2017) perform a similar experiment and obtain 75.6% validation accuracy when using tactile information from two GelSight sensors [48]. This was performed on a larger dataset with more (and different) objects and the data was split such that an object was only in the training or test set, so again a direct comparison is tricky to make.
Overall, we have demonstrated that the tactile sensors embedded within a robot hand are able to accurately distinguish and categorize objects using only tactile information about the finger surface deformation as it contacts an object. This is beneficial as the ability to classify objects without vision is essential in many scenarios, such as clearing items from a cluttered bin and other objects/situations where a visual snapshot may not be a reliable indicator of the object properties.
To conclude, we have presented a 3d-printed three-fingered underactuated tactile robot hand, the Tactile Model O, which performs well at grasping, tactile object classification and tactile grasp success prediction. We believe this demonstrates that the T-MO is an effective platform for robot hand research, and expect it to open up a range of applications in autonomous object handling.