Design of an eye-in-hand sensing and servo control framework for harvesting robotics in dense vegetation

A modular software framework design that allows flexible implementation of eye-in-hand sensing and motion control for agricultural robotics in dense vegetation is reported. Harvesting robots in cultivars with dense vegetation require multiple viewpoints and on-line trajectory adjustments in order to reduce the number of false negatives and correct for fruit movement. In contrast to specialised software, the framework proposed aims to support a wide variety of agricultural use cases, hardware and extensions. A set of Robotic Operating System (ROS) nodes was created to ensure modularity and separation of concerns, implementing functionalities for application control, robot motion control, image acquisition, fruit detection, visual servo control and simultaneous localisation and mapping (SLAM) for monocular relative depth estimation and scene reconstruction. Coordination functionality was implemented by the application control node with a finite state machine. In order to provide visual servo control and simultaneous localisation and mapping functionalities, the off-the-shelf Visual Servoing Platform (ViSP) and Large-Scale Direct SLAM (LSD-SLAM) libraries were wrapped in ROS nodes. The capabilities of the framework are demonstrated by an example implementation for use with a sweet-pepper crop, combined with hardware consisting of a Baxter robot and a colour camera placed on its end-effector. Qualitative tests were performed under laboratory conditions using an artificial dense vegetation sweet-pepper crop. Results indicated the framework can be implemented for sensing and robot motion control in sweet-pepper using visual information from the end-effector. Future research to apply the framework to other use-cases and validate the performance of its components in servo applications under real greenhouse conditions is suggested.


Introduction
During the design of robotic applications for harvesting horticultural products, two key challenges need to be solved. The first is the detection of a target location of the fruit. The second is moving the end-effector towards that location with precision to perform a harvest action. There are several ways to address each of these challenges. For example, one approach solves fruit detection by using one or a few viewpoints from a sensing module located externally from the robot. However, in crops with a high vegetation density, using a low number of viewpoints results in false negatives due to a large number of occluding leaves and branches (Hemming, Ruizendaal, Hofstee, & Van Henten, 2014b). Furthermore, the external placement of the sensor(s) requires one or multiple frame transformations. Slight errors therein accumulate, resulting in inaccurate target coordinates. When a location of the target fruit is acquired, moving the end-effector there can be solved by executing a planned motion trajectory without additional sensing. However, dislocation of the target can occur as the robot enters and interacts with a dense crop. Both the frame transformation errors and dislocation of the target can result in poor end-effector placement at the target (Hemming et al., 2014a; Henten et al., 2003). An example implementation of external sensing and planned motion control was tested during the European 7th Framework Programme project Clever Robots for Crops (CROPS) (GA no. 246252). During this project, a proof-of-principle harvesting robot was created for a dense sweet-pepper crop. A sensing module separate from the robot provided fruit detection from a single viewpoint. Thereafter a motion trajectory was executed without further sensing. It was concluded that this approach was one of the causes of low harvest performance, both in cycle time as well as in fruit detection rates (Bac, 2015). Another example of a cucumber harvesting robot uses a similar approach (Van Henten et al., 2002), where a single viewpoint in the workspace of the robot provided fruit positions. In both the CROPS and the cucumber robot, additional sensing could be performed to refine fruit positions with a second set of cameras on the end-effector. However, in both field tests this feature was not used. This extra single sensing step before the final motion execution is also known as look-and-move (Hutchinson, Hager, & Corke, 1996). When a camera is attached to the end-effector, it is often named an eye-in-hand sensor (Hutchinson et al., 1996). For a strawberry harvesting robot, a similar eye-in-hand look-and-move approach was used (Hayashi et al., 2010).
A different approach is to solve fruit detection and motion control using primarily eye-in-hand sensing. External sensors are not necessarily excluded in this paradigm, though the application is not dependent on this additional secondary sensing source. For fruit detection, the internal location of the sensor(s) reduces the number of coordinate frame transformations to a single one. Moreover, it allows the application to sense the scene from multiple viewpoints through pose changes of the end-effector, which is expected to decrease the number of false negative detections in a dense crop. For the motion towards the target, this approach allows for continuous incremental visual feedback and corrections, also known as visual servo control (Hutchinson et al., 1996; Marchand, Spindler, & Chaumette, 2005). Examples of robotic harvesters in horticulture using visual servo control are numerous. For a sweet-pepper harvesting robot in Japan, a visual servo control algorithm positioned the end-effector near the fruit using stereo images (Kitamura & Oka, 2005). Although the camera was not part of the end-effector, it was placed within its workspace and aligned with the optical axis. In a strawberry harvesting robot, a set of external sensors first provided a rough fruit position, after which an eye-in-hand system moved towards it using visual servo control (Han et al., 2012). For applications of an apple harvesting robot (De-An, Jidong, Wei, Ying, & Yu, 2011) and a citrus harvesting robot (Mehta & Burks, 2014), eye-in-hand visual servo control systems were created. Another application for an apple harvesting robot also applied eye-in-hand sensing, but did not implement full visual servo control; instead, look-and-move corrections were performed multiple times during the fruit approach (Baeten, Donn, Boedrij, Beckers, & Claesen, 2008).
The aim of our research was to provide a flexible modular framework for eye-in-hand sensing and motion control in robotic harvest applications as a standardised approach. In the aforementioned previous research, the designs of the sensing and visual servo control algorithms were ad hoc and application specific, and therefore hard to migrate to other use-cases. Our framework instead provides a consistent approach, designed to cope with a wide variety of applications. The framework is primarily designed for dense crops in agri- and horticultural robotics, where a single viewpoint is not sufficient for sensing. Other frameworks for sensing and visual servo control focus on the design of a single low level function (Bachiller, Cerrada, & Cerrada, 2003; Jara, Pomares, Candelas, & Torres, 2014; Mahony, 2011; Marchand et al., 2005; Wu, Lou, Chen, Hirche, & Kuhnlenz, 2010). Our aim is to provide a higher level framework architecture that spans the functionality required for a full robot application, as suggested in previous research (Bachiller, Cerrada, & Cerrada, 2006). This paper firstly provides the general design of the framework in Section 2.1 by describing the required functionalities and architecture of the software and its components. An example implementation for a sweet-pepper use case is then described in Section 3, along with qualitative tests under laboratory conditions. The primary aim of these tests was to i) demonstrate that the framework can execute an eye-in-hand sensing and visual servo control sequence and ii) extend the functionality of sensing with 3D scene reconstruction. The performance of the eye-in-hand sensing and motion control libraries is not validated. Section 4 describes the results of our research, followed by a discussion in Section 5.

Software
The functions of a robot can be divided into three broad primitives: sensing, planning and acting (Murphy, 2000). To organise the robotic behaviour with these primitives, one of several paradigms can be implemented in a software architecture. For a visual servo control task, a reactive paradigm is most applicable because it routes sensor information directly to actions. However, this omits any planning that an application may need. The hybrid deliberative/reactive paradigm introduces the planning primitive whilst also supporting reactive behaviour. In this paradigm, a global planner executes sub-tasks that can be either planned or reactive, acting as an intermediate coordinator of sensing information (Murphy, 2000). For our framework this paradigm was chosen because both planning and reactive tasks were used. A software architecture, or framework, that implements the hybrid deliberative/reactive paradigm should describe a set of components and their interaction (Dean & Wellman, 1991). For our framework five required functionalities were differentiated that fall into the three primitives of sensing, planning and/or acting: (i) image acquisition, (ii) fruit detection, (iii) application control, (iv) visual servo control and (v) robot control. The functions (i), (iii) and (v) fall into a single primitive. However, functions (ii) and (iv) overlap in the planning primitive because they also process, analyse and plan with data.
In Fig. 1 an overview of the functions for the framework is provided, divided over the robotic primitives.
Flexibility of the framework results from implementing its functions in independent modules. Through such a design pattern, functionality is replaceable and expandable with new features without revisions of other modules. This is in contrast to creating a single library that entangles all functionality, resulting in poor support for substitution of components and a limited separation of concerns (Felix & Ortin, 2014).
The functions were implemented in the middleware 'Robotic Operating System' (ROS) (Quigley et al., 2009). ROS allows the creation of a modular network of nodes that perform dedicated subsets of the computation, and it organises the communication between them. Furthermore, a shared stack of robotic libraries is available to all nodes to facilitate computations for robotics, such as timing, coordinate frame transformations and robot motion simulations. ROS also provides a set of basic communication policies such as services, publishers and subscribers, as well as a more advanced policy where actions can be monitored or pre-empted during a continuous feedback loop.
In Fig. 2 the suggested interaction architecture between the ROS nodes of the framework is displayed. Functionalities in the framework were explicitly separated to facilitate replacement of nodes and the addition of new functionalities. Furthermore, centralised functionalities avoid functional duplication across nodes. An additional function of simultaneous localisation and mapping (SLAM) was added to show the extensibility of the framework. The central position of the application control node in the communication allows for flexible coordination, as opposed to distributed control over several interacting nodes. In the following sections, the required functionality of each node will be described in detail.
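To illustrate this structure, a minimal sketch of how a single functionality could be wrapped in its own ROS node is given below. The topic names, message types and the dummy centroid computation are illustrative assumptions and not part of the implementation reported here; the sketch only shows how a node subscribes to images and publishes its result for other nodes to consume.

// Minimal sketch of one framework node: a fruit detection node that
// subscribes to camera images and publishes a detected centroid.
// Topic names and message choices are illustrative, not from the paper.
#include <ros/ros.h>
#include <sensor_msgs/Image.h>
#include <geometry_msgs/PointStamped.h>

class FruitDetectionNode {
public:
  explicit FruitDetectionNode(ros::NodeHandle& nh) {
    image_sub_ = nh.subscribe("camera/image_raw", 1,
                              &FruitDetectionNode::imageCallback, this);
    centroid_pub_ = nh.advertise<geometry_msgs::PointStamped>(
        "fruit_detection/centroid", 1);
  }

private:
  void imageCallback(const sensor_msgs::ImageConstPtr& img) {
    // Placeholder for the actual segmentation (see the CIELab blob
    // detection in Section 3); here a dummy centroid is published.
    geometry_msgs::PointStamped c;
    c.header = img->header;
    c.point.x = img->width / 2.0;   // pixel u-coordinate
    c.point.y = img->height / 2.0;  // pixel v-coordinate
    centroid_pub_.publish(c);
  }

  ros::Subscriber image_sub_;
  ros::Publisher centroid_pub_;
};

int main(int argc, char** argv) {
  ros::init(argc, argv, "fruit_detection_node");
  ros::NodeHandle nh;
  FruitDetectionNode node(nh);
  ros::spin();  // hand control to ROS; the callback runs on each incoming image
  return 0;
}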

Application control
The application control node fulfilled a coordinating role by communicating with all other nodes and processing their feedback information. This was achieved by implementing a finite state machine (FSM) based on previous research (Barth et al., 2014; Hellstrom & Ringdahl, 2013), which is similar to other FSM approaches such as ROS Commander (Nguyen, Ciocarlie, Hsiao, & Kemp, 2013) and SMACH (Bohren & Cousins, 2010). The FSM is a modular collection of states and their respective state transitions. In each state a certain sub-task of the robot application can be executed. The use of a state machine gave the framework another layer of flexibility by facilitating the reuse of states, the straightforward addition of states and the rerouting of transitions without requiring recompilation of the framework. The concern of coordination was thereby separated from the other nodes.

Image acquisition
The image acquisition node provided the framework with the functionality of creating a connection to a camera and grabbing colour and monochrome images upon request. ROS requires the images to be sent in the ROS image format. The node requires exposure settings and gains for each channel, which can be set in the ROS launch file. To obtain the gain parameters, a colour calibration procedure should be performed by manually adjusting the gain levels until the red, green and blue values of all pixels are equal for a recorded image of a grey calibration reference object. Rectification of the images before sending is required so that image coordinates in pixels can be related to real-world coordinates in metres. For this, the camera parameters should be known.

Fruit detection
The goal of the fruit detection node was to provide information about the fruit in a given image. Note that fruit detection is a broad term, within which at least three sub-functions relevant for harvesting robots can be distinguished: finding fruit, localising fruit in 3D and determining ripeness and/or harvestability. Depending on which sub-functions are required by the application, each sub-function should provide a service that returns a set of features of an image. These features can be descriptive, like surface areas, or geometrical, like the position of the largest fruit in the image. In this framework, visual servo control constrains the image analysis computation time, which should remain below 100 ms.

Visual servo control
The functionality provided by this node was to use image features to control the motion of a robot, using a continuous correctional feedback loop (Hutchinson et al., 1996) or on-line trajectory generation (Kröger, 2010). Image information could be used from one or more cameras, either located on the gripper or external to the robot. Independent of the camera configuration, the task required a set of geometrical visual features s to be extracted from the acquired image(s). In our framework the fruit detection node provided this functionality. To use the geometrical features for correcting the motion of the robot, a control law must be designed that realises the desired feature values s* by minimising the error (s − s*). For this an interaction matrix L_s, also known as the image Jacobian, needs to be approximated that models the relationship between the time variation of the features and the camera velocity v (Marchand et al., 2005). Vector v is also known as the kinematics screw vector, encoding the required variation in pose of the camera relative to the object. The general case of an eye-in-hand control law where camera velocities are computed is defined by:

$$\mathbf{v} = -\lambda\, \widehat{\mathbf{L}}_s^{+}\, (\mathbf{s} - \mathbf{s}^{*}) \qquad (1)$$

where λ is the proportional coefficient of the exponential convergence of the error and $\widehat{\mathbf{L}}_s^{+}$ is the pseudo-inverse of the estimation of the interaction matrix, which is parameterised by the intrinsic camera parameters (focal length, image sensor format and principal image point) and feature location information (m) relative to the camera frame. To add robot motion control to the framework, the Visual Servoing Platform library (ViSP) was wrapped in a ROS node (Marchand, 1999; Marchand et al., 2005). ViSP allows for rapid prototyping of visual servo control algorithms and was specifically developed for high-level applications. With ViSP a set of elementary tasks can be created by combining visual features. It is designed to be modular, hardware-independent, extendable and portable, making this library highly suitable as a key component for our highly flexible and modular framework. Under the assumption that visual features are defined upon geometrical primitives, such as points or lines, ViSP can approximate the interaction matrix analytically using a previously proposed method (Espiau, Chaumette, & Rives, 1992).
In each iteration of the servo control loop, this ROS node computes the velocity vector v given (i) a set of desired geometric features s * , (ii) a set of current geometric features s and (iii) the convergence coefficient, ranging from 0 to 1. The velocity vector encodes the required change in pose applied to the end-effector to converge to the desired feature values. Note that in some cases convergence and stability problems may occur (Chaumette, 1998).
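As an illustration of this loop, a minimal sketch of a single servo iteration using ViSP's task interface is given below, assuming the vpServo and vpFeaturePoint classes of ViSP (header paths differ between ViSP versions). The feature values and the depth are placeholders, not values measured in the experiments.

// Minimal sketch of one visual servo iteration with ViSP's task interface.
// Feature values and the depth estimate are illustrative placeholders.
#include <visp/vpServo.h>
#include <visp/vpFeaturePoint.h>
#include <visp/vpColVector.h>
#include <iostream>

int main() {
  vpServo task;
  task.setServo(vpServo::EYEINHAND_CAMERA);         // eye-in-hand, camera-frame velocities
  task.setInteractionMatrixType(vpServo::CURRENT);  // estimate L_s from the current features
  task.setLambda(0.3);                               // convergence coefficient lambda

  // Current feature s: normalised image coordinates (x, y) and depth z in metres.
  vpFeaturePoint s;
  s.buildFrom(0.05, -0.02, 0.40);
  // Desired feature s*: the fruit centred in the image at the same depth.
  vpFeaturePoint s_star;
  s_star.buildFrom(0.0, 0.0, 0.40);
  task.addFeature(s, s_star);

  // In the real loop, s would be updated from the fruit detection node each cycle
  // and the resulting velocity forwarded to the robot control node.
  vpColVector v = task.computeControlLaw();  // 6x1 camera velocity screw
  std::cout << "v = " << v.t() << std::endl;
  return 0;
}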

Robot control
The robot control node provided the functionality to move the end-point of a robot to a desired pose, receiving a real-world Cartesian coordinate and returning a movement status. Visual servo control requires updating the end-effector goal pose multiple times per second. This can be achieved by goals that can be pre-empted or when joint motions of the robot are directly accessible. In Section 3 the way this was implemented in the Baxter robot is described.

Simultaneous localisation and mapping
The aim of this node is to provide additional sensing information from images by implementing a simultaneous localisation and mapping (SLAM) method. Largely used for unknown and unmapped environments, this approach allows for camera pose estimations and three-dimensional scene reconstruction (Durrant-Whyte & Bailey, 2006). From this information, relative depth between objects in the scene can be derived.
For this purpose, the Large Scale Direct SLAM (LSD-SLAM) library was used (Engel, Schöps, & Cremers, 2014). In contrast to other methods, it runs in real time on modern CPUs and uses a featureless approach working directly on monocular image intensities. The library was already wrapped in a ROS Indigo node, hence no modifications for our framework were required. However, by default only the visual odometry (estimated camera pose) in combination with relative depth map keyframes were published in the ROS system. Therefore the three-dimensional reconstruction was not available outside the LSD-SLAM node. To add this feature in ROS, a bridge node was implemented which concatenated multiple depth keyframes after transforming them to the same world frame using the SLAM's published odometry information.
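A minimal sketch of such a bridge node is given below. It assumes the keyframes have already been converted to sensor_msgs/PointCloud2 in the camera frame and that the SLAM odometry is available on tf; the actual LSD-SLAM keyframe message differs from this, and the topic and frame names are illustrative only.

// Minimal sketch of a bridge node that accumulates depth keyframes into one
// world-frame point cloud. Keyframes are assumed to arrive as PointCloud2
// messages expressed in the camera frame; the SLAM odometry is assumed to be
// available on tf. Topic and frame names are illustrative.
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl_conversions/pcl_conversions.h>
#include <pcl_ros/transforms.h>
#include <tf/transform_listener.h>

class SlamBridgeNode {
public:
  explicit SlamBridgeNode(ros::NodeHandle& nh)
      : map_(new pcl::PointCloud<pcl::PointXYZ>) {
    sub_ = nh.subscribe("lsd_slam/keyframe_cloud", 5,
                        &SlamBridgeNode::keyframeCallback, this);
    pub_ = nh.advertise<sensor_msgs::PointCloud2>("scene_reconstruction", 1);
  }

private:
  void keyframeCallback(const sensor_msgs::PointCloud2ConstPtr& msg) {
    pcl::PointCloud<pcl::PointXYZ> keyframe, keyframe_world;
    pcl::fromROSMsg(*msg, keyframe);
    // Transform the keyframe from its camera frame into the common world
    // frame using the pose published by the SLAM node.
    if (!pcl_ros::transformPointCloud("world", keyframe, keyframe_world, listener_))
      return;  // transform not yet available
    *map_ += keyframe_world;  // concatenate into the accumulated reconstruction

    sensor_msgs::PointCloud2 out;
    pcl::toROSMsg(*map_, out);
    out.header.frame_id = "world";
    out.header.stamp = msg->header.stamp;
    pub_.publish(out);
  }

  ros::Subscriber sub_;
  ros::Publisher pub_;
  tf::TransformListener listener_;
  pcl::PointCloud<pcl::PointXYZ>::Ptr map_;
};

int main(int argc, char** argv) {
  ros::init(argc, argv, "slam_bridge_node");
  ros::NodeHandle nh;
  SlamBridgeNode node(nh);
  ros::spin();
  return 0;
}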

Hardware
The hardware used for testing the framework consisted of a camera attached to the end-effector of a robot. The camera was connected to the computer through USB and the robot was connected to the computer through an ethernet connection.

Robot
The framework was applied and tested on a Baxter robot by Rethink Robotics (Fitzgerald, 2013), depicted in Fig. 3. The robot was designed to mimic and replace workers on a production line, performing tasks such as sorting or picking and placing parts. The human-sized robot has two mirrored arms with 7 degrees of freedom each, although only one was used in our setup. The robot was chosen for its native ROS support. Baxter runs a ROS master core to which target joint angles can be published, which are executed by an internal controller. Baxter also provides an inverse kinematics (IK) service for calculating joint angles given a three-dimensional Cartesian coordinate relative to the robot frame. Furthermore, Baxter publishes the pose of the end-effector and all joints. The standard Baxter end-effector was used.

Camera
A USB CMOS colour Autofocus Camera (DFK 72AUC02-F, TheImagingSource, Germany) was attached on top of the tool centre point of the robot, to which the standard end-effector was also mounted. Images were grabbed by the ROS image acquisition node with a rolling shutter at a resolution of 640 × 480 pixels. An M12×0.5 mount lens with a focal length of 4.6 mm was attached. Exposure was set to 50 ms, allowing for a frame rate of 20 images s⁻¹. The autofocus feature was not available under the Linux operating system and was not used.

Computer
The framework was run on a MacBook Pro, 2.4 GHz Intel Core i5 with 8 GB of DDR3 memory operating on Ubuntu 12.04 Precise Pangolin.

Methods
To validate the design of the eye-in-hand sensing and motion control framework, we implemented the software functionalities described in Section 2.1 with the hardware described in Section 2.2 for the dense sweet-pepper crop use case. For this purpose, nodes for the robot control, image acquisition, fruit detection, visual servo control and the application control were implemented to provide the required functionality. All nodes were implemented in C++ under ROS, version Indigo. For the image acquisition and fruit detection nodes, the industrial machine vision library MVTec Halcon 11.0 (MVTec Software GmbH, 2015) was used by wrapping the Halcon HDevEngine in a ROS Indigo node. Upon initialisation of the ROS nodes, a set of custom Halcon procedures was loaded into memory. The procedures were not hardcoded in the source, but specified in the ROS launch file, allowing them to be updated or replaced without recompilation of the framework. Note that open source image processing libraries, e.g. OpenCV (Culjak, Abram, Pribanic, Dzapo, & Cifrek, 2012), can replace the commercially licensed Halcon library with minor effort. The framework was tested with the hardware in combination with an artificial dense sweet-pepper crop under laboratory conditions. In the following sections, the use case is further specified, followed by the experimental setup of the laboratory tests and the use-case specific software implementation of the framework.

Use case description
Sweet-pepper (Capsicum annuum) is a high-value crop that is currently harvested manually in high-wire greenhouse cultivation systems. Due to their organised, repetitive structure, as seen on the left in Fig. 4, these systems are suitable for the introduction of robots. Unlike in other crops, fruit visibility from a single viewpoint is low due to occlusions by other plant parts (Hemming et al., 2014b). This can be seen in the right image of Fig. 4, where in the foreground a ripe sweet-pepper is occluded by the stem and wire and a green pepper is partially occluded by leaves. Furthermore, the location of a sweet-pepper can be ambiguous, as a red patch in an image can either be a wholly visible sweet-pepper in the background or a highly occluded sweet-pepper in the foreground. Solving the visibility problem in dense crops requires an approach that uses multiple viewpoints, either before or during the approach to the target fruit. Moreover, whilst executing the motion towards the target, corrections to the path are required because the dense vegetation is easily moved and the target displaced by a robot entering the crop. Target reachability is another issue, as individual plants are spaced between 0.1 and 0.3 m apart (Bac, 2015). Obstacle avoidance may therefore be required.

Experimental setup
An artificial sweet-pepper crop section was created using plastic imitation fruit and leaves. Although colour and reflective properties were similar to real fruit for the human eye, they differ in other material properties such as hyperspectral information and firmness. However, the materials sufficed for our purposes since only RGB analysis was required. The main nerves of the leaves were fitted with a metal wire, allowing the leaves to be shaped. The leaves and the fruit were attached to a vertically placed thin wooden pole that represented a stem. By shaping the leaves, different degrees of occlusion could be realised. In Fig. 5 a 360° view at 45° increments of a typical setup is displayed. The visibility of the fruit depended on the perspective; the fruit were fully visible in one view and entirely occluded in another. In many views, leaves or stems partially occluded the fruit. The objective of this experiment was to show that the robot could find and access the fruit. Therefore, it was sufficient that the end-effector stopped just in front of the fruit. This was assured by placing the target fruit just out of reach of the maximum robot arm stretch at 1.05 m. The test crop was placed at around 10 different locations, within an arc of around 0.5 m in front of the robot. The workspace towards the front of the robot, and therefore the number of test locations, was limited because the robot's workspace was primarily designed for pick and place operations in the horizontal plane. At each location, the occlusion of the test crop was varied and multiple state machine cycles were executed. Qualitative results of the framework's performance on a dense crop were registered and these results will be discussed in the next section.

Use case specific framework implementation

Robot control
Two methods for motion control of the Baxter robot were implemented. The first is a ROS actionlib service, which enabled pre-emptable tasks. With this method the status of longer movement actions could be tracked and aborted, making it suitable for moving to waypoints. The second method is direct robot joint angle control, allowing the rotation of each individual joint to be changed continuously. This method provides short motions that can be updated during the movement, making it suitable for visual servo control. Both methods call Baxter's inverse kinematics ROS service. For this service, a desired pose in real-world coordinates can be specified for the end-effector. The service returns a set of joint angles to move the arm to the target location, or gives feedback when the pose is unreachable or collisions are expected.
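As an illustration of such a request, a minimal sketch is given below, assuming the SolvePositionIK service interface provided by the Baxter SDK (baxter_core_msgs); the service name, message fields and target pose follow that SDK and are shown for illustration only, since the interface is not defined in this paper.

// Minimal sketch of requesting joint angles from Baxter's IK service for a
// desired end-effector pose (interface assumed from the Baxter SDK).
#include <ros/ros.h>
#include <geometry_msgs/PoseStamped.h>
#include <baxter_core_msgs/SolvePositionIK.h>

int main(int argc, char** argv) {
  ros::init(argc, argv, "ik_request_example");
  ros::NodeHandle nh;
  ros::ServiceClient ik_client = nh.serviceClient<baxter_core_msgs::SolvePositionIK>(
      "ExternalTools/left/PositionKinematicsNode/IKService");

  geometry_msgs::PoseStamped target;
  target.header.frame_id = "base";
  target.header.stamp = ros::Time::now();
  target.pose.position.x = 0.60;   // desired end-effector position in the robot frame (m)
  target.pose.position.y = 0.30;
  target.pose.position.z = 0.20;
  target.pose.orientation.w = 1.0; // identity orientation, for illustration

  baxter_core_msgs::SolvePositionIK srv;
  srv.request.pose_stamp.push_back(target);

  if (ik_client.call(srv) && !srv.response.isValid.empty() && srv.response.isValid[0]) {
    // srv.response.joints[0] holds joint angles to forward to the joint controller.
    ROS_INFO("IK solution found with %zu joints", srv.response.joints[0].name.size());
  } else {
    ROS_WARN("Target pose unreachable; no IK solution returned");
  }
  return 0;
}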

Image acquisition
The image acquisition node implemented a connection with the camera through Halcon. A service was provided to send colour or monochrome ROS images upon request. Grabbed Halcon format images were efficiently bridged to the ROS image format before sending. Furthermore this node also visualized grabbed images. In order to rectify the image, Halcon's default procedures were used for multi-view 3D calibration. For this purpose, a set of 100 images was taken of a calibration plate in various locations and orientations. Internal camera parameters were calculated and applied to each new image to remove lens distortion.
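For illustration, a minimal sketch of this rectification step using OpenCV's calibration model instead of Halcon's procedures is given below; the intrinsic and distortion values are placeholders, not the parameters of the camera used in this work.

// Minimal sketch of removing lens distortion from a grabbed image using
// previously calibrated internal camera parameters (placeholder values).
#include <opencv2/opencv.hpp>

cv::Mat rectify(const cv::Mat& raw) {
  // Camera matrix: focal lengths (pixels) and principal point (u0, v0).
  cv::Mat K = (cv::Mat_<double>(3, 3) << 960.0,   0.0, 320.0,
                                           0.0, 960.0, 240.0,
                                           0.0,   0.0,   1.0);
  // Distortion coefficients in OpenCV order (k1, k2, p1, p2, k3).
  cv::Mat dist = (cv::Mat_<double>(1, 5) << -0.2, 0.05, 0.0, 0.0, 0.0);

  cv::Mat rectified;
  cv::undistort(raw, rectified, K, dist);  // remaps pixels to the undistorted image
  return rectified;
}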

Fruit detection
During the research project CROPS, an end-effector was developed that did not require the orientation of the fruit nor an exact position thereof for a successful harvest (Van Tuijl, Wais, & Yael, 2013; Hemming, Van Tuijl, Gauchel, & Wais, 2014c). This reduces the constraints on fruit detection, allowing for simpler and faster approaches, which are suited for visual servo control. Other approaches that calculate exact poses or use three-dimensional object matching are generally more time consuming and therefore less appropriate for visual servo control, although such approaches can be effectively applied, for example in grasp synthesis. A previous approach for sweet-pepper detection (Song et al., 2014) classified image features: a colour based classification provided the regions of interest in multiple images, from which maximally stable colour region features (Forssen, 2007) were extracted. Because the computational complexity and temporal performance were not reported, it is unknown whether this approach can meet the time constraint in visual servo control.
To implement a simple and fast fruit detection for sweet-pepper, an advanced blob detection was created. The analysis starts by converting the image from the RGB to the CIELab colourspace using Eqs. (2)–(6). Contrary to colourspaces that encode colour on a single axis, CIELab has two axes for colour, a and b, and one for luminosity, l. Because the a axis encodes a spectrum separating green from violet-red, this channel provided distinctive contrast between red sweet-pepper and the green surroundings. Note that for other use cases or colours of sweet-pepper, a transformation to the HSI colourspace might be more suitable.
For segmentation of the sweet-pepper blobs, a-channel values ranging from 25 to 70 were selected. This segmentation generally contains noise, which was filtered out by a region opening operation (a dilation of an erosion) using a 5-pixel round structuring element, twice as large as the noise to be filtered out. From the largest remaining region, the size and the image coordinates of the centre of gravity were calculated and returned to the application control node as features. In Fig. 6 intermediate results of the image processing pipeline applied to an example image are displayed.
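As an illustration, a minimal sketch of this pipeline using OpenCV in place of the Halcon procedures is given below (as noted in Section 3, OpenCV can replace Halcon). The shift of the a-channel range by +128 reflects OpenCV's 8-bit CIELab encoding and is an assumption about how the thresholds would map; it is not a calibrated result.

// Minimal sketch of the colour-based blob detection with OpenCV.
// Returns the centre of gravity (pixels) and area of the largest red blob.
#include <opencv2/opencv.hpp>
#include <vector>

bool detectLargestFruit(const cv::Mat& bgr, cv::Point2d& centre, double& area) {
  cv::Mat lab, mask;
  cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);

  std::vector<cv::Mat> channels;
  cv::split(lab, channels);  // channels[1] is the a axis (offset by 128 in 8-bit images)
  cv::inRange(channels[1], cv::Scalar(25 + 128), cv::Scalar(70 + 128), mask);

  // Region opening (erosion followed by dilation) with a 5-pixel round element
  // removes small noise regions.
  cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
  cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);

  std::vector<std::vector<cv::Point> > contours;
  cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
  if (contours.empty()) return false;

  // Keep only the largest remaining region and compute its centre of gravity.
  size_t largest = 0;
  for (size_t i = 1; i < contours.size(); ++i)
    if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest])) largest = i;

  cv::Moments m = cv::moments(contours[largest]);
  if (m.m00 <= 0.0) return false;
  area = m.m00;
  centre = cv::Point2d(m.m10 / m.m00, m.m01 / m.m00);
  return true;
}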

Visual servo control
To implement visual servo control for this use case, geometric features needed to be defined. However, high occlusion rates prevent deriving a full object pose as a reliable feature; instead, only parts of the fruit can be seen from a subset of all viewpoints. The centre of gravity image coordinates of the largest segmented sweet-pepper part were therefore returned as a feature, as described in Section 3.3.3. This can be used as a geometrical feature as it defines a point in two-dimensional space. The desired value of this feature was set to the centre coordinates of the image.
In this application the control law for image-based visual servo control and eye-in-hand tasks in Eq. (1) is used. In ViSP the estimation of the interaction matrix for a 2D point feature (x, y) is given by:

$$\widehat{L}_s = \begin{bmatrix} -1/z & 0 & x/z & xy & -(1+x^2) & y \\ 0 & -1/z & y/z & 1+y^2 & -xy & -x \end{bmatrix}$$

where z is either a known or an estimated feature depth in the camera frame. In our application, the estimate of z had a starting value of 0.40 m, as this was the starting distance of the crop scanning described in Section 3.3.5. This value should be updated in each visual servo cycle. The interaction matrix also requires the positions of visual features expressed in metres rather than image pixel coordinates. For this, the previously obtained camera parameters (Section 2.1.2) were used for a perspective projection without a distortion model. The parameters were the x-coordinate of the image centre u_0, the y-coordinate of the image centre v_0, the horizontal sensor pixel size p_x and the vertical sensor pixel size p_y. If we define (u_i, v_i) as the position of a pixel in the image, then the position of that pixel in metres in the camera frame can be obtained by:

$$x = (u_i - u_0)\, p_x, \qquad y = (v_i - v_0)\, p_y$$

The coefficient of the exponential convergence of the error λ in the control law was set to the default value of 0.3.
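To make the computation concrete, a worked numeric sketch of Eq. (1) for this single point feature is given below, using Eigen for the linear algebra. The intrinsic parameter values, pixel coordinates and resulting velocity are illustrative placeholders, not calibration or experimental results.

// Worked sketch of Eq. (1) for a single point feature: convert the detected
// pixel to metres, build the 2x6 interaction matrix at the assumed depth and
// compute the camera velocity screw. All numeric values are placeholders.
#include <Eigen/Dense>
#include <iostream>

int main() {
  // Illustrative intrinsics: principal point (u0, v0) in pixels, pixel sizes (px, py) in metres.
  const double u0 = 320.0, v0 = 240.0;
  const double px = 4.8e-6, py = 4.8e-6;
  const double z = 0.40;      // assumed feature depth in the camera frame (m)
  const double lambda = 0.3;  // convergence coefficient

  // Detected fruit centre of gravity in pixels (example) and its desired
  // location, the image centre.
  const double ui = 400.0, vi = 200.0;
  const double x = (ui - u0) * px, y = (vi - v0) * py;  // current feature s in metres
  const double xd = 0.0, yd = 0.0;                      // desired feature s*

  // Interaction matrix of a 2D point feature (Espiau et al., 1992).
  Eigen::Matrix<double, 2, 6> L;
  L << -1.0 / z,      0.0, x / z,       x * y, -(1.0 + x * x),  y,
            0.0, -1.0 / z, y / z, 1.0 + y * y,         -x * y, -x;

  Eigen::Vector2d error(x - xd, y - yd);

  // Moore-Penrose pseudo-inverse of the full row rank 2x6 matrix: L^T (L L^T)^-1.
  Eigen::Matrix<double, 6, 2> L_pinv = L.transpose() * (L * L.transpose()).inverse();

  Eigen::Matrix<double, 6, 1> v = -lambda * L_pinv * error;  // Eq. (1)
  std::cout << "camera velocity screw v =\n" << v << std::endl;
  return 0;
}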

Application control
The application control node's FSM was implemented for the sweet-pepper use case. Six states were created that each executed a sub-task in the program. The states and their transitions are displayed in Fig. 7.
The program started in the ColdBoot state where all hard- and software modules were initialised. When all modules were initialised, the state machine advanced to the Ready state, which waited for an external trigger to start a harvest cycle. First, the robot moved to an initial home position, defined as 0.40 m in front of the plant. When this position was reached, the state machine started a sensing procedure in the ScanPlant state. During this procedure the end-effector moved to predefined waypoints in the horizontal plane whilst remaining faced towards the plant. In Fig. 8 a visual representation of the procedure is shown. The waypoints were chosen based on a set of constraints consisting of i) the robot workspace, ii) representative greenhouse conditions, such as a maximum distance of 0.40 m from the target and a restricted maximum angle of approach of around 90°, and iii) having at least one viewpoint with a fully occluded fruit and one viewpoint with a fully visible fruit. The motion planning involved during the plant scanning phase is a closed-loop execution of a planned motion trajectory, provided by the inverse kinematics solver of Baxter as described in Section 2.2.1.
In parallel, the image acquisition node was continuously triggered at 20 Hz to obtain images for (i) the SLAM node and (ii) the fruit detection node. The ScanPlant state continuously saved the pose in which the largest fruit part was detected. This pose was set as the visual servo start pose after the plant was fully scanned, under the assumption that a starting pose with a large fruit visibility from the end-effector would result in a more effective final positioning thereof. If no fruit was found, the state machine returned to the Ready state. Otherwise the visual servo control loop commenced in the Visual Servo state. Each loop cycle triggered and analysed an image for geometrical features. These features were used to calculate the pose correction vector in the camera frame, which, when applied to the end-effector, centres the camera on the fruit. For reaching towards the fruit, the end-effector moved along its z-axis with a constant speed of 0.03 m s⁻¹. Depth information was therefore not required. Because the fruit target was placed just outside the workspace of the robot, the fruit position was determined as reached when the arm was fully extended and the camera was centred on the fruit. The state machine then returned to the Ready state.
It is assumed that the control law improves the positioning of the end-effector with regard to the fruit centre and therefore ensures the fruit does not leave the view during the visual servo approach. However, loss of the target from view cannot be excluded, and a feature that either reverts to a last known pose that includes a view of a fruit or resumes the plant scanning state has yet to be implemented.
All states after the Ready state could transition to an Error state. For example, in the Home and ScanPlant state this could occur when the given waypoint pose could not be reached. The Visual Servo state returned to the Error state when no fruit was found during scanning. In the Error state each error could be handled depending on the transition. In this implementation it automatically returned to the Ready state whilst prompting the cause of failure.
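As an illustration, the state machine can be reduced to a plain switch-based loop; the sketch below mirrors the six states and the transitions described above, with stub predicates standing in for the feedback provided by the other nodes. It illustrates the structure only and is not the FSM implementation used in the framework.

// Minimal sketch of the six-state FSM (ColdBoot, Ready, Home, ScanPlant,
// VisualServo, Error). Predicates are stubs replacing node feedback.
#include <iostream>

enum class State { ColdBoot, Ready, Home, ScanPlant, VisualServo, Error };

bool modulesInitialised()   { return true; }  // all hard- and software modules up
bool startTriggerReceived() { return true; }  // external trigger to start a cycle
bool homePoseReached()      { return true; }  // 0.40 m in front of the plant
bool fruitFoundDuringScan() { return true; }  // largest fruit part detected
bool fruitReached()         { return true; }  // arm extended, camera centred on fruit

// One transition step: execute the sub-task of the current state and return
// the next state, mirroring the transitions in Fig. 7.
State step(State s) {
  switch (s) {
    case State::ColdBoot:    return modulesInitialised()   ? State::Ready       : State::ColdBoot;
    case State::Ready:       return startTriggerReceived() ? State::Home        : State::Ready;
    case State::Home:        return homePoseReached()      ? State::ScanPlant   : State::Error;
    case State::ScanPlant:   return fruitFoundDuringScan() ? State::VisualServo : State::Ready;
    case State::VisualServo: return fruitReached()         ? State::Ready       : State::Error;
    case State::Error:       std::cout << "failure handled" << std::endl; return State::Ready;
  }
  return State::Error;
}

int main() {
  State s = State::ColdBoot;
  for (int i = 0; i < 6; ++i) {  // run a handful of transitions for illustration
    s = step(s);
  }
  return 0;
}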

Results
The framework was implemented and executed for a dense sweet-pepper crop use case. Source code of the implementation of the framework is available under the BSD License at: https://github.com/rbrth/framework. The execution of the application resulted in the robot (i) scanning the plant for fruit and (ii) moving towards the centre of a fruit. A video example can be found at: http://dx.doi.org/10.1016/j.biosystemseng.2015.12.001. In Fig. 9 the last frame of this video is displayed, showing the fruit detection segmentation in the top left and the relative depth estimation from the SLAM node at the bottom left. On the right, the final pose of the end-effector is shown, with the camera centred towards the fruit. The execution time of a single successful execution of all states (excluding the error state) was approximately 45 s.

Plant scanning
During the plant scanning state, the fruit detection node continuously analysed images from the end-effector. The achieved rate of analysis was 20 Hz, matching the image acquisition frame rate. In most cases the visited waypoints provided sufficient viewpoints to find a suitable starting location for the visual servo control, meaning that a surface of the fruit was found.

Servo control
During the visual servo control the fruit became more visible, often entirely. The end pose of the end-effector was always centred with the fruit, except for the instances in which the inverse kinematics solver of the robot failed to find a solution of the joint positions. These occasions were characterised by the robot arm already being fully extended to its limits, but not yet horizontally or vertically aligned with the fruit. In Section 5 the cause and solutions to this phenomenon will be discussed.
In a case where a small patch of the fruit surface was made visible from all viewpoints in the horizontal plane of the fruit, the servo control moved the end-effector in the vertical plane to centre it whilst revealing more fruit surface. The result was a curved motion trajectory towards the centre of the fruit. When plant dislocation by a robot entering the crop was simulated by moving the fruit within view of the camera during visual servo control, the algorithm corrected the end-effector accordingly by following the centre of the fruit.

Simultaneous localisation and mapping
During all motions of the robot, the SLAM node ran in parallel to obtain three-dimensional information of the scene and a current estimated pose of the camera. The average publishing rates of new keyframes and pose estimations were 5 and 10 Hz, respectively. As shown in Fig. 10, depth estimations are primarily found on edges in the image. This result is native to the LSD-SLAM library because it operates on image intensity differences. In Section 5 the usability of this result will be discussed. The bridge node merged and aligned multiple pointclouds from keyframes of the LSD-SLAM node, as displayed in Fig. 11.
Again, object edges are most discernible; relative depth information within objects without strong texture gradients is sparse.

Fig. 11. Three-dimensional scene reconstruction. Multiple LSD-SLAM keyframes from different perspectives of the camera on the end-effector were merged in a pointcloud, visualised in ROS RViz with the Baxter robot pose. An artificial sweet-pepper placed in front of the end-effector is indicated by the arrow; other structures in the background can be recognised by their edges. The colour gradient indicates relative depth from the end-effector.

Discussion
The primary aim of this research was to design a framework for eye-in-hand sensing and motion control to facilitate the development of new robotic harvest applications, especially in dense crops. On a low level, many stand-alone and functionally dedicated libraries are already available to solve parts of this challenge, e.g. ROS, ViSP and LSD-SLAM. Our framework coherently integrates these parts to provide a higher level functional implementation. It can replace custom solutions by providing a standardised approach that supports a variety of use cases. Flexibility is achieved through separation of functional concerns in different modules (Felix & Ortin, 2014). One of the key aspects of the framework design was to add a high degree of implementation flexibility to meet specific use-case constraints. The secondary aim of our research was to demonstrate a framework implementation for a dense sweet-pepper crop use-case. The example implementation shows the framework can be applied for a sweet-pepper use-case with custom hardware. The added sensing functionality of 3D scene reconstruction shows the extensibility of the framework, without interference with the original software. Results also indicate that the framework can be effective for solving sensing and robot motion control in a dense crop based on visual information from the end-effector, although this should be further explored in a quantitative study under greenhouse conditions.

Other approaches for solving robot sensing and motion control are not always suitable for agri- and horticulture applications due to dense crop vegetation. For sensing fruit, occlusions by other plant parts can result in false negatives when only a single viewpoint is used (Hemming et al., 2014b; Van Henten et al., 2002). Furthermore, using multiple poses during harvest attempts increases success rates (Henten et al., 2003). Our framework allows for the acquisition of multiple viewpoints by using the motion of the robot in combination with an eye-in-hand approach. During the motion control towards the fruit, corrections may be required when the robot displaces a target after interacting with the dense crop. It is hypothesised that additional viewpoints during motion towards the target can resolve these problems, as well as provide more detailed information about the target. Eye-in-hand sensing is preferred as the perspective from the end-effector is likely to face the target. Some approaches already devote one or more discrete look-and-move actions to correct for displaced targets or to refine rough estimated target positions (Hayashi et al., 2010; Van Henten et al., 2002). The approach implemented in our framework uses continuous corrective actions through visual servo control, thereby enabling more corrections and more target information.
On an abstract level, the design of the framework provided a software architecture guideline for approaching robot harvest system implementations. The key element of this design was to separate functional concerns to enable modularity, resulting in a system that facilitates extensions and replacements. For example, substituting a camera only affected a single node and could be achieved by replacing a single line of code without the need for recompilation. Adding a node that analyses shared resources does not require other nodes to be changed. Expanding on previous research (Barth et al., 2014; Hellstrom & Ringdahl, 2013), our implementation of this abstract level was done in the ROS middleware, which is innately modular but does not separate concerns, or functions, automatically. Here each concern was assigned to an individual ROS node, e.g. the concern of coordination, which in itself implements a modular FSM. The implementation of the framework thus remained dependent on ROS and Linux, which constrained its flexibility and usability. For ROS developers this framework is most directly usable; for others it provides a useful abstract architectural design guideline. Although modules were functionally separated, they were not fully functionally independent. A notable example is the fruit detection node, which was time constrained by the visual servo node. Such dependencies should be identified and avoided, for example by distributing the computation for visual servo control (Wu et al., 2010).

The framework was extended with a SLAM node for depth estimation and 3D reconstruction. The nature of the LSD-SLAM algorithm is to look for differences in image intensities, therefore finding matches at texture edges depending on scene contrasts. Optimal scene reconstruction requires stationary scenes; scene reconstruction should therefore only be used when the scene is static, e.g. during plant scanning. Further optimisation of scene reconstruction could be achieved with pointcloud registration methods (Rusu & Cousins, 2011). Our framework provides a guideline and implementation for robotic harvest applications that enables access to visual servo control and real-time sensing in a coherent and flexible approach. The relevance of the framework should be confirmed in other use-cases, under real conditions and with further extensions.
The framework was successfully implemented for a sweet-pepper use case and tested under controlled laboratory conditions. Results showed that the application was able to scan an occluded crop for fruit and move the end-effector towards the detected centre of a part or the whole of the fruit. A single geometric feature for the servo control algorithm proved sufficient to centre the end-effector and reach the fruit. Furthermore, no depth information was required for a successful servo motion. This indicates that a simple and straightforward approach can solve the motion challenge. However, relative depth estimation or descriptive image features, like fruit size, may be required to determine whether the centre of the fruit has been reached. Another possible method includes an air pressure sensor in the suction cup of the end-effector, which triggers upon fruit contact as described in Hemming et al. (2014c).

The execution time per harvest cycle can be improved. Although the Baxter robot has a maximum speed of 0.6 m s⁻¹, a high speed resulted in oversteering during our experiments. When the goal was reached, the spring kinematics of the joints dampened the movement, which resulted in briefly imprecise positioning. In a visual servo control loop, consecutive over- and understeering therefore produces an increasing spatial oscillation. Decreasing the speed in our experiments resulted in a more accurate movement with no oversteer, eliminating oscillations. For a real world application, it is suggested that a robot with more precise and faster joint controls is required. The use case implementation and experiments showed that eye-in-hand sensing and motion control is a viable approach for robotic harvesting of dense crops such as sweet-pepper and cucumber. To further validate the use case implementation, a more advanced study under real greenhouse conditions is suggested.
Although no quantitative data were collected, the qualitative performance of the visual servo control library indicated that a more advanced study under real greenhouse conditions is viable. Whilst the artificial crop setup provided a good reflection of the occlusion problems faced in everyday practice, it remains a simplification of the real crop situation, as occlusions from stems and fruit clusters were not taken into account. Nonetheless the framework implementation showed that where a small patch was made visible, the robot managed to find the fruit and move the end-effector towards the fruit centre. This indicates that our approach can be effective under real greenhouse conditions.

The framework can potentially be implemented for robotic harvest use-cases in greenhouses, such as sweet-pepper, cucumber, tomato or strawberry, or in a more agricultural setting for crops such as apple, citrus or broccoli. The feature of using multiple sensing viewpoints is especially functional for dense crops, as multiple viewpoints may be required for fruit detection. Although a distinction can be made between hard and soft obstacles (Bac, Hemming, & Van Henten, 2013), where the former (e.g. stems) must be avoided at all costs but the latter (e.g. leaves) can be displaced by the robot to a certain extent, the use of visual servo control may be restricted by the presence of obstacles. Our use-case was successfully implemented using a single geometric point as a feature for the visual servo control. For use-cases that require a fixed end-pose, multiple geometric features can be used. However, this requires absolute depth information to model the interaction matrix in Eq. (1). The LSD-SLAM library provides relative depth information because the absolute scale of the scene cannot be recovered from monocular images alone. To obtain absolute depth information with this library, a calibration procedure can be performed during the initial start of the application by scanning a structure with known dimensions.

In the current growing practice crops are frequently revisited to harvest newly ripened fruit; a possible future extension of this research could therefore be to use scene reconstruction to create a world model. The resulting model could be used to i) retry failed harvesting approaches, ii) skip sensing at harvested points, iii) use a crop growth model to extrapolate the position of ripened fruit and iv) update the model. For creating a world model, reconstruction could be limited to relatively stationary plant parts (e.g. fruit and stems) as opposed to frequently moving parts (e.g. leaves that follow the sun). For creating a subset reconstruction, plant part segmentations could be used (Bac, Hemming, & Van Henten, 2014).

Conclusion
The significance of low level libraries that are available to the robot research community to share and build upon common functionality is evident, but individual libraries tackle only isolated concerns, e.g. ROS as communication middleware or ViSP for visual servo control algorithms. At a higher level, a framework can combine these building blocks coherently to provide an orderly structure and a new dimension of utility for a specific set of use-cases. The aim of our research was to provide such a framework for solving two key issues in robot harvesting applications. The first was sensing in dense crops with high fruit occlusions, which requires multiple viewpoints to lower the number of false negative fruit detections. The second was the motion execution using a visual feedback loop. Implementation of this framework for a sweet-pepper use case with dense vegetation indicated the viability of the framework and provided insights for further development. Future research should focus on testing the framework under real greenhouse conditions, on different use-cases, and on extending its functionality.