Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning

Autonomous systems are becoming ubiquitous and gaining momentum within the marine sector. Since the electrification of transport is happening simultaneously, autonomous marine vessels can reduce environmental impact, lower costs, and increase efficiency. Although close monitoring is still required to ensure safety, the ultimate goal is full autonomy. One major milestone is to develop a control system that is versatile enough to handle any weather and encounter situation while remaining robust and reliable. Additionally, the control system must adhere to the International Regulations for Preventing Collisions at Sea (COLREGs) for successful interaction with human sailors. Since the COLREGs were written for the human mind to interpret, they are formulated in ambiguous prose and therefore neither machine-readable nor verifiable. Due to these challenges and the wide variety of situations to be tackled, classical model-based approaches prove complicated to implement and computationally heavy. Within machine learning (ML), deep reinforcement learning (DRL) has shown great potential for a wide range of applications. The model-free and self-learning properties of DRL make it a promising candidate for autonomous vessels. In this work, a subset of the COLREGs is incorporated into a DRL-based path following and obstacle avoidance system using collision risk theory. The resulting autonomous agent dynamically interpolates between path following and COLREG-compliant collision avoidance in the training scenario, isolated encounter situations, and AIS-based simulations of real-world scenarios.


Introduction
Over the last few years, the promising idea of autonomous ships has gained traction through projects like ReVolt (DNV GL (2020)) and Yara Birkeland (KONGSBERG (2020)). Such research projects are increasingly incentivized as the funding bodies recognize the potential benefits of autonomy at sea. A notable example is the EU-funded four-year project Autoship Horizon 2020, which seeks to speed up the transition towards autonomous ships in the EU (Autoship (2020)). For the first time in history, the promise of lower emissions, higher efficiency, and fewer accidents via autonomy is becoming tangible.
Human error is a leading cause of accidents on the road (Dingus et al. (2016); Thomas et al. (2013)), and reports show that accidents at sea are no different. According to the Annual Overview of Marine Casualties and Incidents published by the European Maritime Safety Agency (EMSA), human error was attributed to over 50% of accidental events between 2011 and 2017 (European Maritime Safety Agency (EMSA) (2018)). In addition to reducing accidents (and thereby fatalities), environmental damage, and costs, autonomous marine operations allow for optimized route planning. This optimization can be done with respect to time spent and fuel costs. Furthermore, autonomous ships can move cargo transport from the road to the sea, leading to less trafficked roads. For instance, the autonomous container ship Yara Birkeland is expected to reduce the number of trips made by diesel trucks by 40,000 a year after its launch in 2020 (Skredderberget (2018)). With the widespread electrification taking place, reduced air pollution is another likely and desirable effect.
An overall reduction of errors from introducing autonomy depends on developing robust and reliable systems, which is no trivial task. For autonomous navigation at sea, the vessel's control system must deal appropriately with a wide range of situations, which depend on the positions of the ownship (OS) and other ships within a certain radius, as well as environmental factors such as wind, ocean currents, and waves. Another crucial element is the detection, classification, and tracking of objects, which might be challenging in certain weather conditions. The currently proposed solutions generally make significant simplifications and assumptions. Low-level controllers, or autopilots, are already commercially available, but more research on high-level path planning and collision avoidance is needed to ensure safe autonomous navigation in real situations. For collision avoidance, compliance with the International Regulations for Preventing Collisions at Sea (COLREGs) is crucial to ensure safety when encountering other vessels.
Due to the complex nature of autonomy at sea, classical model-based methods may be challenging to implement for full autonomy. Modern machine learning (ML) methods are proficient in approximating such complex models. Supervised learning approaches are powerful but limited by their dependency on labeled training data. Reinforcement learning (RL) circumvents this by producing the training data as the decision-making agent interacts with its environment. However, there exists limited research on the combined topic of COLREG-compliant RL controllers. Therefore, this work aims to incorporate a subset of the COLREGs, directly related to collision avoidance, into an autonomous path following and collision avoidance system based on deep reinforcement learning (DRL) conditioned on measures of collision risk. Thus, the research questions are as follows:
• Can we use state-of-the-art collision risk theory to define a novel risk-based approach for implementing COLREGs into an autonomous DRL agent?
• Can the risk-based DRL agent learn to intelligently interpolate between path following and collision avoidance while maneuvering in a COLREG-compliant manner?
Section 2 describes the state-of-the-art in collision avoidance for marine guidance, in which the COLREGs are generally ignored. Section 3 introduces the COLREGs relevant for collision avoidance and essential concepts within guidance and control, collision risk theory, and DRL. Section 4 defines the simulation environments, the DRL problem formulation, and the evaluation methods. Section 5 presents and discusses the COLREG-compliance and general path following and collision avoidance performance of the resulting autonomous controller. Section 6 summarizes the findings and suggests future work.

Background
Collision alert systems (CAS) aid the captain and crew on board a marine vessel. Such systems primarily extend exteroceptive sensors, converting raw measurements into more interpretable information. Examples of CAS are the Automatic Radar Plotting Aid (ARPA) and the Automatic Identification System (AIS) (compared in Lin and Huang (2006)), routinely used for collision risk evaluation (Xu and Wang (2014)). As we move into the fourth industrial revolution, solutions such as digital twins and remote sensing are making their way into the maritime industry (DNV GL (2019)). Decision-making is thus gradually being moved from the cognitive realm into the digital domain, and the need for highly robust and flexible guidance, navigation, and control (GNC) systems is growing. Since collision avoidance (COLAV) systems are responsible for one of the most safety-critical aspects of a vessel's operation, any GNC system operating in a dynamic environment requires a robust COLAV strategy (Aniculaesei et al. (2016)). Therefore, reliable and transparent COLAV systems are crucial to reach full autonomy at sea.
All vessels above 300 tonnes engaged on international voyages, all cargo ships above 500 tonnes, and all passenger ships are required to carry an AIS (International Maritime Organization (IMO) (2020)). The AIS transmits and receives information such as identity, position, course, and speed, which can be incorporated into a COLAV system. Such systems can thus enhance the quality of information about other vessels but may also depend on the communication infrastructure.
Since one cannot expect complete availability, ships typically utilize additional exteroceptive sensors such as cameras, lidars, and radars.Ideally, an autonomous COLAV system uses AIS information without depending on it.
Before autonomous vessels became a possibility, the International Regulations for Preventing Collisions at Sea (COLREGs) were formulated to prevent collisions between two or more vessels (International Maritime Organization (1972)).
Although technological advancement has been significant since their publication in 1972, COLREG-compliance for autonomous vessels is still understudied. One of the main challenges is that the COLREGs were written for humans to interpret and require a translation to a machine-readable and verifiable format. Another potential challenge is the indirect communication that occurs when two vessels meet in a situation with a high risk of collision. For instance, the COLREGs require sharp maneuvers for clear communication between vessels when a high-risk situation is encountered. However, this is often not the optimal behavior from an energy efficiency (or even collision risk) point of view. So long as there may be both human and autonomous operators of marine vessels at sea, the autonomous controller should behave in a way that allows a human-operated vessel to interpret its intent.
In addition to the challenges inherent to the COLREGs, autonomous collision avoidance can be demanding due to the complex dynamics of ships, varying speeds, and changing environmental conditions (Tam et al. (2009)). The majority of the proposed solutions for autonomy make assumptions that do not represent reality. Examples of such assumptions are constant speed of the OS or other ships, good weather conditions, or that the system only operates while the ship is at open sea. An adequate autonomous vessel must master all the situations the current fleet handles. For instance, given sufficient situational awareness, a full-fledged autonomous COLAV system should be expected to handle situations involving all sorts of moving and stationary objects, from container ships to kayaks. For generalization, the system must track a high number of objects simultaneously and perform well in congested waters.
A plethora of COLAV algorithms and architectures for autonomous control have been, and still are, being researched. Here, we distinguish between classical and soft systems (Statheros et al. (2008)). Classical systems find an optimal strategy analytically and numerically from mathematical models and logic, and are typically accompanied by convergence proofs. Model predictive control (MPC) can be used to develop COLAV systems compliant with the primary rules of the COLREGs. MPC can also be applied to nonlinear systems with uncertain environmental disturbances (Soloperto et al. (2019)). The Velocity Obstacle (VO) method (Fiorini and Shiller (1998)) models artificial obstacles representing the velocities that would result in a collision, and Kuwata et al. (2014) show that maritime navigation using the VO method can be COLREGs-compliant. Interval Programming (IvP), a multi-objective optimization approach, has successfully produced COLREGs-compliant COLAV systems (Benjamin et al. (2006); Woerner (2014)). Dynamic Window (DW) is an optimization-based method that has been researched for marine applications (Serigstad et al. (2018)); its strength lies in its focus on fast dynamics, achieved by reducing the search space to the velocities reachable within a short time interval (Fox et al. (1997)).
Based on artificial intelligence (AI), soft systems assume that the problem is not readily quantified. Heuristics are experience-based methods for finding an acceptable solution to a problem. The A* heuristic (Hart et al. (1968)) might be the most well-known and widely used soft approach; A* is a greedy search algorithm for finding the shortest distance between two nodes in a graph, in which a heuristic measure weights the edges between nodes. It is often used for high-level path and trajectory planning, as was done in Eriksen (2019). Another well-known heuristic is the genetic algorithm (GA), based on evolutionary theory. Smierzchalski (1999) applies a genetic algorithm for trajectory planning in an environment with static and dynamic obstacles. Kim et al. (2015) showed that Distributed Tabu Search, a metaheuristic method, can be used for collision avoidance in highly congested areas. Another group of soft systems is machine learning (ML). ML techniques such as deep learning (DL) and reinforcement learning (RL) have recently received significant attention in the context of autonomous systems and decision-making problems, as they benefit from neural networks' currently unmatched function approximation capabilities. Model-free RL methods can find a control law even without any mathematical model of the system (Silver et al. (2021)). However, only a limited amount of research has been devoted to autonomous marine vessels compared to, for instance, driverless cars. In Xu et al. (2017), a deep convolutional neural network (CNN) is trained for COLREGs-compliant collision avoidance for a crewless surface vehicle. This method is based on image recognition, using the CNN's ability to process spatially structured data. The Deep Deterministic Policy Gradient (DDPG) algorithm has demonstrated successful path following and simple collision avoidance for marine vessel models (Martinsen (2018); Martinsen and Lekkas (2019); Vallestad (2019)).
Alternatively, COLAV systems can be classified as deliberative or reactive (Siciliano and Khatib (2008)). Deliberative systems work in a "sense-plan-act" fashion, whereas reactive systems can be considered "sense-act" systems. Hybrid COLAV systems emerge when combining different system categories, e.g., deliberative and reactive systems, an approach used with increasing frequency (Ding et al. (2011)). Multi-layered systems are also being developed, where each subsystem lies on a spectrum between reactive and deliberative. Such hybrid architectures are able to harvest the strengths of several methods, using each where it performs best. Loe (2008) applies a two-layered approach where deliberation is done by a Rapidly-Exploring Random Tree (RRT) algorithm combined with the deliberative A* heuristic, and the reactive component consists of a modified DW algorithm. In Eriksen (2019), A* is combined with a mid-layer and a reactive MPC-based algorithm, forming a three-layered COLAV system. Casalino et al. (2009) and Svec et al. (2013) have proposed similar layered architectures.
In summary, a wide range of COLAV systems have been proposed in the literature, generally disregarding the COLREGs. At the same time, the increased focus on autonomous systems in recent years requires COLREG-compliance for sufficient safety. This gap, combined with the promise of DRL for autonomous navigation, shapes the objective of the article: to investigate COLREG-compliance in a path following and collision avoidance system based on deep reinforcement learning, conditioned on measures of collision risk.

Dynamics of a marine vessel
The dynamical model considered in this work is CyberShip II, a 1:70 scale replica of a supply ship (Skjetne et al. (2004b)). This model is simulated in a calm ocean surface environment under the following assumptions.
Assumption 1 (State space restriction). The vessel is always located on the surface, and thus there is no heave motion. Also, there is no pitching or rolling motion.
Assumption 2 (Calm sea).There are no external disturbances to the vessel, such as wind, ocean currents, or waves.
Following SNAME notation (SNAME, The Society of Naval Architecture and Marine Engineers (1950)), the navigation state vector then consists of the generalized coordinates, η = [x_n, y_n, ψ]^T, where x_n and y_n are the North and East positions, respectively, in the North-East-Down (NED) reference frame {n}, and ψ is the yaw angle, i.e., the current angle between the vessel's longitudinal axis x_b and the North axis x_n, illustrated by Figure 1. Correspondingly, the translational and angular velocity vector ν = [u, v, r]^T consists of the surge (i.e., forward) velocity u, the sway (i.e., sideways) velocity v, and the yaw rate r.

Vessel model
Given the established assumptions, the 3-DOF vessel dynamics can be expressed in the compact matrix-vector form

η̇ = R_z,ψ ν
M ν̇ + C(ν) ν + D(ν) ν = B f

where R_z,ψ represents a rotation of ψ radians around the z_n-axis as defined by

R_z,ψ = [ cos ψ   −sin ψ   0 ]
        [ sin ψ    cos ψ   0 ]
        [   0        0     1 ]

Furthermore, M ∈ R^(3×3) is the mass matrix and includes the effects of both rigid-body and added mass, C(ν) ∈ R^(3×3) incorporates centripetal and Coriolis effects, and D(ν) ∈ R^(3×3) is the damping matrix. Finally, B ∈ R^(3×2) is the actuator configuration matrix. The numerical values of the matrices are found in Skjetne et al. (2004a), where the model parameters were estimated experimentally for CyberShip II in a marine control laboratory.
We disregard the ship's bow thruster and allow only the aft thrusters and control surfaces to be applied by the Reinforcement Learning (RL) agent as control signals. This omission simplifies the RL agent's action space and is further motivated by the bow thrusters' limited effectiveness at higher speeds (Sørensen et al. (2017)). Thus, the control vector, f = [T_u, T_r]^T, consists of the surge force input, T_u, and the yaw moment input, T_r.
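As a hedged illustration, the model above can be simulated with a simple forward-Euler loop. The matrices below are placeholder values chosen for readability, not the experimentally identified CyberShip II parameters from Skjetne et al. (2004a).

```python
import numpy as np

def rotation(psi):
    """Rotation of psi radians about the z_n-axis (yaw only, 3-DOF)."""
    return np.array([[np.cos(psi), -np.sin(psi), 0.0],
                     [np.sin(psi),  np.cos(psi), 0.0],
                     [0.0,          0.0,         1.0]])

def step(eta, nu, f, M, C, D, B, dt):
    """One forward-Euler step of eta_dot = R(psi) nu and
    M nu_dot + C(nu) nu + D(nu) nu = B f."""
    nu_dot = np.linalg.solve(M, B @ f - C(nu) @ nu - D(nu) @ nu)
    eta_dot = rotation(eta[2]) @ nu
    return eta + dt * eta_dot, nu + dt * nu_dot

# Placeholder model (identity mass, no Coriolis/damping effects), for
# demonstration only; f = [T_u, T_r] enters via the configuration matrix B.
M = np.eye(3)
C = lambda nu: np.zeros((3, 3))
D = lambda nu: np.zeros((3, 3))
B = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])

eta, nu = np.zeros(3), np.zeros(3)
for _ in range(100):
    eta, nu = step(eta, nu, np.array([1.0, 0.0]), M, C, D, B, dt=0.1)
```

With the real CyberShip II matrices in place of the placeholders, the same loop could serve as the simulation backbone of an RL training environment.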

COLREG rules
Among the 41 rules in the International Regulations for Preventing Collisions at Sea (International Maritime Organization (1972)), only the rules directly relevant for COLAV are presented below. The two main takeaways from these rules are that 1) the give-way vessel should take early and substantial action, and 2) safe speed should be ensured at all times, such that course alteration is effective towards avoiding collisions where there is sufficient sea-room. Since rules 6 and 8 are particularly hard to quantify, this work focuses on compliance with rules 14-16.
Rule 6: Safe speed Every vessel shall at all times proceed at a safe speed so that she can take proper and effective action to avoid collision and be stopped within a distance appropriate to the prevailing circumstances and conditions.
Rule 8: Action to avoid collision (b) Any alteration of course and/or speed to avoid collision shall, if the circumstances of the case admit, be large enough to be readily apparent to another vessel observing visually or by radar; a succession of small alterations of course and/or speed should be avoided.
(c) If there is sufficient sea-room, alteration of course alone may be the most effective action to avoid a close-quarters situation provided that it is made in good time, is substantial and does not result in another close-quarters situation.
(d) Action taken to avoid collision with another vessel shall be such as to result in passing at a safe distance.The effectiveness of the action shall be carefully checked until the other vessel is finally past and clear.
(e) If necessary to avoid collision or allow more time to assess the situation, a vessel shall slacken her speed or take all way off by stopping or reversing her means of propulsion.
Rule 14: Head-on situation (a) When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision each shall alter her course to starboard so that each shall pass on the port side of the other.
(b) Such a situation shall be deemed to exist when a vessel sees the other ahead or nearly ahead and by night she could see the masthead lights of the other in a line or nearly in a line and/or both sidelights and by day she observes the corresponding aspect of the other vessel.
(c) When a vessel is in any doubt as to whether such a situation exists she shall assume that it does exist and act accordingly.
Rule 15: Crossing situation When two power-driven vessels are crossing so as to involve risk of collision, the vessel which has the other on her own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.

Rule 16: Action by give-way vessel
Every vessel which is directed to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.

Measures of collision risk
The rules presented above are intended for human interpretation and contain ambiguities such as "large enough" (Rule 8) and "substantial action" (Rule 16). How can they be translated into a form suitable for reinforcement learning? An essential first step is recognizing the relationship between the COLREGs and collision risk. The COLREGs are in place to reduce collision risk and indirectly affect the risk level by influencing the probable behavior of the target ship (TS). Since there is a correlation between the rules and the risk level, employing a measure of risk as a proxy for the COLREGs may enable the RL agent to learn COLREG-compliant behavior.
By analyzing the historical trends of measuring collision risk, three main developments can be observed (Xu and Wang (2014)): traffic flow theory, ship safety domains, and collision risk indices.The initial efforts to quantify collision risk were based on traffic flow theory, a method built on empirical studies and statistical traffic analysis in specific waters.
For instance, Cockcroft (1981) investigated the collision rates for ships of varying tonnage relative to their position in a water area. Goodwin (1978) took it further and studied the rate of dangerous encounters. As statistical analysis of historical data was deemed insufficient for dynamic collision avoidance, ship safety domains were introduced. The ship safety domain defines a region around the ship in question that other ships should not enter. Hence, there is a risk of collision if one ship is inside the safety domain of another, and the ship domain can be said to be a generalization of a safe distance (Szlapczynski and Szlapczynska (2017)). When applying the ship domain to an encounter situation in order to determine risk, one of four safety criteria is normally used: 1) the OS domain should not be violated by a TS, 2) a TS domain should not be violated by the OS, 3) neither of the ship domains should be violated, or 4) the ship domains should not overlap, such that they remain mutually exclusive. Rawson et al. (2014) and Wang and Chin (2016) use the latter criterion of non-overlapping ship domains.
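As a minimal sketch of the fourth criterion, assuming circular ship domains (real domains are typically situation-dependent, e.g., shifted towards starboard in certain encounters), mutual exclusivity reduces to a center-distance test:

```python
import math

def domains_overlap(p_os, r_os, p_ts, r_ts):
    """Criterion 4: two circular ship domains overlap when the distance
    between their centers is smaller than the sum of the domain radii."""
    dx, dy = p_ts[0] - p_os[0], p_ts[1] - p_os[1]
    return math.hypot(dx, dy) < r_os + r_ts

# Two vessels with 100 m domains, 150 m apart: domains overlap, so by
# criterion 4 there is a risk of collision.
risky = domains_overlap((0.0, 0.0), 100.0, (150.0, 0.0), 100.0)
```

Elliptical or origin-shifted domains would replace the radius sum with a point-in-region test, but the overlap criterion itself is unchanged.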
It is important to note that a ship domain is usually defined depending on which situation the ship finds itself in, in order to respect the COLREGs. For instance, the domain used while the OS is overtaking another ship is symmetrical, with its origin coinciding with the center of the OS. Conversely, the origin is shifted to the right in a head-on situation, as close encounters on the starboard side should be avoided. Davis et al. (1980) expanded the theory of ship safety domains in their well-known work on ship arenas. The ship arena defines the distances around the OS at which action should be taken to avoid a dangerous encounter and is, therefore, larger than the ship safety domains proposed initially. In addition to the OS's length and velocity, the distance to the closest point of approach (DCPA) and the time to the closest point of approach (TCPA) are used to construct the limits of the ship arena. A geometrical representation of DCPA and TCPA is presented in Figure 2, giving rise to the equations

DCPA = R sin(χ_R − χ_OS − θ_T − π)
TCPA = (R / V_R) cos(χ_R − χ_OS − θ_T − π)

where R is the absolute distance between the OS and TS, and V_R and χ_R are the relative speed and course between them. In addition, χ_OS is the course of the OS, while θ_T is the bearing of the TS relative to the OS.
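The same quantities can also be computed directly from relative position and velocity vectors, which avoids the angle bookkeeping. The sketch below is a vector-form equivalent under the usual straight-line, constant-speed assumption; it is an illustration, not the paper's implementation.

```python
import numpy as np

def cpa(p_os, v_os, p_ts, v_ts):
    """DCPA and TCPA from NED positions and velocity vectors.
    TCPA is the time t minimizing |p_rel + t * v_rel|; DCPA is that
    minimum separation distance."""
    p_rel = np.asarray(p_ts, float) - np.asarray(p_os, float)  # R = |p_rel|
    v_rel = np.asarray(v_ts, float) - np.asarray(v_os, float)  # V_R = |v_rel|
    if np.allclose(v_rel, 0.0):
        return np.linalg.norm(p_rel), float("inf")  # no relative motion
    tcpa = -(p_rel @ v_rel) / (v_rel @ v_rel)
    dcpa = np.linalg.norm(p_rel + tcpa * v_rel)
    return dcpa, tcpa

# A TS 1000 m due North, closing head-on at 10 m/s relative speed:
dcpa, tcpa = cpa((0, 0), (0, 0), (1000, 0), (-10, 0))  # DCPA 0 m, TCPA 100 s
```

A negative TCPA indicates the closest point of approach already lies in the past, which is a common trigger for dismissing an encounter from risk evaluation.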

Deep reinforcement learning
Model-free reinforcement learning (RL) methods train a decision-making agent through trial and error, where the agent gathers experience from an environment supplying only a situational observation state and a corresponding reward. Applications of RL to high-dimensional, continuous control tasks rely heavily on function approximators to generalize over the state space. Even if classical, tabular solution methods such as Q-learning can be made to work (provided a discretization of the continuous action space), this is not considered an efficient approach for control applications (Lillicrap et al. (2015)). In recent years, given their remarkable generalization ability over high-dimensional input spaces, the dominant approach has been the application of deep neural networks optimized using gradient methods; we focus our efforts on this method.

Training environment
DRL-based autonomous agents have a remarkable ability to generalize their policy over the observation space, including the domain of unseen observations. Moreover, given the complexity and heterogeneity of the Trondheim Fjord environment, with archipelagos, shorelines, and skerries (see Figure 3), this ability will be fundamental to the agent's performance. However, the training environment in which the agent evolves from a blank slate to an intelligent vessel controller must be representative, challenging, and unpredictable to facilitate this generalization. If not for the generalization issues associated with behavior cloning (Codevilla et al. (2019)), training the agent on historical AIS data would also be an option. Furthermore, given the resolution of our terrain data, the resulting obstacle geometry is typically very complex, leading to overly high computational demands for simulating the functioning of the distance sensor suite. Moreover, the agent's perceptive observation space (Section 4.2.2) undergoes significant dimensionality reduction, resulting in the agent not benefiting from such high-frequency details in the simulation. Thus, the better choice is to craft an artificial training scenario with simple obstacle geometries. To reflect the dynamics of a real-world marine environment, we let the stochastic initialization method of the training scenario spawn target vessels with deterministic, linear trajectories. Additionally, circular obstacles scattered around the environment substitute for the real-world terrain.

Observation vector
To facilitate the learning of a decision-making policy, the RL agent requires an observation vector, s, containing sufficient information about the vessel's state relative to the path in addition to situational sensor information. The complete observation vector is then constructed by concatenating navigation-based and perception-based features, which formally translates to s = [s_n, s_p]^T. In the context of this paper, we consider the term navigation as the characterization of the vessel's state, i.e., its position, orientation, and velocity, with respect to the desired path. On the other hand, perception refers to the observations made via the rangefinder sensor measurements. In the following, the path navigation feature vector, s_n, and the perceptive feature vector, s_p, are covered in detail.

Navigation features
A sufficiently information-rich path navigation feature vector would be such that it, on its own, could facilitate a satisfactory path-following controller. A few concepts often used in vessel guidance and control are helpful to formalize this. First, we introduce the mathematical representation of the parameterized path, which is expressed as

p_d(ω) = [x_d(ω), y_d(ω)]^T

where x_d(ω) and y_d(ω) are defined in the NED frame. Navigating the path necessitates a reference point, which is continuously updated based on the vessel's position. We define this reference point as the point on the path that has the closest Euclidean distance to the vessel, given its current position, as illustrated in Figure 5. To find this, we calculate the corresponding value of the path variable ω at each time step. This is an equivalent problem formulation because the path is defined implicitly by the value of ω. Formally, this translates to the optimization problem

ω̄ = arg min_ω [ (x_n − x_d(ω))² + (y_n − y_d(ω))² ]

which, using the Newton-Raphson method, can be calculated accurately and efficiently at each time step. We define the corresponding Euclidean distance to the path, i.e., the deviation between the desired path and the current track, as the cross-track error (CTE). Formally, we thus have that

ε = || p_d(ω̄) − [x_n, y_n]^T ||

Next, we consider the look-ahead point, p_d(ω̄ + Δ_LA), to be the point that lies a constant distance further along the path from the reference point p_d(ω̄). Look-ahead based steering, i.e., setting the look-ahead point direction as the desired course angle, is a commonly used guidance principle (Fossen (2011)).
The look-ahead distance, ∆ LA , is set by the user and controls how aggressively the vessel should reduce the distance to the path.
We then define the heading error, ψ̃, as the change in heading needed for the vessel to navigate straight towards the look-ahead point from its position, as illustrated in Figure 5. Formally, ψ̃ is defined as

ψ̃ = atan2( y_d(ω̄ + Δ_LA) − y_n , x_d(ω̄ + Δ_LA) − x_n ) − ψ

where ψ is the vessel's heading and x_n, y_n are the NED-frame vessel coordinates as defined earlier.
However, even if minimizing the heading error will yield good path adherence, taking the path direction at the look-ahead point into account might improve the smoothness of the resulting vessel trajectory. Referring to the first-order path derivatives as x_p(ω̄) and y_p(ω̄), we have that the path angle, γ_p, in general, can be expressed as a function of the path variable ω̄, such that

γ_p(ω̄) = atan2( y_p(ω̄), x_p(ω̄) )

As visualized in Figure 5, the path direction at the look-ahead point is then given by γ_p(ω̄ + Δ_LA). We then define the look-ahead heading error, which is zero when the vessel is heading in a direction parallel to the path direction at the look-ahead point, as

ψ̃_LA = γ_p(ω̄ + Δ_LA) − ψ

Our assumption is then that the navigation feature vector s_n, defined as outlined in Table 1, should provide a sufficient basis for the agent to intelligently adhere to the desired path.

In our setup, the vessel is equipped with N distance sensors with a maximum detection range of S_r, distributed uniformly with 360° coverage. While the area behind the vessel is obviously of lesser importance, e.g., unnecessary to consider when navigating purely static terrain, the possibility of overtaking situations where the agent must react to another vessel approaching from behind makes full sensor coverage necessary.
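To make the guidance quantities concrete, the sketch below computes ω̄ by Newton-Raphson (with a Gauss-Newton approximation of the second derivative), the cross-track error, and the two heading errors. The straight-line path and parameter values are hypothetical, chosen only so the results are easy to check by hand.

```python
import numpy as np

def closest_omega(p_d, dp_d, pos, omega0, iters=20):
    """Newton-Raphson on the stationarity condition of
    min_omega |p_d(omega) - pos|^2."""
    omega = omega0
    for _ in range(iters):
        e = p_d(omega) - pos
        grad = 2.0 * dp_d(omega) @ e
        hess = 2.0 * dp_d(omega) @ dp_d(omega)  # ignores the path-curvature term
        omega -= grad / hess
    return omega

def navigation_errors(p_d, dp_d, pos, psi, omega_bar, delta_la):
    """Cross-track error, heading error, and look-ahead heading error."""
    cte = np.linalg.norm(p_d(omega_bar) - pos)
    la = p_d(omega_bar + delta_la)                    # look-ahead point
    psi_err = np.arctan2(la[1] - pos[1], la[0] - pos[0]) - psi
    d = dp_d(omega_bar + delta_la)
    psi_la_err = np.arctan2(d[1], d[0]) - psi         # gamma_p - psi
    return cte, psi_err, psi_la_err

# Hypothetical straight path along the North axis, vessel 4 m East of it
p_d = lambda w: np.array([w, 0.0])
dp_d = lambda w: np.array([1.0, 0.0])
pos = np.array([3.0, 4.0])
omega_bar = closest_omega(p_d, dp_d, pos, omega0=0.0)
cte, psi_err, psi_la_err = navigation_errors(p_d, dp_d, pos, 0.0, omega_bar, delta_la=2.0)
```

For this linear path the iteration converges in a single step to ω̄ = 3, the CTE is 4, and the look-ahead heading error vanishes because the vessel already heads parallel to the path.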
The most natural approach to constructing the final observation vector would be to concatenate the path information feature vector with the array of sensor outputs. However, initial experiments with this approach resulted in the training process stagnating at an unsatisfactory agent performance level. A likely explanation for this failure is the size of the observation vector, which was fed to the agent's fully connected policy and value networks; as the input size becomes large, the agent suffers from the well-known curse of dimensionality. Due to the resulting network complexity and the exponential relationship between the dimensionality and volume of the observation space, the agent fails to generalize to new, unseen observations intelligently (Goodfellow et al. (2016)). An obvious solution is to reduce the observation space's dimensionality significantly. However, simply reducing the sensor resolution is infeasible, as this would accordingly degrade the agent's situational awareness.
In this work, we partition the sensor suite into D sectors, each of which produces a scalar measurement included in the final observation vector, effectively summarizing the local sensor readings within the sector. However, given our desire to minimize its dimensionality, dividing the sensors into sectors of uniform size is sub-optimal, as obstacles located in front of the vessel are significantly more critical and thus require higher perceptive accuracy than those located at its rear. In order to realize such a non-uniform partitioning, we use a logistic function, a choice that also fulfills our general preference for symmetry. Assuming a counter-clockwise ordering of sensors and sectors starting at the rear of the vessel, we map a given sensor index, i ∈ {1, . . ., N}, to a sector index, k ∈ {1, . . ., D}, according to

κ(i) = ⌈ D · ( σ(γ_C (i/N − 1/2)) − σ(−γ_C/2) ) / ( σ(γ_C/2) − σ(−γ_C/2) ) ⌉

where σ is the logistic sigmoid function, and γ_C is a scaling parameter controlling the density of the sector distribution such that decreasing it will yield a more evenly distributed partitioning. We can then formally define the distance measurement vector for the k-th sector, which we denote by w_k, according to

w_k,i = x_i for i ∈ {1, . . ., N} such that κ(i) = k

Next, we select a mapping f : R^n → R, which takes the vector of distance measurements w_k, for an arbitrary sector index k, as input, and outputs a scalar value based on the current sensor readings within that sector. The feasibility pooling procedure, introduced in Meyer et al.
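One possible realization of this logistic partitioning is sketched below; the exact normalization may differ from the paper's, but the sketch reproduces the described properties: symmetric about the bow, densest (smallest sectors) ahead, coarsest astern, with γ_C controlling the density.

```python
import math

def sector_index(i, N, D, gamma_c):
    """Map sensor index i in {1..N} (counter-clockwise from the rear)
    to sector index k in {1..D}. The sigmoid is steepest near i = N/2
    (the bow), so sectors there cover the fewest sensors."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    lo, hi = sigmoid(-gamma_c / 2.0), sigmoid(gamma_c / 2.0)
    ratio = (sigmoid(gamma_c * (i / N - 0.5)) - lo) / (hi - lo)
    return max(1, math.ceil(D * ratio))

# With N = 180 sensors, D = 9 sectors, and gamma_c = 4, the bow sector
# spans far fewer sensors than the rear sectors.
ks = [sector_index(i, 180, 9, 4.0) for i in range(1, 181)]
```

Letting γ_C tend to zero makes the sigmoid locally linear, recovering a near-uniform partitioning, which matches the role of γ_C described above.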
(2020b), calculates the maximum reachable distance within each sector, taking into account the locations of the obstacle sensor readings and the vessel's width. This method iterates over the sector's distance measurements in ascending order and checks whether it is feasible for the vessel to advance beyond each level. As soon as the broadest available opening within a distance level is deemed too narrow given the vessel's width, the maximum reachable distance has been found. Formally, we define f as the feasibility pooling algorithm, and the resulting perceptive distance observation is summarized in Figure 6. To finalize the processing of distance measurements, we introduce the concept of closeness. An obstacle's closeness is zero if it is at a distance further than S_r away from the vessel and unity if the vessel has collided with the obstacle. Furthermore, within this range, it is reasonable to map distance to closeness in a logarithmic fashion, such that, following human intuition, the difference between 10 m and 100 m is more significant than the difference between, for instance, 510 m and 600 m. Formally, the maximum reachable distance, d, maps logarithmically to closeness, c(d) : R → [0, 1].
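As an illustration, the sector mapping, the feasibility pooling step, and the logarithmic closeness mapping might be sketched as follows. The sigmoid normalization, the arc-width check, and the closeness formula are reconstructions from the description above, not the exact equations of the paper:

```python
import math

def kappa(i, N=180, D=9, gamma_c=0.1):
    """Map sensor index i (1..N) to sector index k (1..D).

    Sensors are ordered counter-clockwise from the rear, so the
    logistic sigmoid concentrates sector boundaries toward the bow
    (i ~ N/2). A hypothetical realization of the sigmoid mapping."""
    sigma = lambda x: 1.0 / (1.0 + math.exp(-x))
    lo, hi = sigma(-gamma_c * N / 2), sigma(gamma_c * N / 2)
    frac = (sigma(gamma_c * (i - N / 2)) - lo) / (hi - lo)
    return min(D, 1 + int(frac * D))

def feasibility_pooling(x, vessel_width=4.0, angle_step=math.radians(4.0)):
    """Maximum reachable distance within one sector (simplified).

    x: the sector's distance readings, ordered by angle. At each
    distance level (ascending), find the widest contiguous run of
    sensors still reading beyond that level; if the arc it spans is
    narrower than the vessel, that level is the maximum reachable
    distance."""
    for level in sorted(x):
        open_mask = [xi > level for xi in x]  # sensors open past this level
        widest = run = 0
        for is_open in open_mask:
            run = run + 1 if is_open else 0
            widest = max(widest, run)
        if widest * angle_step * level < vessel_width:
            return level
    return max(x)

def closeness(d, sensor_range=150.0):
    """Logarithmic closeness in [0, 1]: 1 at collision, 0 beyond range."""
    return max(0.0, min(1.0, 1.0 - math.log(d + 1.0) / math.log(sensor_range + 1.0)))
```

With this non-uniform partitioning, forward-facing sectors contain fewer sensors each and therefore retain more detail after pooling.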

Motion detection
The maximum reachable distance in a sector may equal the maximum sensor range even though there is an obstacle in that sector. Thus, by applying the feasibility pooling algorithm to reduce the dimensionality of the rangefinder suite, the resulting closeness observation may fail to inform the RL agent about nearby obstacles. To make the agent aware of nearby moving obstacles, we incorporate the velocities of the nearest obstacle in each sector into the observation vector. Admittedly, while this implementation is trivial in a simulated environment, a real-world implementation will necessitate a reliable way of estimating obstacle velocities from sensor data. However, even though this can be challenging due to uncertainty in the sensor readings, object tracking is a well-researched computer vision discipline. We reserve the implementation of such a method for future research but refer the reader to Granstrom et al. (2016) for a comprehensive overview of the current state of this field.
Specifically, the decomposition, which yields the x and y components of the obstacle velocity, is performed in the coordinate frame whose y-axis is parallel to the centerline of the sensor sector in which the obstacle is present. Thus, we provide the decomposed velocity of the closest moving obstacle within each sector as features for the agent's observation vector. For each sector k, we denote the corresponding decomposed x and y velocities as v_x,k and v_y,k, respectively. Naturally, if no moving obstacles are present within the sector, both components are zero.
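A sketch of such a decomposition, assuming the sector centerline heading is known and using an illustrative rotation convention (not necessarily the one used in the paper):

```python
import math

def decompose_velocity(v_north, v_east, sector_heading):
    """Project an obstacle's planar velocity onto a sector frame.

    The sector frame's y-axis is aligned with the sector centerline
    (sector_heading, radians); x is perpendicular to it. Names and
    frame conventions here are assumptions for illustration."""
    v_y = v_north * math.cos(sector_heading) + v_east * math.sin(sector_heading)
    v_x = -v_north * math.sin(sector_heading) + v_east * math.cos(sector_heading)
    return v_x, v_y
```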

Perception state vector
By concatenating the closeness of the maximum reachable distance and the decomposed obstacle velocity for each sector, we then define the perception state vector, s p , as
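The concatenation itself is straightforward; a minimal sketch, assuming a per-sector layout of closeness followed by the two velocity components (the ordering is an assumption):

```python
def perception_state(closeness_per_sector, vx_per_sector, vy_per_sector):
    """Concatenate per-sector closeness and decomposed obstacle
    velocities into the perception state vector s_p (3 features per
    sector). Layout is illustrative, not taken from the paper."""
    s_p = []
    for c, vx, vy in zip(closeness_per_sector, vx_per_sector, vy_per_sector):
        s_p.extend([c, vx, vy])
    return s_p
```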

Risk-based implementation of COLREGs
In model-free RL, the trained agent will assume a policy that maximizes the expected reward. To lead this policy to adhere to the COLREGs, we must incorporate them into the reward function. As previously mentioned, the rules are ambiguous and cannot be implemented explicitly. Instead, we use collision risk indices (CRIs) as analogs, and the following motivates how they are intended to guide the RL agent towards COLREG-compliance.

Risk-based reward function
Building on the theory presented in Section 3.2, a collision risk index (CRI) is calculated using fuzzy evaluation. Here, this translates to a weighted sum of evaluated risk factors, a method described in detail in Section 4.3.2. This method encapsulates the continuous and fuzzy nature of collision risk, making it a convincing choice for translating the COLREGs into a DRL-based framework. Collision risk is typically only applied to encounter situations between two dynamic objects, and the collision risk index presented here is no exception. Thus, the reward components for path following, static obstacle avoidance, the collision penalty, and the living penalty must be defined separately. The corresponding components from a previous approach (Meyer et al. (2020a)) are applied here due to their excellent path-following and obstacle avoidance results.
The reward components for path following and static obstacle avoidance are given in Equations (14) and (15), while the collision and living penalties are negative constants. As a result, the total reward function retains the same structure, reiterated in Equation (16), except for an added risk-based penalty for dynamic obstacles (r_colav,dyn).
The penalty for dynamic obstacles forms part of the overall collision avoidance penalty, denoted r_colav and given by r_colav = r_colav,dyn + r_colav,stat.
For every TS within the OS's sensor range, a collision risk index (CRI) ∈ [0, 1] is calculated (see Section 4.3.2). Since the CRI increases with collision risk, it can be used almost directly in the reward function. By multiplying the CRI_i of each target vessel, i, by a scaling factor β_CRI > 0, the penalty level can be weighted relative to the rest of the reward function:
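The resulting penalty term can be sketched as below; the value β_CRI = 2.0 is a placeholder for illustration, not the value used in the paper:

```python
def r_colav_dyn(cri_values, beta_cri=2.0):
    """Risk-based penalty for dynamic obstacles: the sum of the CRIs
    of all target ships within sensor range, scaled by beta_CRI > 0
    (placeholder value, assumed for illustration)."""
    return -beta_cri * sum(cri_values)
```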

Calculating the collision risk index
In order to determine the collision risk in an encounter situation, one must first define what constitutes a collision risk and how much each risk factor contributes to the overall risk. The state-of-the-art methods of computing CRIs generally use fuzzy evaluation (Xu and Wang (2014)), making it a natural choice here too. In short, three steps should be followed:
1. Define individual risk factors.
2. Define membership functions.
3. Design the overall CRI as a function of the membership functions.
The chosen risk factors and their membership functions are elaborated on in the following, leading up to the CRI function design.
A common starting point for defining risk is looking at the distance and time to the closest point of approach, denoted DCPA and TCPA. As the descriptive name suggests, the closest point of approach (CPA) is the closest point, relative to the OS, that the TS in question will reach, given that the relative course and relative velocity between the two ships stay the same. The DCPA, then, is the distance to the CPA, while the TCPA is the time until the TS arrives at the CPA. Put differently, the DCPA quantifies the severity of a potential collision situation, while the TCPA quantifies its urgency. When determining the risk level associated with these quantities, it is customary to employ upper and lower bounds, denoted d_L and d_U for DCPA and t_L and t_U for TCPA. The values for the lower and upper bounds depend largely on the application. In general, d_L defines the minimal safe encounter distance, and d_U is the absolute safe encounter distance (Gang et al. (2016)). For DCPA, the membership function is defined with d_L and d_U as positive integers and is presented graphically in Figure 7.
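For concreteness, DCPA and TCPA follow from the relative position and velocity of the TS, and the DCPA membership can be sketched with a quadratic (second-order) transition between the bounds, as described in the text accompanying Figure 7. Treat the exact quadratic form as an assumption:

```python
import math

def cpa(rel_pos, rel_vel):
    """DCPA and TCPA from the TS's position and velocity relative to
    the OS (2D planar frame). If there is no relative motion, the
    current separation is returned with TCPA = 0 by convention."""
    px, py = rel_pos
    vx, vy = rel_vel
    v2 = vx * vx + vy * vy
    if v2 == 0.0:
        return math.hypot(px, py), 0.0
    tcpa = -(px * vx + py * vy) / v2      # time of closest approach
    dcpa = math.hypot(px + vx * tcpa, py + vy * tcpa)
    return dcpa, tcpa

def u_dcpa(dcpa, d_l=320.0, d_u=1500.0):
    """Second-order DCPA membership: unity below d_L, zero above d_U,
    with an assumed quadratic transition in between."""
    a = abs(dcpa)
    if a <= d_l:
        return 1.0
    if a >= d_u:
        return 0.0
    return ((d_u - a) / (d_u - d_l)) ** 2
```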
For the bounds on TCPA, the method used in Gang et al. (2016) and presented in Equation (20) is employed. Doing so adjusts the output of u_TCPA according to the distance between the OS and TS, accurately capturing the high risk when the distance is below or close to the lower bound d_L and the low risk when it is closer to the upper bound d_U. It is assumed that DCPA never exceeds d_U, meaning that d_U is set to the maximum detectable DCPA.
In Gang et al. (2016), equal importance is given to positive and negative values of TCPA through the corresponding membership function. However, noting that negative values of TCPA indicate that the OS and TS have already passed each other, it makes sense to pay attention to the sign of TCPA. This is supported by Park et al. (2006), where a fuzzy case-based reasoning system for collision avoidance is proposed. In their work, the TCPA membership function in Figure 8 is applied, indicating the significantly higher risk associated with positive values of TCPA. Following this line of reasoning, a distinction between positive and negative values of TCPA is made according to Equation (22). The cut-off value for negative values (negative limit) was chosen as t_NL = d_L / v_R, such that the degree of membership is larger than zero whenever the OS is less than t_NL time units away from the TS. The membership function for TCPA is plotted in Figure 9.

Further, the collision risk depends on the position of the TS relative to the OS, which can be expressed through the absolute distance, R, between them and the bearing angle of the TS, θ_T. Since the risk is higher on the starboard side of the OS, as expressed in Rule 14 (head-on situation) of the COLREGs, the membership functions should be designed with a bias on that side. Inspired by Davis et al. (1980), it is customary to introduce a bias of 19° to starboard. Davis developed the concept of the ship arena, briefly described in Section 3.2, and designed a corresponding scaling of the upper bound, R_D, while the lower bound is usually 12 times the OS length L_pp (Gang et al. (2016)) but is set to 8 L_pp here due to the smaller scale. Initially, the upper bound given by R_D was implemented, but it quickly became apparent that adjustments had to be made to ensure that the agent received a sufficiently negative reward when approaching TSs, regardless of their bearing angle. The difference in scaling, a factor of 4.4 between ships detected at 19° and at 161° (180° − 19°), was too large considering the relatively densely populated training and testing scenarios and a restricted sensor range of 1500 m. Through testing, it was observed that the distance membership function could be made uniform while still preserving the correct behavior in head-on situations, as long as the membership function for the bearing angle, θ_T, was given enough weight. As a result, the lower and upper bounds for the absolute distance, R, were chosen with β_RL and β_RU as appropriate scaling constants. Following the logic applied to the membership functions for TCPA and DCPA, we arrive at the membership function for the absolute distance between the OS and TS.

To encourage the appropriate behavior in head-on situations, the membership function for the bearing angle of the TS relative to the OS should be largest on the starboard side. Defining θ_PU, θ_PL, θ_NU, and θ_NL as the positive upper, positive lower, negative upper, and negative lower bounds on θ_T, the membership function for the bearing angle is illustrated in Figure 11. After implementing a CRI containing the four membership functions introduced so far, it became clear that an element had to be added to the CRI to deter the OS from crossing ahead of a TS. Since whether the OS is ahead of the TS can be quantified by the TS's speed towards the OS, which is readily available in the observation vector (v_y and v_x), an additional membership function is designed. Hence, we define u_V(·) as the ratio of the TS's speed towards the OS to its absolute speed, as described in Equation (27). Such a ratio was chosen to avoid issues with differences in speed among the TSs, which could quickly have arisen if the numerical value of v_y had been used instead. On the other hand, it might be desirable to distinguish between crossing-ahead ships traveling at different speeds, as faster ships naturally pose a higher risk. However, this is considered outside the scope of this work. It is worth noting that u_V(·) is negative when v_y is negative, emphasizing the advantage of astern crossings.
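The speed-ratio membership of Equation (27) can be sketched directly from the description above (the function name and the zero-speed convention are illustrative):

```python
import math

def u_v(v_x, v_y):
    """Crossing-ahead membership: ratio of the TS's speed toward the
    OS (v_y) to its absolute speed. Negative when the TS is moving
    away, reflecting the lower risk of astern crossings."""
    speed = math.hypot(v_x, v_y)
    return 0.0 if speed == 0.0 else v_y / speed
```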
Integrating the introduced membership functions into a collision risk index, the CPA composite term was designed such that a combination of low values for both DCPA and TCPA gives rise to a high CRI. It also accurately expresses how a low membership value for either DCPA or TCPA significantly reduces the overall risk. The max-function is applied to ensure that the CRI is always larger than or equal to zero.
Finally, values are assigned to the weights such that their sum equals unity. In this work, the parameter values specified in Table 2 are used. Initial choices were made based on values suggested in the literature (Chen et al. (2014); Yan (2002)), emphasizing DCPA and TCPA. However, it was discovered that more weight had to be placed on the target bearing angle, absolute distance, and approaching velocity to achieve the desired behavior. The configuration of the path-following and static obstacle rewards listed in Meyer et al. (2020a) has been applied in this work.
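Putting the pieces together, the weighted-sum CRI can be sketched as follows. The CPA term uses the product of the two memberships, as motivated above, and the weights are illustrative placeholders that sum to unity, not the values of Table 2:

```python
def cri(u_dcpa, u_tcpa, u_r, u_theta, u_v,
        w_cpa=0.4, w_r=0.25, w_theta=0.2, w_v=0.15):
    """Weighted-sum CRI sketch. The CPA composite term is the product
    u_DCPA * u_TCPA, so a low value of either membership suppresses
    it; max(., 0) keeps the index non-negative, and the weights are
    assumed placeholder values."""
    score = (w_cpa * u_dcpa * u_tcpa + w_r * u_r
             + w_theta * u_theta + w_v * u_v)
    return max(0.0, min(1.0, score))
```

Note how a negative u_V (astern crossing) lowers the index, while the clamping keeps the CRI within [0, 1] as required by the reward function.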

Performance evaluation
A three-step evaluation process is employed to assess the performance of the RL agent. First, the agent's behavior and performance in the training environment are assessed, and snippets from situations relevant to Rules 14-16 of the COLREGs are presented. Next, two-vessel testing scenarios are constructed to test specifically for COLREG-compliance. Lastly, the agents are evaluated in AIS-based environments. These modes of assessment are described individually in the following subsections.
Performance in the training environment
A natural starting point for performance evaluation is assessing the agent's behavior in its training environment. The overall performance can be evaluated by collecting statistics on the collision rate, level of path completion, and reward. These statistics serve as a guide for when to stop the training and as a point of comparison between approaches. Moreover, a qualitative assessment is made by observing the agent's behavior through video recordings. Snippets are chosen from the videos to highlight the behavior in situations where the COLREGs apply. This is not always the case, since the training environment often presents the agent with difficult situations containing various static and dynamic obstacles, which cannot be accurately subjected to the COLREGs.

Testing of COLREG-compliance
The next step in the testing process is subjecting the agent to scenarios specifically designed to capture COLREG-compliance. This is especially useful since it is challenging to find scenarios that perfectly showcase COLREG-compliance in the training environment. However, the agent's success can easily be quantified through simpler two-vessel scenarios. One scenario to be tested is self-evident, namely the head-on scenario. In addition, two different crossing situations, one from the starboard and one from the port side, were chosen.
For each scenario, the TSs' initial angles and path angles are varied slightly within a range of ±5° of the default angles, which allows for an accumulation of statistics on the success rate in the respective scenarios. It should be noted that the target ships have been modeled exclusively as large vessels in the testing scenarios, both to reflect the size of the large ships encountered in the AIS-based scenarios and for visual clarity.

AIS-based testing
Lastly, three environments based on real-world high-fidelity terrain data are used to assess the generalization performance of the agent. These environments were developed by Meyer (2020) using AIS tracking data and terrain data from the Trondheim Fjord area, and they are distinctly different from one another. In the following illustrations, a dashed black line represents the desired OS trajectory. Each TS is drawn at its initial position, and trajectories are drawn as dotted red lines. Note that these are examples of spawned environments: a set number of target ships is drawn from the AIS database each time an instance of the specific scenario is created. Additionally, the apparent density of TS trajectories does not directly reflect the number of encounters, as this depends on the speed of each vessel.
The first AIS-based scenario is the Trondheim scenario (Figure 15b), in which the agent is required to cross a fjord of width ∼12 km while following a straight path. In doing so, it mainly meets crossing traffic consisting of larger vessels. In the challenging Ørland-Agdenes scenario (Figure 15a), the agent encounters two-way traffic in a narrow fjord entrance region. It must blend into the heavy traffic to complete the path while avoiding head-on collisions. In addition, the ability to overtake other vessels is assessed. As in the Trondheim scenario, the vessels are primarily bigger than the OS. Lastly, the Froan scenario (Figure 15c) offers demanding terrain with hundreds of small islands. As a result, it tests the ability of the agent to generalize to a challenging environment with a high density of static obstacles of varying shapes and sizes.
The area is less trafficked, and the vessels encountered are physically similar to the OS.

Results and discussion
In this section, the results from the risk-based implementation of the COLREGs are presented and evaluated. First, the RL agent is evaluated in the synthetic training environment, considering its general path-following and collision avoidance performance. Second, the presence and consistency of COLREG-compliant behavior are assessed in isolated, high-risk encounters. Finally, the agent is presented with the simulated real-world AIS-based scenarios to see how the learned policy generalizes to complex and unseen situations.

Training and testing in the synthetic environment
After training the RL agent in the synthetic environment (Figure 4) for approximately 4000 episodes, its collision rate dropped to near zero, and the progress rate rose to 100%. Snippets from the training environment are included in Figure 12, showcasing training scenarios in which the agent behaves in a COLREG-compliant manner. The COLREGs clearly define these situations: passing on the right in head-on situations, slowing down and passing astern instead of ahead, and allowing space between itself and the TS during overtaking.
Although the training statistics indicate the agent's ability to navigate and avoid collisions, they do not reveal whether the COLREG-compliance is consistent, which must be evaluated separately.

Testing of COLREG-compliance
The next step in the evaluation process is COLREG-compliance testing through repetitive testing in different encounter scenarios. Figure 13 shows how the agent avoids collision in a COLREG-compliant manner. In addition, the agent follows the path well once the encounter has passed. Repetitive testing reveals that these results are stable, as the correct behavior was seen in 100% of the episodes for each testing scenario, as summarized in Table 3. These results indicate that the agent can intelligently interpolate between path following and COLREG-compliant collision avoidance in isolated high-risk encounters. However, there is no guarantee that this behavior translates to more complex scenarios.

Testing in AIS-based environments
Finally, the risk-based agent is assessed in AIS-based real-world environments to find out how well it generalizes to previously unseen scenarios. As these environments are modeled using real-world terrain mapping and AIS traffic data, the agent will likely encounter complex scenarios where the COLREGs do not clearly define the correct behavior. Therefore, we do not expect the agent always to find a COLREG-compliant solution. The agent's excellent static obstacle avoidance and COLREG-compliant behavior are highlighted in Figure 14. Note that the static obstacles in Figure 14d are significantly smaller than those encountered in the training scenario.
Lastly, trajectories from each environment are presented in Figure 15. These trajectories illustrate the agent's ability to dynamically follow a predetermined path in the face of static and moving obstacles. Whether the agent is faced with heavy two-way parallel traffic (Figure 15a), crossing traffic (Figure 15b), or an untraversable path (Figure 15c), it adapts to the situation and finds a suitable solution. Thus, the agent generalizes its decision-making policy from the synthetic and stochastic training environment to previously unseen environments.

Conclusion
The primary objective of this work was to investigate whether COLREG-compliance in a path following and collision avoidance system based on model-free DRL is possible using a risk-based approach. In summary, we found that

• using state-of-the-art collision risk theory, the ambiguity of the COLREGs (Rules 14-16) can be circumvented by conditioning the decision-making agent on collision risk indices as a proxy for the COLREGs;

• the agent learned complex rules by training in a stochastic and synthetic training environment, and this learning translated well into real-world testing environments. Moreover, the approach produced COLREG-compliant behavior when tested in isolated encounters.
As described in Section 4.2.2, the agent perceives its surroundings by summarizing the information from N distance sensors into D sectors using feasibility pooling. To ensure the detection of nearby target ships, the agent additionally reads the decomposed velocity of the nearest TS in each sector directly. This approach leads to significant information loss, as the agent can observe neither high-frequency details, such as the size of a TS, nor whether there are multiple obstacles in a single sector. The curse of dimensionality, which motivates the feasibility pooling algorithm, can potentially be avoided by applying a convolutional neural network (CNN). Unlike fully connected networks, CNNs directly utilize spatial information in structured data. Thus, a CNN can exploit the strong correlation between neighboring sensor measurements, given that the resolution is high enough. Furthermore, by stacking multiple observations temporally, the observation contains sufficient information for the agent to infer any obstacle's relative velocity and acceleration from the distance measurements alone, removing the need for object tracking and velocity estimation methods.
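To illustrate the weight sharing a CNN layer would bring, a minimal 1D convolution over a raw sensor array (pure Python, framework-free, purely illustrative; a real agent would use a deep learning framework):

```python
def conv1d(signal, kernel):
    """Minimal 1D convolution with valid padding: the same small
    kernel slides across neighboring rangefinder measurements,
    exploiting their spatial correlation instead of flattening the
    array into a fully connected layer."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]
```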
Additionally, the COLREGs must be adapted for machine readability and interpretability, clearly defining the required behavior in different environments and situations. Without doing so, it will be impossible to accurately assess the success of an autonomous vessel and claim that it is fully COLREG-compliant. Nevertheless, the results obtained in this work demonstrate the potential of DRL to handle COLREG rules through risk-based conditioning. Thus, once the COLREGs are modernized for digital applications, DRL can be expected to produce COLREG-compliant autonomous collision avoidance systems.

Figure 1 :
Figure 1: Illustration of the NED and body coordinate frames.

Figure 3 :Figure 4 :
Figure 3: Snapshot of the marine traffic from 01.01.2020 to 06.02.2020 in the Trondheim fjord, based on AIS data. Each red line represents a recorded travel.

Figure 5 :
Figure 5: Illustration of key path-following concepts in vessel guidance and control. The path reference point, p_d(ω), describes the point on the path with the closest Euclidean distance to the vessel, while the look-ahead reference point, p_d(ω + ∆_LA), is located a distance ∆_LA further along the path.

Figure 6 :
Figure 6: Rangefinder sensor suite containing N distance sensors, partitioned into D sectors (black dashed lines) according to the mapping function κ. The dashed edges illustrate the maximum reachable distance in each sector, as calculated by the feasibility pooling algorithm. The perceptive distance component of the RL agent's observation space consists of the closeness mapping of these distances.
The bounds are denoted d_L and d_U for DCPA, and t_L and t_U for TCPA. With these bounds, the membership functions u_DCPA and u_TCPA output unity (highest risk level) whenever |DCPA| ≤ d_L and |TCPA| ≤ t_L, respectively. Conversely, their outputs are zero when |DCPA| ≥ d_U and |TCPA| ≥ t_U. As was done in Gang et al. (2016), a second-order function is used between the two extremities; Chen et al. (2014) use a sinusoidal function instead. Although the latter has the virtue of being smooth, it was deemed inexpedient due to its large outputs over a wide interval of values, overshadowing other elements of the CRI. Since the sensor range used in this work is relatively short (1500 m), the steeper second-order function improved learning. It is worth noting that the sinusoidal function may be better suited to a setup with fewer obstacles and vessels where AIS data from a larger region is used.

Figure 7 :
Figure 7: Membership function for DCPA with d_L = 320 m and d_U = 1500 m.

Figure 9 :
Figure 9: Membership function for TCPA with d_L = 320 m, d_U = 1500 m, and v_R = 1 m/s.

Figure 10 :
Figure 10: Membership function for distance to the target ship, with θ_T = 0°.

Figure 12 :
Figure 12: Risk-based agent performing common naval collision avoidance maneuvers in the training environment. The agent's trajectory is drawn as a blue dashed line, and the target ships with trajectories are drawn in red. The dotted vessel outlines show their positions 100 time steps prior.

Figure 13 :
Figure 13: Agent behavior in COLREG-compliance test scenarios. The agent's trajectory is drawn as a blue dashed line, and the target ships with trajectories are drawn in red. The dotted vessel outlines show their positions 100 time steps prior to the present time.

Figure 14 :
Figure 14: Risk-based agent performing common naval collision avoidance maneuvers in the AIS-based environment. The agent trajectory is drawn as a blue dashed line, and the target ships are drawn in red. The dotted vessel outlines show their positions 100 time steps prior to the present time.
Risk-based agent's trajectory in the Trondheim test scenario: the agent exhibits excellent path following and is seen to maneuver through the crossing traffic before returning to the path and reaching the goal. Risk-based agent's trajectory in the Froan test scenario: when presented with an impossible path, the agent intelligently finds another solution for navigating the dense archipelago and merges into the parallel traffic.

Figure 15 :
Figure 15: Trajectories from three different AIS-based environments, drawn as blue dashed lines. Target ships and their trajectories are drawn in red.

Table 1 :
Path-following feature vector s_n at timestep t.

Using a set of rangefinder sensors as the basis for obstacle avoidance is a natural choice, as it yields a comprehensive yet intuitive representation of any neighboring obstacles. This configuration should also enable a relatively straightforward transition from the simulated environment to a real-world one, given that rangefinder sensors such as lidars, radars, sonars, and depth cameras are commonly used.

Table 2 :
Reward configuration for the risk-based approach.

Table 3 :
Results from repetitive testing of COLREG-compliance with slightly varying scenarios, 100 episodes.