A Big Data Analytics Method for the Evaluation of Ship-Ship Collision Risk reflecting Hydrometeorological Conditions
Reliability Engineering and System Safety

This paper presents a big data analytics method for the evaluation of ship-ship collision risk in real operational conditions. The approach makes use of big data from the Automatic Identification System (AIS) and nowcast data, corresponding to time-dependent traffic situations and hydro-meteorological conditions respectively. An Avoidance Behavior-based Collision Detection Model (ABCD-M) is introduced to identify potential collision scenarios, and Collision Risk Indices (CRIs) are quantified when evasive actions are taken for each detected collision scenario in various voyages. The method is applied to Ro-Pax ships operating over 13 months of the ice-free period in the Gulf of Finland. Results indicate that collision risk estimates may be extremely diverse among voyages, and that in 97.5% of potential collision scenarios the evasive actions are triggered only when risk is at 45% or more of its maximum value. The overall CRI for ships operating over the given area tends to be lower for adverse hydro-meteorological conditions. It is therefore concluded that the proposed method may assist with (1) the identification of critical scenarios in various voyages not currently accounted for by existing accident databases, (2) the definition of commonly agreed risk criteria to set off alarms, and (3) the estimation of the risk profile over the life cycle of fleet operations.


Introduction
Ship collisions and groundings are the most frequent maritime traffic accidents globally [38]. They often result in unwanted and devastating consequences such as oil spills, severe ship flooding or loss of human life [33]. Their effect is especially critical for passenger shipping operations [52]. To mitigate risks associated with such events it is necessary to develop maritime risk management tools.
Ship trajectory data streams incorporate multiple parameters related to static voyage features (e.g., departures/destinations, voyage length) and dynamic navigation features (e.g., speed, course, motion parameter variation, and ship trajectory spatial distance). However, it may be challenging to handle all available information using the labels delivered with AIS data (i.e., MMSI, IMO number, call signs) [90]. An alternative could be to use unsupervised machine learning and apply clustering analysis to classify complex traffic scenarios, preferably in real hydro-meteorological conditions. As typical unsupervised machine learning methods, clustering algorithms can automatically cluster ship traffic data by similarity measurements. They can be classified into three groups, namely: (a) distance partition methods (e.g., the K-means algorithm; see [8,90,95]); (b) hierarchy methods (e.g., Balanced Iterative Reducing and Clustering using Hierarchies, BIRCH; see [43,92]); (c) density methods (e.g., Density-Based Spatial Clustering of Applications with Noise, DBSCAN; see [63,93]). A suitable selection of one of those could help classify ship trajectories and detect anomalies based on the maneuvering behavior of ships under real operational conditions (e.g., [5,8,10,43,62,69,97]). Distance partition methods have been adopted in ship trajectory clustering due to their high efficiency [90]. Hierarchical methods suffer from the fact that once a merge or split is done, it is not reversible [43]. Density methods are well suited to clustering ship trajectories with arbitrary shapes [5,63]. To date, the mentioned algorithms have been successfully used to cluster simple ship trajectories in open seas. Nevertheless, they fail in restricted waters where operational paths are more complex (e.g., [5,43,97]).
This is because it is challenging to handle all available information of complex ship trajectories delivered from AIS data using a single algorithm.
Therefore, it is desirable to develop a big data analytics method for the evaluation of ship-ship collision risk in various voyages using now-cast data and AIS data, by recovering detailed time-dependent traffic situations and the hydro-meteorological conditions at those times. This would allow insight to be gained into collision risk reflecting real operational conditions, as well as exploring the time to trigger evasive actions in various voyages [53].
This paper introduces a data mining method for ship collision avoidance behavior. The method detects collision scenarios based on clustered ship trajectories encompassing AIS and hydro-meteorological big data streams at the time of collision avoidance maneuvers on various routes (see Section 2). Subsequently, the time at which evasive actions are taken is analyzed using a multi-criteria-based CRI. The practical application of the approach is demonstrated using data covering a 13-month ice-free period in the Gulf of Finland, considering all large RoRo/Passenger ships (RoPax) (46,124 GT > Gross tonnage > 10,000 GT; 218.8 m > Length > 120 m) as the struck ships (see Section 3). The paper concludes on the potential of the method to support the development of intelligent decision support systems that mitigate collision risk by inspecting traffic patterns in various voyages and ship-ship collision risk (see Section 4).

Machine learning methods
AIS is an automatic tracking system that may be used to identify and locate ships through data exchange with nearby ships, AIS base stations and satellites. The use of this system has been required by the International Maritime Organization (IMO) since 2004, and to date transponders have been installed in more than 400,000 ships. AIS big data streams contain multiple parameters related to static voyage features (e.g., departures/destinations, voyage length) and dynamic navigation features (e.g., speed, course, motion parameter variation, and ship trajectory spatial distance). Although IMO numbers and call signs can be used as labels to separate the ship trajectories (STs) of various ships, existing methods do not offer automatic means of clustering ship trajectories in various voyages. This is because it is difficult to derive labels that fully capture both the static voyage and the dynamic navigation features of STs in real environmental conditions and complex traffic scenarios. Thus,  [63]. Clustering algorithms, as typical unsupervised machine learning methods, can automatically cluster ship trajectories through similarity measurements of ship trajectory features. However, massive and complex ship trajectories in restricted waters are difficult to cluster in detail using a single algorithm. To evaluate ship-ship collision risk in various voyages together with the associated hydro-meteorological data in time-dependent traffic scenarios, the ship trajectories of struck ships should be classified in more detail, according to their similarity in both static voyage features and dynamic navigation features. With the latter in mind, K-means and DB-SCAN are selected and employed in this work to cluster STs. This is because the K-means algorithm is highly efficient in clustering ship trajectories using static voyage features, whereas DB-SCAN is well suited to clustering ship trajectories using dynamic navigation features.
Accordingly, complex ship trajectories can be clustered in more detail by combining K-means and DB-SCAN.

K-means algorithm
K-means is a clustering algorithm that partitions data points into groups based on Euclidean distances (e.g., [90]), as presented in Table 1. It is easy to understand and implement, and it can handle large datasets. It requires clear specification of the desired number of clusters, which is easy to determine based on static voyage features (e.g., departure and destination points, voyage length). However, it may be sensitive to the number of clusters and to the presence of noise in big data streams (e.g., outlying points in the trajectories as explained in Section 2.2.1). In K-means, similarity denotes the measured degree of likeness between trajectories. Accordingly, two STs are similar if their departure, destination, and voyage length are similar. The K-means algorithm can be efficiently used to cluster trajectories of ships navigating on a specific voyage route (i.e., between the same departure and destination points).
However, even when ships navigate on a specific voyage route, their dynamic navigation features (e.g., speed, course, motion parameter variation, and ship trajectory spatial distance) may be diverse. Clustering tests show that if more than three parameters (departure and destination points, voyage length) are considered for ST clustering, K-means does not perform well. This is because it is difficult for the K-means algorithm to handle all available information (both static voyage features and dynamic navigation features) of complex ship trajectories. Thus, dynamic navigation features should also be mined, using DB-SCAN after K-means, to explore the differences between ship trajectories.
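The static-feature clustering described above can be illustrated with a minimal sketch of the standard K-means (Lloyd) iteration in plain Python. This is not the authors' implementation: the function name `kmeans`, the deterministic choice of initial centroids, and the three-component feature tuples (standing in for the three static similarity parameters) are assumptions made for illustration only.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd iteration; assumes len(points) >= k.

    Initial centroids are k evenly spaced input points (an illustrative,
    deterministic choice, not taken from the paper).
    """
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical static features per voyage:
# (departure offset, destination offset, voyage length).
voyages = [(0.0, 0.0, 80.0), (1.0, 0.0, 82.0), (0.0, 80.0, 81.0),
           (1.0, 79.0, 80.0), (0.0, 1.0, 200.0), (1.0, 0.0, 198.0)]
labels, centroids = kmeans(voyages, k=3)
```

In this toy input the six voyages fall into three route groups of two voyages each; in practice the number of clusters would follow from the distinct departure/destination pairs, as noted in the text.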

DB-SCAN algorithm
In contrast to the K-means method, which applies to static point datasets, DB-SCAN is an algorithm that forms data clusters based on regular and irregular dense data. Those data may be associated with the dynamic navigation features that remain after K-means clustering. However, DB-SCAN may not work well with the static voyage features (distance point datasets) of STs. This is the reason why both the K-means algorithm and the DB-SCAN algorithm are used to cluster STs in this paper. In the process of DB-SCAN clustering, data are divided into three categories, namely: (a) core, (b) border, and (c) noise, the last being associated with low-density data streams [93]. The algorithm does not require the number of clusters to be specified in advance, as presented in Table 2. STs are similar if their voyage/navigation features and spatial distance have similar data densities (see Section 2.2.1). The DB-SCAN algorithm is therefore employed, after K-means clustering, to cluster STs on the same voyage route that have similar motion parameters, such as speed, course, and their variations, as well as similar spatial trajectory distance between the same departure and destination points (see ST3,4 and ST5,6 after DB-SCAN clustering in Fig. 3).
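The core/border/noise distinction can be made concrete with a small, self-contained sketch of the textbook DBSCAN procedure. It is an illustration under stated assumptions rather than the configuration used in the paper: the function name `dbscan`, the two-component feature tuples (assumed to be already normalized), and the values of `eps` and `min_pts` are all hypothetical.

```python
import math

def dbscan(items, eps, min_pts, dist=math.dist):
    """Return one label per item: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(items)
    # Neighbourhoods include the item itself, as in the usual definition.
    neighbours = [
        [j for j in range(n) if dist(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border item
            continue
        cluster += 1  # i is a core item and seeds a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border item reached from a core item
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # j is itself core: keep expanding
                queue.extend(neighbours[j])
    return labels

# Hypothetical normalized (speed, course) features of STs on one voyage route.
features = [(0.10, 0.50), (0.12, 0.52), (0.11, 0.49),
            (0.40, 0.50), (0.41, 0.51), (0.39, 0.49),
            (0.90, 0.10)]
sub_clusters = dbscan(features, eps=0.05, min_pts=3)
```

With these toy values the slower and faster groups end up in separate sub-clusters and the isolated last item is labelled as noise, which mirrors how ST5 and ST6 are separated by speed in the text.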

Big data analytics framework
The collision risk evaluation framework (Fig. 1) comprises three steps:
• Step (i), where STs are reconstructed using AIS data that contain static voyage and dynamic navigation details. The process is used to cluster the ship trajectories of the struck ships. Static voyage details (departure and destination points, voyage length) are used to cluster ship trajectories with K-means if their departure, destination and voyage length are similar. Then, DB-SCAN is used to re-cluster the results based on dynamic navigation features (speed, course, motion parameter variation, and spatial ship trajectory distances). STs   [91]. Then, collision scenarios and the hydro-meteorological data at that time associated with each cluster are stored in a database for further, more detailed collision risk analysis.
• Step (iii), where for each collision scenario detected under Step (ii) the collision risk when evasive actions are taken is evaluated using a CRI estimation model. More specifically, the risk profiles of ships are analyzed for each cluster by a method accounting for potential collision events over a pre-defined period corresponding to specific ship type operations in an area of reference. The CRI results are explored by statistical analysis accounting for real hydro-meteorological conditions.

Step i: Clustering of ship trajectories
The flowchart of AIS trajectory clustering using K-means and DB-SCAN is depicted in Fig. 2. It consists of three steps, namely: (a) reconstruction of STs; (b) grouping of static data by K-means; and (c) clustering of dynamic data by DB-SCAN. For step (a), uncertainties in AIS big data streams may relate to collection, transmission and reception errors [86]. AIS data from different ships may also not be transmitted at the same time, which may cause the data streams of different ships to be out of sync [74,87]. Thus, AIS data reconstruction requires trajectory separation, data filtering (i.e., outlier removal), and interpolation over 20 s intervals [30,79,80,84].
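The interpolation part of step (a) can be sketched as follows. The text only states that trajectories are interpolated over 20 s intervals; linear interpolation between consecutive position fixes, the function name `resample`, and the `(t, lon, lat)` tuple layout are assumptions made for this illustration.

```python
def resample(track, step=20.0):
    """Linearly interpolate (t, lon, lat) fixes onto a regular time grid.

    Assumes at least two fixes with strictly increasing timestamps (seconds).
    """
    track = sorted(track)
    out = []
    t = float(track[0][0])
    k = 0
    while t <= track[-1][0]:
        # Advance to the segment [track[k], track[k + 1]] that contains t.
        while track[k + 1][0] < t:
            k += 1
        (t0, lon0, lat0), (t1, lon1, lat1) = track[k], track[k + 1]
        w = (t - t0) / (t1 - t0)
        out.append((t, lon0 + w * (lon1 - lon0), lat0 + w * (lat1 - lat0)))
        t += step
    return out

# Hypothetical, irregularly spaced AIS fixes of one ship.
fixes = [(0, 24.0, 60.0), (30, 24.3, 60.0), (70, 24.3, 60.4)]
regular = resample(fixes)  # samples at t = 0, 20, 40 and 60 s
```

Resampling all ships onto the same time grid is what allows two trajectories to be compared timestamp by timestamp in the later collision detection step.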
Using the proposed unsupervised machine learning method based on the K-means and DB-SCAN algorithms, complex traffic scenarios can be explored in more detail in various voyages. An example of the ST clustering process for one ship with 6 STs (voyages) sailing in a given area is depicted in Fig. 3. Therein, the direction of ST1 is opposite to that of ST2; likewise, ST3,4 are opposite to ST5,6. Although ST3 and ST4 describe trajectories of ships navigating between the same departure and destination points, they are different. In a similar manner, ST5 and ST6 head in the same direction, but the speeds of the ships along the trajectories are different: the ship on ST5 is faster than the ship on ST6. Separation of the STs and exploration of the collision risk is achieved as follows:
• The K-means algorithm is used to classify STs into 4 clusters using static voyage features (departure, destination, voyage length). In this way, ST1, ST2, ST3,4, and ST5,6 should be positioned in different clusters.
• The DB-SCAN algorithm is employed to re-cluster the results using dynamic navigation data (ship speed, course, motion parameter variation and trajectory spatial distance). In this way, ST3 and ST4 (and likewise ST5 and ST6) should be positioned in different sub-clusters.
The STs are thus clustered into 6 clusters. Similar ship trajectories are grouped into the same cluster, so a cluster may contain more than one similar ship trajectory/voyage.
The adequacy of the approach depends on the availability of AIS data (Fig. 4 and Table 3). A trajectory is defined as a sequence of points:
$$Tr_i = \{p_i^1, p_i^2, \ldots, p_i^j, \ldots, p_i^n\}$$
where $p_i^j$ is a point in 2D space that contains the MMSI number of the ship, timestamp, geographical position, speed, course, heading, ship type, ship length, ship width, and draft; $j$ is the timestamp of this point; $n$ is the total number of points in the trajectory $Tr_i$; and $p_i^1$ and $p_i^n$ represent the ship departure and destination points.
The clustering of STs between the same departure/destination points is defined by the similarity parameter $S^{p}_{dd}$. This is a set including the distance between ship departure points ($p_i^1$, $p_{i+y}^1$) and destination points ($p_i^n$, $p_{i+y}^n$). It is used to identify the STs sharing the departure points ($p^1$), destinations ($p^n$), and vice versa as follows:
where $(lon_1, lat_1)$ and $(lon_n, lat_n)$ denote the longitude and latitude of the departure and destination points, respectively; $n$ is the total number of waypoints of the ST $i$. Voyage length is defined as:
K-means clusters the similarity of voyage features of different STs based on the main difference between departure $p^1$, destination $p^n$ and voyage length. The similarity parameter $S^{p}_{dd}$ denotes the main difference between alternative departure points ($p_i^1$, $p_{i+y}^1$) or destination points ($p_i^n$, $p_{i+y}^n$). On the other hand, $S^{st}_{dd}$ is defined as a similarity set that uses the sum of distances of the same departure and destination points of $Tr_i$ and $Tr_{i+y}$ according to the equation:
The similarity parameter $S_l$ denotes the difference in the voyage length of different trajectories defined by Equations (5,7). If the value of the similarity parameter $S_l$ is small, and the STs are those of ships navigating between the same departure points and the same destination points, then:
Consequently, STs can be clustered using the K-means algorithm based on the following three factors, defined as points in three-dimensional space: the similarity parameter $S_l$ and the similarity parameters $S^{p}_{dd}$ and $S^{st}_{dd}$. Additionally, if more than the three parameters above are considered for ST clustering using K-means, the algorithm does not perform well. Thus, dynamic navigation features should also be mined, using DB-SCAN on the same voyage route, to explore the differences between ship trajectories in more detail.
The navigation features of STs are derived from AIS data, including SOG, COG, and statistics of those (e.g., average value, median value, and variance). The average and median values of COG are used for determining the course feature defined by similarity parameters:
The motion parameter variation features are defined as follows:
To present the difference between the navigation features of various trajectories, $S_{sog}$, $S_{cog}$ and $S_{mpv}(Tr_i, Tr_{i+y})$ are defined as:
where $sog_{mean}$ and $sog_{median}$ represent the average and median values of SOG, respectively; $cog_{mean}$ and $cog_{median}$ represent the average and median values of COG, respectively; $sog_{interval}$, $sog_{std}$, $cog_{interval}$, and $cog_{std}$ denote the variable interval and standard deviation of SOG and COG; and $Tr_i$ and $Tr_{i+y}$ represent different STs. Voyage details and navigation features are derived from spatiotemporal AIS data. To calculate the spatial distance between two STs using their discrete AIS points, the spatial similarity of STs is calculated using the Hausdorff distance algorithm [40]:
The spatial similarity parameter of two different STs is defined as:
where $h(Tr_i, Tr_{i+y})$ denotes the Hausdorff distance from trajectory $Tr_i$ to $Tr_{i+y}$ and $h(Tr_{i+y}, Tr_i)$ denotes the Hausdorff distance (see Fig. 5) from ST $Tr_{i+y}$ to $Tr_i$; $S_h$ is the spatial similarity parameter of different STs.
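For readers unfamiliar with the Hausdorff distance, the sketch below uses its standard textbook definition: the directed distance is $h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert$, and the symmetric distance is the larger of the two directed values. Treating $S_h$ as this symmetric maximum, and measuring point distances as planar Euclidean distances between coordinates, are assumptions of this illustration; the function names are hypothetical.

```python
import math

def directed_hausdorff(a, b):
    """h(a, b): the largest distance from a point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Two hypothetical trajectories given as planar (x, y) points.
tr_i = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
tr_iy = [(0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
s_h = hausdorff(tr_i, tr_iy)
```

The two directed distances generally differ (here one trajectory ends far from the other), which is why both $h(Tr_i, Tr_{i+y})$ and $h(Tr_{i+y}, Tr_i)$ appear in the definition in the text.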
Clustering of voyage features (e.g., departure/destination, voyage length), navigation features (e.g., speed, course, and ship motion parameter variation), and the spatial distance of trajectories by the DB-SCAN method is achieved by:
where $S$ denotes the multi-criteria feature of an ST and $\omega_i$ indicates the weight of each of the above-mentioned feature parameters. The weights $\omega_i$ of the feature parameters are tested using a small sample based on the evaluation equation (31). These tests show that the performance of ST clustering is best when the weights $\omega_i$ are set to [0.13, 0.16, 0.21, 0.12, 0.12, 0.09, 0.17]. Because they have different dimensions, the features and spatial trajectory distances must be normalized according to the similarity estimation matrix [65]:
where $n$ is the number of STs for clustering and $S_{kk}$ is the multi-criteria feature of ST $Tr_i$ and ST $Tr_{i+j}$.
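A weighted combination of this kind can be sketched as below. Only the seven weight values come from the text; the order in which they map onto the seven similarity parameters, the use of min-max scaling as the normalization, and the function names are assumptions made for illustration, since the original equations are not reproduced here.

```python
# Weights quoted in the text; their assignment to specific features is assumed.
WEIGHTS = [0.13, 0.16, 0.21, 0.12, 0.12, 0.09, 0.17]

def min_max(column):
    """Scale one feature column to [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

def combined_similarity(rows, weights=WEIGHTS):
    """rows[k] holds the seven raw similarity parameters of trajectory pair k.

    Returns one weighted, normalized value per trajectory pair.
    """
    columns = [min_max(col) for col in zip(*rows)]
    return [sum(w * v for w, v in zip(weights, row)) for row in zip(*columns)]

# Three hypothetical trajectory pairs with seven raw parameters each.
raw = [[0, 0, 0, 0, 0, 0, 0],
       [1, 2, 3, 4, 5, 6, 7],
       [2, 4, 6, 8, 10, 12, 14]]
scores = combined_similarity(raw)
```

Because the seven weights sum to 1, the combined value stays within [0, 1] once every feature has been normalized, which keeps a single DB-SCAN distance threshold meaningful across features with different units.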

Step ii: Collision detection
During this stage a database utilizing global now-cast data from different providers is developed. Wind data are obtained from US NOAA (https://www.noaa.gov/); wave and tide data are based on Tidetech (https://www.tidetech.org/); and ocean current information is taken from Mercator Ocean (https://www.mercator-ocean.fr). The applicability of now-cast data is confirmed by comparisons against onboard measurements [27]. In these records, swell and wind wave components are represented by significant wave height, wave zero-crossing period and wave direction over 60 minutes. A spatial resolution of 1.25 km is used [34]. From now-casts, wave heights can be obtained within 0.3 m of uncertainty (globally). Wave periods are estimated within 2 s (e.g., [4,46]). The accuracy of the main sea weather forecast models is evaluated by comparing records against data collected on weather buoys using RMS error estimators [27,30]. The hydro-meteorological data are interpolated to the ship position and time delivered from AIS data. The interpolation process follows the principles outlined in Appendix A.
In Fig. 6, a potential collision scenario is defined as a ship-ship encounter that comprises four stages, namely (a) unconstrained navigation; (b) encounter; (c) collision avoidance; (d) clearance. Ship evasive actions take place when a ship performs course or speed alterations or both. In Fig. 6, STs $Tr_i$ and $Tr_{i+y}$ relate to the struck and target ships. During stages (a) and (d) the risk of collision between the two ships is negligible, either because of the distance between the two ships (stage a) or because of their diverging courses (stage d). At the encounter stage (b), when the rate of change of the relative bearing angle Δβ relative to the struck ship falls within [-2.00 to +2.00], the risk of collision is defined by COLREGs [35], and a collision may occur unless evasive action is taken.
If the distance between the two ships reduces but the rate of change of the relative bearing angle Δβ exceeds the range of [-2.00 to +2.00], this indicates that the striking ship (give-way ship) changes her course to avoid collision. The critical point associated with the maximum rate of change of the relative bearing angle Δβ is defined as the time at which evasive action is taken. At this point she enters the collision avoidance stage (c) (see timestamp k + t in Fig. 6). At this stage, the ships converge: the minimum distance between the STs of the striking and struck ships is below 3 nm, the minimum DCPA is below 1 nm, and the minimum TCPA lies within 0 to 30 min. The end point of the collision avoidance stage is defined as the point where TCPA becomes 0. If TCPA is below 0, there is no collision risk, the distance between the two ships is increasing, and the stage of clearance begins.
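The DCPA and TCPA thresholds above can be illustrated with the standard closest-point-of-approach geometry for two ships moving at constant velocity. This is a minimal sketch under stated assumptions, not the calculation of Appendix B: positions are taken in a local planar frame in nautical miles, velocities in knots, and the function names are hypothetical.

```python
import math

def cpa(own_pos, own_vel, tgt_pos, tgt_vel):
    """Return (DCPA in nm, TCPA in minutes) for constant-velocity motion.

    Positions are (x, y) in nm and velocities (vx, vy) in knots.
    """
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]
    vx, vy = tgt_vel[0] - own_vel[0], tgt_vel[1] - own_vel[1]
    v2 = vx * vx + vy * vy
    if v2 == 0.0:
        # No relative motion: the separation never changes.
        return math.hypot(rx, ry), math.inf
    tcpa_h = -(rx * vx + ry * vy) / v2  # hours; negative once the ships have passed
    dcpa = math.hypot(rx + vx * tcpa_h, ry + vy * tcpa_h)
    return dcpa, tcpa_h * 60.0

def in_avoidance_stage(distance_nm, dcpa_nm, tcpa_min):
    """Thresholds quoted in the text: 3 nm, 1 nm and 0 to 30 min."""
    return distance_nm < 3.0 and dcpa_nm < 1.0 and 0.0 < tcpa_min < 30.0

# Hypothetical reciprocal-course encounter: own ship northbound, target southbound.
dcpa, tcpa = cpa((0.0, 0.0), (0.0, 10.0), (0.5, 2.5), (0.0, -10.0))
distance = math.hypot(0.5, 2.5)
flag = in_avoidance_stage(distance, dcpa, tcpa)
```

A negative TCPA from `cpa` corresponds to the clearance stage described in the text, where the ships have already passed their closest point and are moving apart.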
The Avoidance Behavior-based Collision Detection Model (ABCD-M) used to detect collisions can be described as follows:
Part A, where the coordinate system is converted from the earth-fixed (AIS) frame to the struck ship-fixed frame (see more in Appendix B). In this part we determine the minimum distance between two STs. This requires that the STs of potential striking ships keep clear of the struck ship to minimize the potential for collision. The minimum ship distance $d_{min}$ is defined at timestamp $k + i$ corresponding to STs as Tr the timestamp interval of the two series (see more in Appendix B);
Part B, during which we determine collision avoidance behaviors during ship encounters based on ship course, relative bearing angle Δβ, rate of turn (ROT), TCPA, DCPA (see the calculation process in Appendix B), and the difference between the headings (Fig. 6);
Part C, where we classify collision scenarios as per COLREGs (Fig. 7 and Fig. 8).
To analyze the collision avoidance behaviors, the STs $Tr_T$ and $Tr_O$ during evasive action are defined by:
Based on the detected potential collision scenarios, to analyze the geographical relationship of potential conflicts from the struck ship's perspective, the coordinate system should be converted from the earth-fixed (AIS) frame to the struck ship-fixed frame (see Appendix B). Potential collision scenarios are classified into three types: head-on, crossing and overtaking, as depicted in Fig. 7.
Collision avoidance maneuvers that do not comply with COLREGs are common during navigation [64]. These are so-called cooperative collision avoidance maneuvers of two ships, in which the two ships understand the collision situation through communication and jointly work out a solution. A demonstration of such scenarios, alongside those according to COLREGs, is given in Fig. 8 from the struck ship's perspective.
Ideally the maneuvers of the give-way ship should be along the green track. However, some give-way ships may take evasive actions along the red track, which are defined as illegal evasive actions and should be culled. This is because evasive actions that do not comply with COLREGs cannot be used to define commonly agreed risk thresholds for intelligent decision support system development. In such encounters, communication between the vessels involved may lead to resolution of the conflict. The detected illegal evasive actions (cooperative collision avoidance scenarios) may be analyzed separately in research on ship collision avoidance under human-machine interaction.
Illegal evasive actions are detected using the relative bearing angle β from the striking ship to the struck ship in the collision avoidance stage (Fig. 6), from the struck ship's perspective, as shown in Fig. 8 according to COLREGs Rules 13, 14, and 15 [29]. The pseudocode for COLREGs Rule 15 (crossing collision scenario) is summarized in Table 4. The crossing collision scenarios are classified into three cases based on COLREGs Rule 15 (Fig. 8).
• If 5° < β < 67.5°, the struck ship is the give-way ship, and the striking ship should pass from the bow during the collision avoidance stage. The relative bearing angle β from the struck ship's perspective will increase to more than 180°.
• If 67.5° < β < 112.5°, the struck ship is the give-way ship and she is obliged to alter her course to the starboard side or reduce speed during the collision avoidance stage. In this situation, all the evasive actions are COLREGs compliant.
• If 247.5° < β < 355°, the striking ship is the give-way ship and should pass from the stern during the collision avoidance stage. In this situation the relative bearing angle β of the struck ship will decrease to less than 180°.
Examples of potential collisions during the encounter stage from the struck ship's perspective are shown in Fig. B.1 of Appendix B. Notably, if the collision avoidance behaviors violate the above terms from the struck ship's perspective, the evasive actions will be defined as COLREGs non-compliant for crossing collision scenarios.
The pseudocode for the overtaking collision scenario (COLREGs Rule 13) is shown in Table 5. In this case, 112.5° < β < 247.5° and the speed ratio of the striking to the struck ship is more than 1 (i.e., the striking ship is faster than the struck ship). Thus, the striking ship is the give-way ship and should overtake the struck ship during the collision avoidance stage (Fig. 8). The relative bearing angle β from the struck ship's perspective will increase to more than 270° or decrease to less than 90°. Otherwise, the evasive actions will be defined as COLREGs non-compliant for overtaken scenarios. Conversely, if the struck ship is the overtaking ship, all the evasive actions are legal (turn to the port side or starboard side). Since Ro-Pax ships are defined as the struck ships in this paper, only the overtaken cases are considered. Finally, the pseudocode for the head-on collision scenario (COLREGs Rule 14) is shown in Table 6. If 0° < β < 5° or 355° < β < 360°, the struck ship is the give-way ship, and the ships should pass each other port-to-port during the collision avoidance stage (Fig. 8). The relative bearing angle β will therefore decrease to less than 270°. Otherwise, the evasive actions will be defined as COLREGs non-compliant in the head-on situation [73].
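The sector boundaries quoted above can be collected into a small lookup, sketched below. It only encodes the bearing ranges and give-way roles stated in the text and is not the pseudocode of Tables 4 to 6; the function name, the treatment of the exact boundary angles (which the text leaves open), and the assumption that the overtaking sector always implies a faster striking ship are illustrative choices.

```python
def classify_encounter(beta_deg):
    """Return (scenario, give_way_ship) for a relative bearing in degrees.

    The bearing is taken from the struck ship's perspective, as in the text.
    Boundary angles (5, 112.5, 247.5, 355) are assigned arbitrarily here.
    """
    beta = beta_deg % 360.0
    if beta < 5.0 or beta > 355.0:
        return "head-on", "struck"
    if 5.0 <= beta < 112.5:
        return "crossing", "struck"
    if 112.5 <= beta <= 247.5:
        # Assumes the striking ship is the faster one (speed ratio above 1).
        return "overtaking", "striking"
    return "crossing", "striking"

# Hypothetical bearings at the start of the collision avoidance stage.
examples = {b: classify_encounter(b) for b in (2.0, 45.0, 180.0, 300.0)}
```

Checking compliance would additionally require tracking how β evolves during the avoidance stage, as described for each case in the text; that part is not covered by this sketch.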

Step iii: CRI estimation
CRI represents the risk of ship-ship collision by evaluating the geographical relationship of potential conflicts. The applications of the CRI method can be classified into two groups: (a) a specific value of CRI defined by expert knowledge is used as a risk criterion to detect ship conflicts (e.g., [17,71]); (b) the CRI model is employed to quantify collision risk for collision avoidance (e.g., [31,66]). However, the former lacks commonly agreed risk criteria indicating what constitutes a truly dangerous situation or when collision avoidance action should be taken [53]. Thus, using the potential collision scenarios detected in real operational conditions under Step (ii), the CRI method adopted in this paper is used to quantify collision risk at the time give-way ships take COLREGs-compliant evasive actions in real operations. This wide set of data can be used to calibrate, by statistical analysis, a commonly agreed risk criterion value, which in previous research was defined by expert knowledge. The CRI method is represented as:
The risk value for DCPA is defined as:
where $d_1$ is the minimum safe meeting distance, and $d_2$ is the minimum distance between the striking ship and the struck ship. In practice, to avoid a collision accident a striking ship should not pass the struck ship at a distance shorter than the one that is considered safe [18,19,52,56]. According to Gang et al. [17] such a distance can be calculated as follows:
where β is the relative bearing angle from the striking ship to the struck ship. The risk value for TCPA is defined as:
for the ship to sail from the current location to the point with minimum distance and $S_r$ is the relative speed between the two ships.
Table 4. The procedure pseudocode for culling the illegal evasive actions, focusing on the crossing situation.
Table 5. The procedure pseudocode for culling the illegal evasive actions, focusing on the overtaken situation.
Table 6. The procedure pseudocode for culling the illegal evasive actions, focusing on the head-on situation.
The risk value for the distance between the striking and struck ships (D) is defined as:
The risk value for the relative bearing angle β between the striking and struck ships is defined as:
The risk value for the ship speed ratio of the striking and struck ships is defined as:
where $\sin C = |\sin(|\theta_T - \theta_0|)|$, $k = V_0 / V_T$, and $\theta_T$ and $\theta_0$ are the courses of the striking ship and the struck ship. The factors mentioned above all contribute to CRI, but to different degrees. According to Gang et al. [17] and Hu et al. [31], the degree to which the five factors influence CRI is defined as per equation (29), and the weighting factors are determined as presented in equation (30).
Overall, CRI is a single crisp value reflecting the risk of collision with other ships, which summarizes the five factors influencing collision risk by equation (30). Usually the CRI for two ships is a cost-like value. It tends to be higher when the collision risk is higher (the higher the CRI value, the more difficult the avoidance maneuver). CRI weighting factors are usually set to 0.40, 0.367, 0.167, 0.033 and 0.033, respectively, which are determined by quantifying the difficulty of ship avoidance in various conflicts using a navigation simulator [78,96]. The CRI calculated using equation (30) is commonly used in collision risk evaluation and collision avoidance research (e.g., [17,31,66,71]).
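The aggregation step can be sketched as a weighted sum of the five risk values. The weights are those quoted in the text; applying them in the order in which the factors are introduced (DCPA, TCPA, distance, relative bearing, speed ratio), treating each risk value as a number in [0, 1], and the function name are assumptions of this illustration. The individual risk-value formulas are not reproduced here.

```python
# Weights quoted in the text, assumed to follow the order
# DCPA, TCPA, distance D, relative bearing, speed ratio.
CRI_WEIGHTS = (0.40, 0.367, 0.167, 0.033, 0.033)

def collision_risk_index(u_dcpa, u_tcpa, u_d, u_beta, u_k, weights=CRI_WEIGHTS):
    """Weighted sum of five risk values, each assumed to lie in [0, 1]."""
    risk_values = (u_dcpa, u_tcpa, u_d, u_beta, u_k)
    if any(not 0.0 <= u <= 1.0 for u in risk_values):
        raise ValueError("each risk value is expected to lie in [0, 1]")
    return sum(w * u for w, u in zip(weights, risk_values))

# Hypothetical risk values at the moment an evasive action is taken.
cri = collision_risk_index(1.0, 0.5, 0.0, 0.0, 0.0)
```

Since the weights sum to 1, the resulting index also lies in [0, 1], and DCPA and TCPA together account for more than three quarters of it, so these two factors dominate the index.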

Case study
As part of a practical case study we analyzed more than 4 billion AIS and hydro-meteorological data records describing various conditions over the 13-month ice-free navigation period of typical Ro-Pax ships steaming through the Gulf of Finland. As a result, estimates describing the risk of collision for these ships are derived. Information on the ship specifications and the study area is presented in Table 7. Fig. 10 shows that the STs of the mentioned Ro-Pax ships are complex and irregular.

Ship trajectories clustering into various voyages
The K-Means algorithm was used to cluster voyage details of STs (Fig. 2). As part of this process STs were reconstructed and then separated based on the distribution of time between voyages. Then, STs were separated for those cases the time interval between two ships exceeded 360 s. The 12,214 ship voyages of struck ships were divided into 8 clusters using the proposed method in Section 2.1 and Section 2.2.1 (see Fig. 11 and Table 8). Detailed analysis confirmed that K-Means can help classify STs using static voyage details.
The DB-SCAN algorithm was used to classify STs based on dynamic big data streams. The algorithm contains two threshold parameters, namely, MinLns and ε [92,93], where ε denotes a spatial distance threshold delimiting the neighborhood of a ST and MinLns denotes the minimum number of STs required to form a dense cluster. Formula (31) is often used to evaluate the performance of the clustering method. The parameters of clustering methods can be evaluated according to [45] as: Where C and N represent normal categories and abnormal results, dist(Tr x , Tr y ) represents the distance between trajectories Tr i and Tr i+y .
Theoretically, the lower the value E is, the better performance of clustering becomes. In this paper, several groups of MinLns (1 to 9) were compared with ε between 0.001 to 0.01. The experiences show that when the MinLns and ε are determined as 6 and 0.006, the valueE is lowest, showing that the performance of STs clustering is the best.
The results illustrated in Table 9 (see also Fig. 12 and Fig. 13) contain 16 sub-clusters on top of those initially identified by the K-Means method (Fig. 3 and Table 8). Sub-clusters (1), (2), and (3) represent ship traffic behaviors for trips from the port of Helsinki to the West Baltic and Russia. Sub-clusters (7), (9), (11), (13) and (16) represent the ship traffic behaviors of entering the port of Helsinki from the Baltic Sea, Russia and Tallinn. Sub-clusters (4), (6), (14), and (15) represent the ship traffic behaviors of entering the port of Tallinn from the Baltic Sea, Russia, and Helsinki. Sub-clusters (5), (8), (9), (10) and (13) represent the ship traffic behaviors of leaving the port of Tallinn for the Baltic Sea, Russia and Helsinki. Sub-cluster (12) represents the ship traffic behaviors from the Baltic Sea through the Gulf of Finland, heading directly to Russia. In addition, some incomplete STs are classified under cluster 17. STs belonging to the same sub-cluster are similar to each other in their navigation features. The results show that the proposed method performs effectively in marine traffic pattern recognition using massive STs.

Statistical analysis
STs were compared in terms of shape, speed and course (see Fig. 13, Fig. 14 and Fig. 15). From an overall perspective, the clustered STs are different (Table 9). However, within the same cluster, STs show a high similarity when it comes to voyage and navigation features. The reason for the latter is that struck ships encounter different traffic densities associated with different collision scenarios in different clusters.
The available weather database is listed in Appendix A. Hydro-meteorological data history records for the STs of all clusters at different locations and global ocean now-cast records were reviewed using the online weather database and the trilinear interpolation method of Appendix A. Table 10 and Fig. 16 show the cumulative distributions of the hydro-meteorological parameters for the 2-year operations of struck ships in the Gulf of Finland.
Analysis of the results shows that, during the ice-free period in the Gulf of Finland, all Ro-Pax ships navigate for 99% of the time in wave heights smaller than 3.24 m, swell heights of less than 1.49 m, wind speeds of less than 17.91 m/s over ground and current speeds of less than 0.51 m/s over ground. However, the combination of these conditions does not reflect the hydro-meteorological conditions encountered in one area of operation at the same time of the year. Rather, they reflect extreme encounters in different areas of operation at different times.
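A 99% threshold of this kind can be read off an empirical cumulative distribution. The sketch below shows one simple way of doing so; the sample values are synthetic and the function name `empirical_percentile` is hypothetical, so the output is not expected to match the values of Table 10.

```python
from typing import Sequence


def empirical_percentile(samples: Sequence[float], percent: int) -> float:
    """Smallest observed value that at least `percent` % of the samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(samples)
    # Integer ceiling of percent * n / 100, which avoids floating-point rounding issues.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Synthetic significant wave heights in metres (200 values from 1/64 m to 200/64 m).
wave_heights_m = [k / 64 for k in range(1, 201)]
p99 = empirical_percentile(wave_heights_m, 99)
print(p99)
```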

Collision scenarios
Potential collision scenarios were detected by applying the approach of Section 2.2.2 (see Fig. 6). To represent the relationship between the struck and striking ships using AIS, the origin of the original WGS-84 coordinate system was converted to a struck ship-fixed system (see Fig. 17(a)). For unconstrained navigation, 266,666 pairs of STs were merged within the 6 nm conventional radar range [24,82]. Furthermore, 138,973 pairs of STs were selected for which the minimum distance within the pair was under 3 nm (see Fig. 17(b)). 31,079 pairs of STs were obtained over the two years of maritime operations. The relative bearing angles between ships involved in different scenarios varied from -2.00 to +2.00 over 6 min observations. During the collision avoidance and clearance stages, DCPA and TCPA threshold conditions were applied. This resulted in 10,781 potential collision scenarios, of which 9,240 were COLREGs compliant. The remainder were assumed to be illegal evasive actions (cooperative anti-collision behaviors) and were culled according to COLREGs Rules 13, 14, and 15 (Table 11).
Fig. 18 shows the locations of striking ships triggering evasive actions for the 12,214 voyages of struck ships during ice-free operations. Most potential collision scenarios were located between Helsinki and Tallinn because of high traffic complexity. A radar display is shown in Fig. 19, where the struck ship is in the center and the blue scatter denotes the positions of striking ships taking evasive actions. As the speed of Ro-Pax ships is high, most striking ships were located in the 1st and 4th quadrants ahead of the struck ships (see Fig. 19), and their density lay within a radius of 1 km to 4 km. Higher density areas were visible at relative angles of 10°, 80°, and 280° in relation to the struck ship. These results can provide essential guidance to crews in understanding the distribution of striking ships from the own-ship perspective.
They may also be used to identify higher collision risk areas while onboard. A summary of hydro-meteorological data accounting for evasive actions during collision encounters is shown in Table 12. The results are based on the method presented in Section 2.2.2 (see also Fig. 14 and Table 10).
Fig. 11. The STs of clustered results after K-Means
Table 9 Sub-cluster descriptions after DB-SCAN.
| Sub-cluster | Description | Number of STs |
|---|---|---|
| 3 | After leaving the port from Helsinki heading directly to Russia. | 41 |
| 4 | After leaving the port from Helsinki heading directly to Tallinn. | 570 |
| 5 | After leaving the port from Tallinn, heading directly to the Baltic Sea in the west, in coastal waters. | |
| 6 | After leaving the port from Helsinki heading directly to Tallinn. | 4127 |
| 7 | From the Baltic Sea to the port in Helsinki. | 362 |
| 8 | After leaving the port from Russia to Tallinn. | 28 |
| 9 | After leaving the port from Tallinn, heading directly to Helsinki. | 571 |
| 10 | After leaving the port from Tallinn to the Baltic. | 40 |
| 11 | From the Baltic Sea to the port in Helsinki in coastal waters. | 375 |
| 12 | From the Baltic Sea through the Gulf of Finland, heading directly to Russia. | |
| 13 | After leaving the port from Tallinn to Helsinki. | 4098 |
| 14 | From the Baltic Sea to the port in Tallinn in coastal waters. | 319 |
| 15 | From the Baltic Sea to the port in Tallinn in open sea. | 10 |
| 16 | After leaving the port from Russia, heading directly to Helsinki. | 84 |
| 17 | Incomplete ship trajectories | 467 |
*Note: In winter, the Gulf of Finland may be ice-covered for several weeks. Thus the ice-free period is considered here as dominating in this area.
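The distance-based screening of ST pairs described in this section (pairs within the 6 nm radar range, then pairs whose minimum distance is under 3 nm) can be sketched as follows. This is only an illustration under stated assumptions: positions are taken to be already aligned in time, great-circle distance is computed with the haversine formula, and all names (`haversine_nm`, `screen_pair`, the category labels) are hypothetical.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_NM = 6371.0 / 1.852  # mean Earth radius in km converted to nautical miles

Position = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_nm(a: Position, b: Position) -> float:
    """Great-circle distance between two positions in nautical miles."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(h))


def min_distance_nm(struck: List[Position], striking: List[Position]) -> float:
    """Minimum distance over positions that are assumed to be aligned in time."""
    return min(haversine_nm(a, b) for a, b in zip(struck, striking))


def screen_pair(struck: List[Position], striking: List[Position],
                radar_nm: float = 6.0, close_nm: float = 3.0) -> str:
    """Classify a pair of trajectories by its minimum distance."""
    d_min = min_distance_nm(struck, striking)
    if d_min > radar_nm:
        return "out_of_range"
    return "close" if d_min < close_nm else "in_range"


# Synthetic example: the striking ship stays a fixed number of arc minutes north
# of the struck ship (one arc minute of latitude is roughly one nautical mile).
struck_track = [(59.90, 25.0), (59.95, 25.0), (60.00, 25.0)]


def offset_north(track: List[Position], arc_minutes: float) -> List[Position]:
    return [(lat + arc_minutes / 60.0, lon) for lat, lon in track]


print(screen_pair(struck_track, offset_north(struck_track, 2)))
print(screen_pair(struck_track, offset_north(struck_track, 5)))
print(screen_pair(struck_track, offset_north(struck_track, 10)))
```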
The analysis identified 16 clusters containing complete STs between departure and destination points and 1 cluster containing incomplete STs (see Fig. 13 and Table 9). Collision scenarios (Fig. 18) were classified into 16 clusters (Fig. 20 and Table 13). It was found that 50% of the potential collisions occur in cluster 13 (i.e., after leaving the port of Tallinn towards Helsinki). The clusters in Fig. 20 denote the grouped STs (see Fig. 13 and Table 9). This observation leads to the conclusion that potential scenarios can be evaluated in more detail by focusing on individual clusters (voyages). The frequency denotes the number of occurrences of potential collisions per journey during the period, calculated using Formula (32). Fig. 20 shows that the number of potential collisions per journey is highest in cluster 11 (3.03 potential collisions per journey). Notably, in clusters 12 and 15 Ro-Pax ships do not encounter other vessels. Clusters 6 and 13 lie on the same route between Helsinki and Tallinn, but in opposite directions. In cluster 6, 0.25 potential collisions occur per journey, i.e., one potential collision per 4.0 Ro-Pax journeys. By contrast, the 1.13 potential collisions per journey in cluster 13 are 4.52 times the value observed in cluster 6. The results show that the collision frequency differs between voyages, even when they follow the same route.
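The arithmetic behind these frequencies can be checked with a few lines. The sketch assumes that the frequency is simply the number of potential collisions in a cluster divided by the number of journeys in that cluster, as the verbal definition suggests; Formula (32) itself is not reproduced, and the function name is hypothetical. The last step is a rough consistency check only, which assumes that the rounded 50% share refers to the 9,240 COLREGs-compliant scenarios and uses the 4,098 STs of sub-cluster 13 from Table 9.

```python
def collisions_per_journey(n_potential_collisions: float, n_journeys: int) -> float:
    """Frequency of potential collisions per journey within one cluster."""
    if n_journeys <= 0:
        raise ValueError("n_journeys must be positive")
    return n_potential_collisions / n_journeys


# Values quoted in the text for clusters 6 and 13.
freq_cluster_6 = 0.25
freq_cluster_13 = 1.13

ratio = freq_cluster_13 / freq_cluster_6     # how much more frequent cluster 13 is
journeys_per_collision = 1 / freq_cluster_6  # journeys per potential collision in cluster 6

# Rough consistency check under the assumptions stated above.
approx_cluster_13 = collisions_per_journey(0.5 * 9240, 4098)

print(round(ratio, 2), journeys_per_collision, round(approx_cluster_13, 2))
```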

Collision risk index analysis when evasive actions are triggered
Potential collisions are detected based on grouped STs using the method presented in Section 3.3. The aim of this section is to calibrate the risk criteria for triggering evasive actions by quantitatively assessing the CRI. An example is presented in Fig. 21. At point 29 of [...] This information could help provide essential guidance for triggering evasive actions in time. To validate the detected potential collision scenarios, the TCPA and DCPA distributions are analyzed as shown in Fig. 24. Results confirm that if a struck ship's course falls into these eventualities, action should be taken to avoid collision (e.g., [57,58] and [98]). Fig. 25 shows the CRI distribution when evasive actions are taken, indicating that most of the striking ships with the highest collision risk [...] In Fig. 19 and Fig. 25, the collision risk level distribution appears to be different in relation to the location density of striking ships. It may therefore be concluded that a higher collision risk area may lead to more serious accidents, and that the location density of striking ships influences the number of potential collision locations related to the struck ship.
Fig. 19. The locations of striking ships corresponding to potential collisions while obvious evasive actions are taken (blue scatters denote the relative locations of striking ships).
*Note: An entire ST of a struck ship often encounters more than one striking ship, resulting in more pairs of STs of struck and striking ships being available.

Collision risk relationship among hydro-meteorological conditions
To understand the dependence of CRI during evasive actions on hydro-meteorological conditions, correlation analysis was carried out using the Pearson Correlation Coefficient (γ) [41] and Mutual Information (MI) [49]. The Pearson coefficient assumes normally distributed data and a linear relationship between CRI and hydro-meteorological conditions; under these assumptions it is thought to be sufficiently representative of positive or negative correlations. MI, on the other hand, is a measure of the mutual dependence between two variables. It is more general because it is based on the joint distribution and, unlike the Pearson Correlation Coefficient, is not limited to real-valued random variables and linear dependence. Using the MI test, the uncertainty coefficient (U) is calculated here; it determines how large a proportion of the uncertainty about collision risk can be removed by observing the hydro-meteorological condition variables. Table 14 summarizes the statistical correlations.
Negative γ correlations imply that adverse hydro-meteorological conditions may be associated with decreased CRI values when evasive actions are triggered. The negative statistical correlations between CRI and wave height, wind speed, and swell height imply lower risk for encounters under adverse weather conditions, when the bridge crew may wish to initiate evasive actions at a longer distance from the target to account for the effect of waves and wind on ship maneuverability. The γ values are low, showing that the correlation between collision risk and hydro-meteorological conditions is more complex than a linear one. Therefore, the MI test is employed, which reveals that the uncertainty about collision risk can be reduced by getting to know the hydro-meteorological conditions in more detail. Thus, based on the results of this study and within its boundaries, the negative γ correlations and positive U values show that hydro-meteorological conditions are influencing factors related to the CRI value, affirming that evasive actions in adverse hydro-meteorological conditions are associated with lower CRI in real operations. This finding may be supported by the triggering of evasive actions in various hydro-meteorological conditions, which shows that give-way ships should trigger evasive actions at lower CRI values in adverse hydro-meteorological conditions. Notwithstanding, further studies are needed to quantify the effect of hydro-meteorological conditions on CRI in more detail.
*Note: The frequency of potential collision scenarios denotes the number of occurrences of potential collisions per voyage during the period.
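The two dependence measures used in this section can be sketched in a few lines. The example assumes the usual definitions: the sample Pearson coefficient, and the uncertainty coefficient U = I(X;Y) / H(X) computed from discrete (binned) values. How the CRI and hydro-meteorological variables were actually discretized is not stated here, so the class labels and function names below are purely illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def uncertainty_coefficient(target: Sequence[Hashable], given: Sequence[Hashable]) -> float:
    """Share of the uncertainty in `target` that is removed by observing `given`."""
    h_target = entropy(target)
    if h_target == 0.0:
        return 0.0  # nothing to explain when the target never varies
    mutual_information = h_target + entropy(given) - entropy(list(zip(target, given)))
    return mutual_information / h_target


# Synthetic examples: a perfectly linear negative relation, and discrete classes
# that are fully dependent or fully independent.
gamma = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
u_dependent = uncertainty_coefficient(["high", "high", "low", "low"],
                                      ["calm", "calm", "rough", "rough"])
u_independent = uncertainty_coefficient(["high", "high", "low", "low"],
                                        ["calm", "rough", "calm", "rough"])
print(gamma, u_dependent, u_independent)
```

Unlike γ, U is not symmetric in its two arguments and carries no sign, so it indicates how informative a variable is about CRI but not the direction of the effect.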

Conclusions
The paper introduces a big data analytics method for evaluating ship-ship collision risk based on collision avoidance behaviors, with a RoRo/Passenger ship (RoPax) being considered as the struck ship. The method comprises (1) a data mining model to cluster STs of struck ships using unsupervised machine learning algorithms (K-means and DB-SCAN); (2) the identification of time-dependent traffic situations and associated hydro-meteorological conditions at the times of potential collision in the different clusters; and (3) ship collision risk assessment using a CRI model when evasive actions are taken. The method is demonstrated using data covering a 13-month ice-free period in the Gulf of Finland, considering all large Ro-Pax ships (46,124 GT > Gross tonnage > 10,000 GT; 218.8 m > Length > 120 m). Key conclusions may be summarized as follows:
• The innovative use of a data mining method combining K-means and DB-SCAN for clustering STs of struck ships is promising and useful for more detailed collision risk evaluation.
• Now-cast data and AIS data are useful for recovering detailed time-dependent traffic situations and the hydro-meteorological conditions at the times of unwanted events.
• The voyage may be the key influential factor contributing to collision risk, which is ignored in traditional models (Fig. 20 and Table 13).
• Big data analytics help in understanding the location distribution of striking ships (Fig. 19) and the degree of collision risk when evasive actions are taken in real operational conditions (Figs. 23 and 25), indicating that both higher collision risk hotspot areas and higher density hotspot areas should be considered when designing remedial steps for collision avoidance.
• In 97.5% of the scenarios considered, evasive actions are taken when CRI is greater than 0.45 (Fig. 22). The CRI criteria outlined may provide important support to the master on Ro-Pax ships, as part of an intelligent decision support system for collision avoidance. However, the right time to take any evasive action is also influenced by other factors, e.g., hydro-meteorological conditions, ship navigation systems (specifically the autopilot and the ARPA radar), operational instructions, and procedures of the shipping company.
• Adverse hydro-meteorological conditions seem to decrease the CRI, indicating that give-way ships tend to take evasive actions earlier than in favorable hydro-meteorological conditions (see Table 14).

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments
The work presented in this paper has been carried out within EU Horizon 2020 project FLARE (Grant No.: 814753-2). All authors acknowledge NAPA Ltd. for the provision of the AIS data used in the study case presented under Section 3 of this paper. The views set out in this paper are those of the authors and do not necessarily reflect the views of their respective organizations or of the FLARE consortium. Dr. Hirdaris acknowledges the financial support received from the Academy of Finland University competitive funding award (SA Profi 2-T20404) at the early stages of this work. Mr. Mingyang Zhang acknowledges the support of the Finnish Maritime Technology Foundation (Merenkulun Säätiö).

Appendix A: Interpolation method of hydro-meteorological data associated with AIS data
Hydro-meteorological data history records for each ship at different locations and global ocean now-cast records are reviewed. As part of this process we captured data streams with information on swell, wind, waves and sea currents. Swell and wind wave components are represented by significant wave height, wave zero-crossing period and wave direction. The trilinear interpolation method, which combines bilinear and linear interpolation, can be applied using equations (a1)-(a2). Fig. A1 shows a 3D view of this trilinear interpolation process. In the time dimension, linear interpolation is used to align the timestamps of the hydro-meteorological stream delivered from the weather now-cast database with the timestamps of the AIS data stream. In the space dimension, the hydro-meteorological data are interpolated onto the ship points of an ST using bilinear interpolation, based on the latitude and longitude of the hydro-meteorological stream and the AIS data stream.
In equations (a1)-(a2), $\Delta T_j$, $\Delta Lon_j$ and $\Delta Lat_j$ denote the changes in time, longitude and latitude of the hydro-meteorological data stream, respectively; $Hydro_{(i,i,i)}$ represents the hydro-meteorological data stream at the location $(Lon_i, Lat_i)$ at time $i$; and $Hydro_j$ represents the hydro-meteorological data stream at ship point $p_j$.
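The interpolation idea can be illustrated with a minimal sketch: bilinear interpolation in space on each of the two bounding time layers, followed by linear interpolation in time. This is the generic textbook form and is not claimed to match equations (a1)-(a2) term by term; the grid layout `corners[time][lat][lon]`, the function names and the test field are assumptions.

```python
from typing import Sequence, Tuple


def lerp(v0: float, v1: float, w: float) -> float:
    """Linear interpolation between v0 (w = 0) and v1 (w = 1)."""
    return v0 + (v1 - v0) * w


def trilinear(corners: Sequence[Sequence[Sequence[float]]],
              t: float, lat: float, lon: float,
              t_bounds: Tuple[float, float],
              lat_bounds: Tuple[float, float],
              lon_bounds: Tuple[float, float]) -> float:
    """Interpolate a value at (t, lat, lon) from the 8 corners of a grid cell.

    corners[it][ilat][ilon] holds the value at the lower (0) or upper (1) bound
    of each dimension.
    """
    wt = (t - t_bounds[0]) / (t_bounds[1] - t_bounds[0])
    wlat = (lat - lat_bounds[0]) / (lat_bounds[1] - lat_bounds[0])
    wlon = (lon - lon_bounds[0]) / (lon_bounds[1] - lon_bounds[0])

    def bilinear(layer: Sequence[Sequence[float]]) -> float:
        south = lerp(layer[0][0], layer[0][1], wlon)
        north = lerp(layer[1][0], layer[1][1], wlon)
        return lerp(south, north, wlat)

    # Space first (one bilinear step per time layer), then time.
    return lerp(bilinear(corners[0]), bilinear(corners[1]), wt)


# Synthetic check: a field that is linear in time, latitude and longitude
# is reproduced by trilinear interpolation up to rounding error.
def field(t: float, lat: float, lon: float) -> float:
    return 2.0 + 0.001 * t + 0.5 * lat - 0.25 * lon


t_b, lat_b, lon_b = (0.0, 3600.0), (59.0, 60.0), (24.0, 25.0)
cell = [[[field(t, la, lo) for lo in lon_b] for la in lat_b] for t in t_b]
value = trilinear(cell, 900.0, 59.25, 24.5, t_b, lat_b, lon_b)
print(value, field(900.0, 59.25, 24.5))
```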
Specifically, weather records included data from the following sources:
• Wind speed and direction from US NOAA - https://www.noaa.gov/
• Wave height, period and direction, tidal current, water level from Tidetech - https://www.tidetech.org/
• Ocean current from Mercator Ocean - https://www.mercator-ocean.fr
Fig. A1. Interpolation method of hydro-meteorological data from the ship perspective

Appendix B: Ship trajectory distance and CPA measures
Firstly, the coordinate system must be converted from the earth-fixed (AIS) system to the struck ship-fixed system (see Fig. B1). Then, the dynamic data (location and speed) of the striking ships can be converted into the new coordinate system defined by: [...]
Fig. B2. Relations between the value to be optimized and the value of $p^{k+i}_{j+i}$ at which a minimum is to be found
Fig. B3. The distance of the striking ship and struck ship
[...] (see Fig. B2). As mentioned above, the possible encounter stages are identified based on STs $p^{k \pm m}_{j \pm m}$ from $Tr_i$ and $Tr_{i+y}$ at the time $k \pm m$ associated with the minimum distance $d_{min}(p^{k+i}_{j+i})$ between pairs of ships. Distances are calculated considering the location of AIS onboard (Fig. B3) and the length of ships, as follows: [...]
where $\beta_i$ is the relative bearing angle from striking to struck ship; $\theta_i$ is the course of the encountered ships; $(x_0, y_0)$ and $(x_j, y_j)$ are the locations of the two ships; and $Dist_{ij}$ is the distance between the AIS reference positions of the two ships. The coefficients of equations (b7)-(b8) are defined based on AIS positions on the ship [30]. The DCPA and TCPA can be calculated based on equations (b10)-(b13), where $C_r$ represents the relative angle and $S_r$ denotes the relative speed.
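As an illustration of the CPA measures, the sketch below computes DCPA and TCPA from relative position and relative velocity in a flat local coordinate frame. It uses the standard vector form and is not claimed to reproduce equations (b10)-(b13), which are expressed through the relative angle and relative speed; the function names, the units (nautical miles, knots, hours) and the example encounter are assumptions.

```python
import math
from typing import Tuple

Vector = Tuple[float, float]  # (east, north) components


def velocity(speed_kn: float, course_deg: float) -> Vector:
    """Velocity components for a course measured clockwise from north."""
    c = math.radians(course_deg)
    return speed_kn * math.sin(c), speed_kn * math.cos(c)


def dcpa_tcpa(own_pos: Vector, own_vel: Vector,
              target_pos: Vector, target_vel: Vector) -> Tuple[float, float]:
    """Return (DCPA in nm, TCPA in hours), assuming constant course and speed."""
    rx, ry = target_pos[0] - own_pos[0], target_pos[1] - own_pos[1]  # relative position
    vx, vy = target_vel[0] - own_vel[0], target_vel[1] - own_vel[1]  # relative velocity
    rel_speed_sq = vx * vx + vy * vy
    if rel_speed_sq < 1e-12:
        # No relative motion: the separation never changes.
        return math.hypot(rx, ry), 0.0
    tcpa = -(rx * vx + ry * vy) / rel_speed_sq
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return dcpa, tcpa


# Synthetic reciprocal-course encounter: own ship heads north at 20 kn, the target
# is 10 nm ahead and 1 nm to the east, heading south at 10 kn.
dcpa, tcpa = dcpa_tcpa((0.0, 0.0), velocity(20.0, 0.0),
                       (1.0, 10.0), velocity(10.0, 180.0))
print(dcpa, tcpa * 60.0)  # DCPA in nm, TCPA in minutes
```

A negative TCPA from this function means the closest point of approach has already been passed, which is one reason threshold conditions are typically applied to both DCPA and TCPA together.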