Heterogeneous integrated dataset for Maritime Intelligence, surveillance, and reconnaissance

Facing an ever-increasing amount of traffic at sea, many research centres, international organisations, and industrials have favoured and developed sensors together with detection techniques for the monitoring, analysis, and visualisation of sea movements. The Automatic Identification System (AIS) is one of the electronic systems that enable ships to broadcast their position and nominative information via radio communication. In addition to these systems, the understanding of maritime activities and their impact on the environment also requires contextual maritime data capturing additional features to ships' kinematic from complementary data sources (environmental, contextual, geographical, …). The dataset described in this paper contains ship information collected through the AIS, prepared together with spatially and temporally correlated data characterising the vessels, the area where they navigate and the situation at sea. The dataset contains four categories of data: navigation data, vessel-oriented data, geographic data, and environmental data. It covers a time span of six months, from October 1st, 2015 to March 31st, 2016 and provides ship positions over the Celtic sea, the North Atlantic Ocean, the English Channel, and the Bay of Biscay (France). The dataset is proposed for an easy integration with relational databases. This relies on the widespread and open source relational database management system PostgreSQL, with the adjunction of the geospatial extension PostGIS for the treatment of all spatial features of the dataset.


a b s t r a c t
Facing an ever-increasing amount of traffic at sea, many research centres, international organisations, and industrials have favoured and developed sensors together with detection techniques for the monitoring, analysis, and visualisation of sea movements. The Automatic Identification System (AIS) is one of the electronic systems that enable ships to broadcast their position and nominative information via radio communication. In addition to these systems, the understanding of maritime activities and their impact on the environment also requires contextual maritime data capturing additional features to ships' kinematic from complementary data sources (environmental, contextual, geographical, …). The dataset described in this paper contains ship information collected through the AIS, prepared together with spatially and temporally correlated data characterising the vessels, the area where they navigate and the situation at sea. The dataset contains four categories of data: navigation data, vessel-oriented data, geographic data, and environmental data. It covers a time span of six months, from October 1st, 2015 to March 31st, 2016 and provides ship positions over the Celtic sea, the North Atlantic Ocean, the English Channel, and the Bay of Biscay (France). The dataset is proposed for an easy integration with relational databases. This relies on the widespread and open source relational database management system PostgreSQL, with the adjunction of the geospatial extension PostGIS for the treatment of all spatial features of the dataset.
© 2019 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons. org/licenses/by/4.0/).

Data
The dataset described in this paper contains ship information collected through the Automatic Identification System (AIS), validated and integrated with correlated contextual data characterising the vessels, the area where they are navigating and the concomitant sea conditions. It extends over six months, from October 1st, 2015 to March 31st, 2016 and covers the Celtic sea, the North Atlantic Ocean, the English Channel and Bay of Biscay (France). Value of the data The dataset is realistic and comprehensive, representative of operational information needs in different Maritime Intelligence, Surveillance, and Reconnaissance scenarios, including safety and security of navigation and preservation of protected areas. It will suit various researches and potential applications and reuses, including: Maritime traffic analysis: The AIS data volume is large enough to support the development of sea traffic models to characterise the maritime traffic and for estimating other patterns of life. Impact assessment of human activities at sea: The dataset may provide a valuable support for understanding the impact of human activities at sea, facilitating the development of sustainable fishery policies (e.g., by developing fishing pressure models 11 ). Environmental assessment: It might be beneficial to ocean dynamics estimation study, to assess the quality of specific ocean state variables (e.g., currents). Assessment and benchmarking of multi-source information fusion algorithms: The dataset gathers highly heterogeneous data, mixing raw and processed information, archival data and streaming data. Since these heterogeneous data are aligned in time and space, they are a perfect test bed to experiment multi-source information fusion algorithms.
In addition, the dataset is becoming a reference benchmark AIS dataset, profitable also for maritime information training activities, because the included AIS data are of high quality and have been validated, and being a mix of proprietary and public datasets, it would contribute to increase the value of public institutional data. Fig. 1 illustrates the spatial coverage of the dataset: the dashed lines correspond to the spatial bounding box of the data, 2 and the dark blue points represent the reported vessel positions. The location of the terrestrial AIS receiver is shown in Fig. 3, together with its associated theoretical coverage. Other navigation-related data are AIS status, codes, and types. Ship type is illustrated in Fig. 2. The dataset includes also vessel-oriented data (i.e., vessel registry), geographic data (i.e., spatial layers useful to contextualise vessel positions), and environmental conditions. Fig. 4 illustrates an example of the maritime protected areas included in the dataset (namely, Natura 2000). Fig. 5 correlates oceans conditions parameters (i.e., water depth, date, mean wave length, height of wind and swell waves, wave mean direction, sea surface height) along 3 days.
The dataset is composed by 34 data files in Comma-Separated Values (CSV) format and 15 geographic layers (as ESRI Shapefiles format), each provided with detailed meta-information specifying: source and originator, licence, spatial reference identifier (SRID), spatial coverage, temporal range, volume, together with a structured description of the data fields and types. Tables 1e4 summarise these meta-information by data category. The dataset is proposed with integration scripts and instructions to facilitate its insertion in a PostgreSQL/PostGIS relational database.

Experimental design, materials and methods
The following sections first provide a description of the four categories of data that constitute the dataset and how they have been collected and prepared: navigation data, vessel-oriented data, Fig. 1. Navigation data (purple background represents fishing areas computed by Ref. [2]). 1 Brest roadstead in which the AIS receiver is located has a local regulation enforcing fishing vessels to use AIS permanently, thus providing high-quality positional information. 2 The bounding box of the dataset has the following coordinates: Longitude between À10.00 and 0.00 and Latitude between 45.00 and 51.00. geographic data, and environmental data. Secondly the paper explains the experimental process and concludes presenting the broad impact of the dataset.

Navigation-related data
The AIS communicates 27 kinds of messages, each one having its own purpose in information transmission (positioning, nominative information, management …). The messages are broadcast in a theoretical range of circa 40 nautical miles. These AIS messages are binary messages that comply with  3 In the dataset, two main classes of messages are considered: positioning and nominative information.

Positioning and nominative information
Positioning messages (ships' dynamic messages): Several AIS messages provide vessel positions that are acquired automatically by AIS transponders using embedded sensors (typically, GPS, gyroscope, loch, compass). To build the dataset, the ITU message types ITU 1, ITU 2, ITU 3, ITU 18, and ITU 19 were selected from which the following fields have been extracted: The Maritime Mobile Service Identity (MMSI) which is an international and unique ship identifier; The coordinates (longitude and latitude expressed using WGS84 reference system); The associated Speed Over Ground (SOG) in knots; The true heading in degrees (relative to true north) and the associated Course Over Ground (COG) also in degrees (direction of motion); The rate of turn, when available, expressed in degrees per minute; The navigational status, an integer encoding the current motion status of the ship (e.g. anchored, on the way sailing …).
These messages constitute an ordered time series. However, AIS messages do not embed the timestamp of the emission. Therefore, each message has been timestamped, with an integer corresponding to a UNIX epoch time 4 upon reception by the receiving workstation (cf. Section 2.5). Fig. 1 illustrates the content of positioning messages reported on a map. The dashed lines correspond to the bounding box of data.
Nominative messages (ships' static messages): Several message types provide ship meta-information such as the ship name or voyage-related information. Some of these data are fully static (i.e. set at the initialisation of the AIS device onboard) and will not change during ship's life (e.g., ship dimension), while few others can evolve (e.g., the name can change depending on the owner) with varying frequency (e.g. destination should be updated at each different voyage). These fields are manually set and thus subject to errors, imprecisions or simply not fed. Nominative information collected for this dataset contains: The MMSI, and the ship identification number (an integer) provided by the International Maritime Organization (IMO); The international radio call sign (a string of characters); The name of the vessel (a string of characters) and associated ship type encoded with an integer;  The reference point for reported position (biggest ships can reach 400 m length) and overall dimensions of the ship expressed with four integers describing a rectangle based on the reference point.
In addition, voyage-related information contains: The destination (next port of call) of the current trip (a string of characters manually entered) and associated Estimated Time of Arrival (ETA) expressed in format month-day-hour-minute (using the Coordinated Universal Time with time zone); The draught of the ship for the current voyage (expressed in meters by a real number between 0.1 and 25.5).
These fields have been extracted from ITU messages of types ITU 5, ITU 19, and ITU 24. Similarly to dynamic messages, each message has been timestamped with UNIX epoch time at the receiving workstation.
Positioning messages (search and rescue dynamic messages): Several Search And Rescue (SAR) aircrafts operating at sea are also equipped with AIS transceivers. These particular emitters provide their locations using message type ITU 9. The information extracted from SAR messages contains: The MMSI; The coordinates (longitude and latitude expressed using WGS84 reference system); The associated speed over ground in knots; The course over ground expressed in degrees (with respect to the true north); The altitude of the search and rescue aircraft (an integer ranging from 0 to 4094 m).
Each message has been also timestamped with UNIX epoch time at the receiving workstation. Positioning messages (aids to navigation dynamic messages): Dynamic data of the AIS also include locations of Aids To Navigation (AtoN), typically buoys and lighthouses. An aid to navigation can be physical or virtual. Virtual AtoN do not exist physically and can be useful in time-critical situations and in marking/delineating dynamic locations or areas where navigational conditions change regularly. This equipment communicates information though the message type ITU 21. Information extracted from AtoN messages contains: The MMSI; The type of the aid to navigation encoded with an integer and associated name (a string of characters); The coordinates (longitude and latitude expressed using WGS84 reference system); The nature of AtoN (a Boolean value to distinguish between virtual and physical AtoNs).
Each message has been also timestamped with UNIX epoch time at the receiving workstation.

AIS status, codes and types
Some data fields of AIS messages are encoded, usually with integer codes. To facilitate the understanding and the analysis of AIS data, these enumerations and the associated information have been included in the dataset, in Comma-Separated Values (CSV) files.
Status: The navigational status provided by positioning messages is encoded by an integer corresponding to 16 different status which names are themselves described by predefined strings of characters (e.g. moored, under way).
Country Codes: Each ship is registered in a country (flag). The country of each ship is encoded in the first three digits of each MMSI number (e.g., '227' is used for France). Country codes and names (string of characters) are included in the dataset.
Ship Types: The ship type is encoded by an integer (Fig. 2) in nominative messages (e.g., '30' is the code used for fishing vessels). The correspondence between 38 codes (integer) and the ship types (string of characters) as described in Ref. [9] is provided in this dataset. Additionally, a list of 233 refined ship types has been provided for advanced classification (this extended list is based on a commercial service's list 5 ).
AToN: The type of aid to navigation (e.g., floating vs. fixed buoy, light, beacon) is encoded by an integer in the ITU 21 dynamic messages. This CSV data file provides a textual description (nature and type) for these codes.

AIS receiver
Receptor location: The AIS messages have been collected using a single terrestrial receiver. The position of this receiver is given in the dataset and displayed by the yellow star in Fig. 3. It is a twodimensional geometry of type point (i.e., its altitude is at sea level) with coordinates (longitude and latitude) expressed using the WGS84 reference system. Theoretical coverage of the receiver: Each terrestrial AIS receiver has a theoretical coverage that depends on its location and the surrounding topography. 6 The theoretical coverage of the AIS receiver is given as a geometrical polygon with coordinates (longitude and latitude) expressed using the WGS84 reference system (the light blue polygon in Fig. 3). The receiver theoretical coverage has been calculated using the definition of sea areas from IMO resolution A801 (19). It computes a circle of radius R nautical miles where R is equal to the transmission distance between a ship's VHF antenna at a height of 4 m above the sea level and the VHF antenna of the coastal station (at a height of H meters) which lies at the centre of the circle. The coverage has been computed with an antenna at 70 m above the sea level (66 m being the altitude of the location, plus four additional meters for the height of the building where the antenna is located).

Vessel data
Other nominative vessel information is available in official registers. For the vessels navigating in the area covered by the dataset, two of the most relevant institutional registers are included in this dataset.

Community Fishing Fleet Register (European Commission) 7
The European Commission freely provides a list of all fishing vessels that fly a flag of one of the countries of the Union. This fleet register concerns fishing vessels only which represent a fraction of the navigating vessels. The data fields that can be matched with AIS data (i.e., same field either present or inferable within both datasets) are the international reference call sign, the name, and the vessel length. The fleet register also includes few technical details like length, gear type (61 different types), year of construction and engine power of the vessels.

Civilian ships registered by ANFR (Agence Nationale des Fr equences) 8
The fleet register provided by the French Frequencies Agency (ANFR) gathers a large number of French-registered vessels. The data fields include several normative information (MMSI, IMO numbers, registration numbers, ship name) that can be matched with AIS fields. The dataset also provides characteristics of the ship (type, length, and tonnage). Communication facilities (e.g. VHF) and information about ship licence (active, not active, and associated dates) are also described.

Geographic data
Geographic data provide complementary information to vessel movement data about topographic or regulatory context of vessel navigation.

Ports
A lot of ships (e.g., tankers, cargos, passengers, ferries) have origins and destinations corresponding to ports. Having official lists of ports is therefore crucial. It can be used for instance to disambiguate the Destination field of AIS message 5 (filled manually). The dataset encompasses a detailed list of local ports and two worldwide lists (passing traffic around Brittany include world destinations).
Ports of Brittany. 9 This dataset proposed by the Brittany region gathers the location of 222 ports around Brittany with names and coordinates expressed in the WGS84 system. A French law transferred, between 2015 and 2017, the management of ports from the region to departments. The dataset 6 Let us note that the practical coverage of an AIS receiver evolves regularly, mainly depending on weather conditions [11].
Additionally, some AIS messages may be propagated by AIS repeaters. Both affect the practical coverage. 7  has been modified accordingly. In 9 , only ports still belonging to Britany region are proposed, however two of the four departments of Britany propose equivalent datasets, 10,11 . World Port Index (National Geospatial-intelligence Agency). 12 The World Port Index (WPI) is a publication of the National Geospatial-intelligence Agency, which contains location and physical characteristics of, and the facilities and services offered by major ports and terminals world-wide. About of 3700 ports throughout the world are included in this dataset.
SeaDataNet Ports Gazetteer. 13 This dataset provided by EMODnet, the Pan-European Infrastructure for Ocean and Marine Data Management 14 focuses on halieutic ports. It contains names and coordinates of almost 5000 fishing ports throughout the world. 15 The knowledge of the coastline is essential for the understanding of maritime movements. This dataset is a high resolution (1:100,000 scale) coastline of European shores (polylines and polygons, as shapefile), created by the European Environmental Agency (EEA) enabling highly detailed spatial analysis, such as the assessment of the proximity of vessels to coastlines or islands in support to maritime safety. It is a data derived from two sources: EU-Hydro 16 and the Global Self-consistent, Hierarchical, High-resolution Geography Database (GSHHG). 17

Sea areas (International Hydrographic Organization) 18
The dataset contains the main seas of the world (101) as polygons representing contiguous bodies of water, regionalised and divided into maritime basins. It can be useful to determine if two vessels are in the same region of the world. Areas covered (polygons concerned) by the navigation data are the Celtic sea, the North Atlantic Ocean, the English Channel and the Bay of Biscay.

FAO Major Fishing Areas (Food and Agriculture Organization) 19
This dataset contains the worldwide fishing regions established by the Food and Agriculture Organization (FAO) of the United Nations (UN). The boundaries were determined in consultation with international fishery agencies considering the distribution of the natural resources, national practices and boundaries, international conventions. It can be used for example to assess the declared provenance of fish against the areas in which a vessel effectively sailed or exhibited a fishing behaviour. 20 All the Exclusive Economic Zones (EEZs) of the world are gathered in this dataset. This can be useful to determine the quality of some at-sea operations as well as the competent court in some activities. Exclusive Economic Zones Boundaries are given as polygons and polylines. Areas beyond this boundary can be classified as high seas.

European maritime boundaries (European Environment Agency) 21
The dataset provided by the European Environmental Agency contains maritime boundaries in Europe that include territorial waters, bi-or multi-lateral boundaries as well as contiguous and exclusive economic zones (EEZs).

Natura 2000 areas (European Environment Agency) 22
The European database on Natura 2000 collects the reporting of the European protected areas, as an ecological network developed for the preservation of species and habitats (terrestrial and maritime areas). The dataset is managed by the European Environmental Agency and is built upon data submitted by European member states. The dataset contains descriptive data (e.g. the list of all species and habitat types) and spatial data (borders of sites). The version included in the dataset covers 2017 reporting. Fig. 4 shows areas of this data file.

European fishing areas (European Commission Joint Research Centre) 23
This dataset was created by the European Commission Joint Research Centre (JRC) and provides the fishing grounds in European waters, which can be used to assess the behaviour of fishing vessels. An assessment on the fishing pressure can also be extracted from this database. Fishing grounds have been derived from ship AIS positions collected along one year (from September 2014 until September 2015) and emitted by selected categories of fishing vessels exhibiting a fishing behaviour. Raw AIS data are originated from Volpe Centre of the U.S. Department of Transportation, the U.S. Navy, and MarineTraffic [2].

Fishing constraints
This dataset contains two geographic areas where shellfish fishing activity is forbidden in the time window of the dataset.

Environmental data
Weather data and ocean data from forecast models and from observations (e.g., in-situ sensor data), which are openly available from several providers (see Table 4), can help validate analysis results and explain abnormal behaviour. For example, sea and weather conditions can force vessels to change  direction or modify their normal route. They can also be used to characterise seasonal trends in traffic routes and to contextualise vessels kinematics such as speed.

Ocean conditions
The ocean conditions were extracted from an hindcast database built with a WAVEWATCH III model [7] and provided by IFREMER. 24 Initially built for the design of marine energy converters, it can also be used to study ships behaviour. Therefore, only the parameters relevant to this context were selected and stored in 6 CSV files (one file per month). The values are provided every 3 hours for each point of a regular spatial grid over the dataset bounding box. The grid spacing is 2 minutes, for both latitude and longitude. The parameters include coordinates (longitude and latitude expressed using the WGS84 reference system); bottom depth in meters; sea surface height above sea level in meters (tidal effect); significant height of wind and swell waves; mean wave length in meters; wave mean direction; and a timestamp expressed using epoch time (Fig. 5).

Weather observations 25
The coastal weather observations were recorded by 16 stations located in the south of England and along the French coasts. The data have been cleaned and formatted in the same way for the 16 selected stations, providing the most relevant fields such as temperature, atmospheric pressure, wind direction/speed and horizontal visibility, every hour over the 6 months period. The unformatted or potentially biased fields, especially those based on the human perception, have been removed.

About coverage and time period
At the core of this heterogeneous dataset are AIS messages received by a terrestrial station located in the Brest roadstead, France (Fig. 3). This station is conveniently in direct sight of the roadstead bottleneck in order to cover the traffic in the roadstead, including the inbound, outbound movements and the traffic passing-by the Ushant Traffic Separation Scheme 26 (TSS). In addition, local regulations from maritime authorities enforce fishing vessels to use AIS permanently, ensuring a rich-set of high-quality positional information [8]. The selected six-month period, from October 1st, 2015 to March 31st, 2016, is tied to fishing activities, and corresponds to shell fishing rights.

AIS data
Statistics. The total number of messages received by the station is 24.033.893, from which 19.152.196 (79.7%) messages have been selected, to include vessels positions, aids to navigation and search and rescue. Among these selected messages, 1.032.187 are static messages (4.3%). 70.5% of vessels positions are located in a range of 10 km from the antenna. Although the theoretical maximum distance covered by the antenna is about 50 km (line of sight), the original dataset contains messages emitted at 700 km, due to the atmospheric ducting effect.
Data cleaning. AIS messages are well defined but sometimes erroneous because of system faults, data errors or deliberate manipulation. The dataset has been cleaned from errors resulting in 125.202 messages being discarded (0.5%) by the AIS parser (cf. next section). AIS management messages have been filtered out as well. Apart from errors and irrelevant messages, decoded AIS data are provided as received, including duplicates and other veracity issues that can be used to challenge the algorithms.

AIS processing
Raw messages. Most AIS messages come without any timing information because the AIS has been initially designed as an anti-collision system to be used lively. The receiving station timestamps the messages immediately upon reception, in UTC format, and this timestamp is included in the navigation and nominative data. In a second step, the receiver broadcasts the timestamped messages, in the raw and unparsed format received to a central server which stores all the messages (i.e., all the 27 different message types described in the ITU-R.M 1371-4 or NMEA 4.0 specification) in text files (one file per day).
Processed messages. The text files corresponding to the AIS messages received during the selected time period (October 1st, 2015 to March 31st, 2016) have been parsed and decoded. The parser, written in Java, extends and adapts the CC-BY-NC-SA 3.0 parser aismessages. 27 The additional functionalities developed include: the connection to the database, ingestion, and outputting of UDP (User Datagram Protocol) streams and files (used by the AIS), automatic folder import, data logger for unparsed messages, data export translated to the standardised TAG Block format. 28 The parsed data have been stored in a spatio-temporal database to facilitate the next processing steps. The data processing uses the widespread and open source relational database management system PostgreSQL, 29 and its geospatial extension PostGIS 30 for the treatment of spatial features. The data model of the developed AIS database encompasses a single schema, containing one table per message type. The AIS data included in the dataset are extracted from this database. Two files, aggregating different message types, have been created, containing respectively ships' positioning messages (messages ITU 1, ITU 2, ITU 3, ITU 18, and ITU 19) and nominative messages (messages ITU 5, ITU 19, and ITU 24). Two additional files containing respectively search and rescue messages (ITU 9) and aids to navigation messages (ITU 21) have been also extracted from the database. During the aggregation process, messages have been spatially filtered using the bounding box illustrated in Fig. 1.
Only message fields with navigation information have been selected, excluding few technical fields.

Other data
For complementary data to the terrestrial AIS dataset (i.e. Sections 2.2, 2.3, 2.4) we favoured public sources like European institutions and projects [1], including: SeaDataNet, 31 Copernicus, 32 IFREMER, 33 the EU Science Hub 34 and EMODnet. 35 Several of these data sources have been processed (e.g. original 1462 ocean condition files [7] have been aggregated in 6 files, one per month) mainly to facilitate their use. Only fishing areas [2] have been modified to limit the coverage to the bounding box illustrated in Fig. 1.
Additionally, nautical charts can be very useful for the understanding of maritime navigation. IHO S-57 (and its revision S-100) is the current IHO standard for digital hydrographic data. It defines a data model including 159 geo-object classes and the data structure and format used to implement it. These charts cannot be shared, but a technical note describing useful S-57 nautical charts and objects, including the necessary scripts to process them, has been prepared and proposed in Ref. [10].

Data integration
To facilitate the integration and exploitation of the dataset, the database model used to prepare the data, including the SQL queries and the scripts for data integration, is proposed as part of the dataset (readme file, Windows and Linux scripts and needed SQL scripts). This relies on the relational database management system PostgreSQL and its PostGIS extension.

Maritime dataset ecosystem
The dataset described in this paper has been carefully prepared and validated, checking for errors, in order to offer the research community a set of heterogeneous real data to challenge, test and validate their research developments. The dataset was prepared to answer specific research questions on data integration and processing within the activities of the datAcron H2020 project (datacron-project.eu) [4e6]. It supported the development and assessment of real-time detection and prediction of trajectories and complex events related to moving entities 36