Segmentation Method for Ship-Radiated Noise Using the Generalized Likelihood Ratio Test on an Ordinal Pattern Distribution

Due to the diversity of ship-radiated noise (SRN), audio segmentation is an essential procedure in the ship statuses/categories identification. However, the existing segmentation methods are not suitable for the SRN because of the lack of prior knowledge. In this paper, by a generalized likelihood ratio (GLR) test on the ordinal pattern distribution (OPD), we proposed a segmentation criterion and introduce it into single change-point detection (SCPD) and multiple change-points detection (MCPD) for SRN. The proposed method is free from the acoustic feature extraction and the corresponding probability distribution estimation. In addition, according to the sequential structure of ordinal patterns, the OPD is efficiently estimated on a series of analysis windows. By comparison with the Bayesian Information Criterion (BIC) based segmentation method, we evaluate the performance of the proposed method on both synthetic signals and real-world SRN. The segmentation results on synthetic signals show that the proposed method estimates the number and location of the change-points more accurately. The classification results on real-world SRN show that our method obtains more distinguishable segments, which verifies its effectiveness in SRN segmentation.


Introduction
To distinguish the statuses/categories of ships according to their radiated noise, we require audio segments to be homogeneous to extract consistent acoustic features. Therefore, audio segmentation is an essential procedure to deal with the diversity of the ship-radiated noise (SRN). The primary sources of SRN diversity in the real world are as follows: 1.
The SRN consists of a variety of components, including propeller noise, hydrodynamic noise, and noise from various mechanical parts radiated into the water through the hull [1].

2.
The traits of the SRN relate to the propulsion devices and operating states (entering or departing a port, waiting for boarding) of ships. 3.
The SRN varies while the ship is sailing nearby the hydrophone because the near field sound around the ship is not isotropic [2]. 4.
As the absorption coefficient changes with the distance between the hydrophone and the sound source [3], the proportion of the high-frequency and low-frequency in the SRN spectrum shifts when a ship is approaching or leaving.
The methods for audio signal segmentation are divided into two types: the model-based methods and the metric-based methods [4]. The model-based methods train a model to explore the subtle

Materials and Methodology
This section is structured as follows. In Section 2.1, we formulate the problem of audio segmentation and introduce the motivation of the proposed method. In Section 2.2, the preliminaries and estimation of OPD are reviewed. In Section 2.3, we proposed a segmentation criterion for single change-point detection by a GLR test [21] on the OPD. In Section 2.4, we provide an algorithm for multiple change-points detection using the proposed criterion and also a computation-efficient OPD estimation on a series of analysis window.

Problem Formulation and Motivations
We need to answer two basic questions in audio segmentation. First, is this audio signal homogeneous? Second, supposing it is non-homogenous, where does its characteristic shift? For an audio signal, we should estimate both the number of the change-points N and their location simultaneously. Formally, we assume the audio signal X = {x 1 , · · · , x T } is a piecewise stationary process, with N change-points in it. The full set of the unknown change-points is denoted as S CP = {j 1 , j 2 , · · · , j N } . (1) The N + 1 homogeneous segments split by the change-points are represented as Then, the audio segmentation is formulated as where C(j) is the similarity between the segments before and after a change-point j. As C(j) is computed from the probability distributions of the acoustic features, it depends on different choices of the acoustic features and their probability distribution. In this study, we perform change-point detection according to the OPD of the audio signal. The motivations are as follows: We use Π to denote the full set of the possible ordinal patterns as Π := {π 1 , π 2 , · · · , π m! }, (6) where each π ∈ Π corresponding to a specific order of the elements in X t [25]. Then, using Equation (5), X E is transformed into a sequence of π as where o t ∈ Π. The OPD describes the probability of the ordinal pattern o t taking each possible π. The corresponding probability mass function is where K = m! is the number of possible permutations, and m is the order of the ordinal patterns. p π k represents the probability of occurrences of π k and satisfies ∑ K k=1 p π k = 1. [] is the Iverson bracket, which takes a value of one when the condition in the parentheses is true; otherwise, it is zero.
In practical application of SRN segmentation, the procedure of mapping from X E to O π is computation-intensive, since the computation cost of permutation estimation is equal to that of sorting. Considering the overlaps of X t in X E , we use the left inversion count l [26] to construct an efficient mapping function. l i (x) is defined as the number of elements in a sequence x greater than x(i) before where #{} denote the number of elements in the set. According to the inversions of permutations [27], equivalent to Equation (5), the ordinal pattern can be represented with a sequence of left inversion counts, as According to Equation (4), let X t denote the last m − 1 elements in X t , which are identical to the first m − 1 elements in X t+1 , we compute the left inversion counts of X t as Then, o t+1 is estimated from l k (X t ) with at most m − 1 comparisons, as By computing the index of π in Π, X E is directly mapped to O π according to where d ∈ {1, 2, · · · , m}, and π d(X t ) is the ordinal pattern of X t . Finally, the coefficients in Equation (8) are estimated from the number of occurrences of each π in O π , as where k = 1, · · · , K. The whole process of the OPD estimation is shown in Figure 1.

Proposed Criterion for Single Change-Point Detection
The key ingredient of the single change-point detection is the criterion of whether a change-point exists in a signal. For simplification, we assume that there is at most one change-point in an analysis window with length T. X 0 = (x 1 , x 2 , · · · , x T ) is the audio signal in the analysis window. In case that a change-point j exists, the two segments separated by the unknown change-point j are According to Section 2.2, o 0 t , o 1 t , and o 1 t are the sequence of ordinal patterns corresponding to X 0 , X 1 , and X 2 , respectively. SCPD can be formulated as a hypothesis testing for model selection [28,29]. Based on the OPD, our hypothesis testing for SCPD is stated as follows. The null hypothesis H 0 states that o 0 t follows the OPD O(p 0 ), as where p 0 = (p 0 1 , p 0 2 , · · · , p 0 K ) is the parameters of the OPD, and satisfies ∑ K i=1 p 0 i = 1. The alternative hypothesis states that o 1 t and o 2 t follow two OPD with distinct parameters. Under the alternative hypothesis H 1 , the pre-change and post-change distribution are denoted by O(p 1 ) and O(p 2 ) as where p 1 and p 2 are the free parameters of O(p 1 ) and O(p 2 ). Using p 0 , p 1 and p 2 , which are estimated efficiently by Equation (14), we compute the log-likelihood functions of the null hypothesis H 0 and the alternative hypothesis H 1 as Substituting Equation (14) into Equation (18), we have where PE n = − ∑ K i=1 p n i log(p n i ) and n = 0, 1, 2. According to Wilks' theory [30], under the null hypothesis H 0 , when the sample size approaches infinity, the likelihood ratio asymptotically approximates the χ 2 distribution, and the degree of freedom equals the difference between the numbers of parameters in H 0 and H 1 . Therefore, the generalized likelihood ratio of our hypothesis test is approximated as where m is the order of the ordinal pattern. Then, the criterion for SCPD is established from Equation (20). As the estimation of LR varies with the location of the tentative change-point j, we use the maximum of LR to test the existence of the change-point. With a given significance level α, a change-point is detected if where ICDF represents the inverse cumulative distribution function of the χ 2 distribution. The location of the detected change-point is estimated aŝ Conversely, we reject the existence of a change-point if Equation (21) does not hold.

Computation-Efficient Multiple Change-Points Detection with a Variable Window
Audio segmentation is essentially an MCPD in practice. MCPD is more challenging than SCPD, as its main goal is to estimate the number and location of change-points simultaneously, which means exploring an ample segmentation space. Therefore, the calculation cost of the MCPD algorithm increases with the number of data points. The pros and cons of many MCPD algorithms have been reviewed in [31], including exhaustive search, stepwise selection, L1 penalization, and so on. Computation cost is an important consideration in ship-radiated noise processing because the data points per second are much more than that of the signals from physical dynamics or the economic process. In this section, we extend the proposed SCPD to the multiple change-points case and reduce the computation cost taking advantage of the sequential structure of the OPD.
We assume that the audio signal follows a piecewise stationary model, with an unknown number of change-points in the OPD. Under this assumption, by testing each data point as a candidate change-point, we generalize the hypothesis testing for a single change-point detection to the multiple change-points detection. The null hypothesis H 0 states that data point j is a change-point, while the alternative hypothesis states that the data point j is not a change-point. Obviously, for an audio signal, there is a large number of tests. In addition, as these tests follow a sequential structure, they do not belong to the typical multiple testing [32]. According to the sequential structure, the test is performed in a series of analysis windows, as where L is the the length of the analysis window, p 1 , and p 2 are the parameters of the two different OPDs, and p 1 = p 2 . Using Equation (23), we can efficiently estimate the OPDs over shifting windows in two steps. First, to avoid redundant hashing of an ordinal pattern from the same location, the hashing result at each data point is stored for repeated use in tests on different analysis windows. Then, the OPD is estimated in a manner similar to CUMSUM in [33,34] but very straightforward. Specifically, by computing the cumulative sum of the number of occurrences for each ordinal pattern point by point, the OPD is estimated from the difference between the two cumulative sums at the beginning and end of the analysis window, instead of counting the occurrences of each ordinal pattern in the analysis window. In this way, we obtain the number of occurrences for each ordinal pattern in each possible segmentation efficiently. Then, the cumulative sum of the number of occurrences for each ordinal pattern at time t is where C k,t represents the number of occurrences for the ordinal pattern π k during time 1-t. C k,t can be computed in an iterative manner, as where o t ∈ O π . Then, the corresponding permutation entropy PE(t 1 , t 2 ) of analysis window (t 1 , t 2 ] from the difference of Cum t 1 and Cum t 2 is An essential characteristic of the multiple change-points detection is its local nature. The OPD of an audio segment depends only on the ordinal patterns in the range from the previous change-point to the next change-point. The estimated distribution might be biased due to the use of ordinal patterns outside the specific range, which follows different probability distributions. Therefore, a shifting analysis window with fixed length as Equation (23) may deteriorate the segmentation performance. However, as the next change-point of the data point under test is unknown, it is infeasible to use only the ordinal patterns relevant to the current hypothesis testing. Considering the local nature of the multiple change-point detection, we use a searching strategy for change-point with a variable analysis window. At the beginning, we check whether there is a change-point in an analysis window of length W init . If a change-point exists, the precise location of the change-point is then estimated by Equation (21). Correspondingly, if no change-point detected, we grow the length of the analysis window in steps of W grow until it contains a change-point or reaches the maximum window length W max . After a successful detection of a change-point, we perform SCPD in a new analysis window of length W init begin from the detected change-point. If the length of the analysis window reaches the maximum with no change-point detected, we begin a new SCPD from the end of the previous analysis window. We repeat the test and obtain multiple change-points sequentially until the analysis window reaches the end of the signal. In Figure 2, we illustrate the local search strategy with a variable window. Another important consideration is that there exist many types of random noise with unknown distribution in the SRN. It is easy to detect false change-points in the region where the signal-to-noise ratio is low [35]. The false change-points will result in many additional small segments in the audio segmentation, which is prone to inducing errors in the processing of the SRN. Following [36], we add a minimum length constraint W margin in the proposed method for the MCPD. With the significance level α and a variable analysis window, the proposed method is illustrated in Algorithm 1.

Results and Discussion
In this section, we evaluate the performance of the proposed method on both synthetic signals and real-world SRN, by comparison with the BIC based segmentation methods [10].
The BIC based algorithm is the most widely used method in audio segmentation. Because increasing the number of model parameters improves the likelihood function but makes the model prone to overfit, the BIC method introduces a penalty factor λ related to the number of model parameters in the likelihood function, as where L is the likelihood function, X is the set of samples, M is a parametric model, m is the number of free parameters in the model, and N is the number of samples. According to Equation (27), we transform the audio segmentation into a model selection problem. In the case that no change-point exists in the sequence of short-term acoustic features, we use M 0 to model the statistical characteristics of the acoustic features in the analysis window. In the case that a change-point exists in the sequence of short-term acoustic features, we model the statistical characteristics of the acoustic features in the segments before and after the change-point with M 1 and M 2 , respectively. Then, the log-likelihood ratio of the two cases is If LR(i) > 0, the estimated change-point is located where the LR(i) reaches the maximum value. More detailed information about the BIC audio segmentation can be referred to in [10].
To make a fair comparison for MCPD, the BIC audio segmentation method and the proposed method share the same search strategy and hyperparameters. In addition, as the SRN varies mainly in amplitude and frequency, we choose the energy and the ZCR as the basic acoustic features. Then, we use the normal distribution to model the statistics of the acoustic features. In the following, we refer to the BIC based segmentation methods on the energy and the ZCR as the BSoE and the BSoZ, respectively. Compared to the ordinal pattern, the estimation of these two acoustic features requires a longer short-term window W short . We set the short-time window length W short as 50 to compute short-term acoustic features. Additionally, we set the step size of the window to one, so that the temporal resolution of the three methods are all one.
This section is organized as follows. In Section 3.1, we conduct experiments on synthetic signals. The performance of the proposed method is evaluated on three different types of signals that are generated, including signals with single change-point, signals without change-point, and signals with multiple change-points. In Section 3.2, we apply the proposed method on the ShipsEar dataset [37], and measure the segmentation performance by time-weighted classification accuracy of the segments.

Single Change-Point Detection
We generate three different types of signals to evaluate the performance of the proposed method for SCPD, namely y mix , y N , and y chirp . y mix consists of two different parts. The first part of y mix is Gaussian white noise, and the second part of y mix is a single-frequency signal contaminated by Gaussian white noise. The formula to generate y mix is where n 1 (t) ∼ N(0, 1) and n 2 (t) ∼ N(0, 3 20 ) are Gaussian white noise [38]. According to the coefficients in Equation (29), the power ratio of the first part to the latter one is 2:3, and, in the second part, the power ratio of the single-frequency signal to the random noise is 10:1. y N is Gaussian white noise with gradually increasing amplitude, and y chirp is a chirp signal with gradually increasing frequency. The formulas to generate y N and y chirp are where n(t) ∼ N(0, 1) and t = 1, 2, · · · , T.
In these experiments, y mix approximates the SRN with a line spectrum submerged in ambient noise, y N approximates the ambient noise with gradually increasing amplitude, and y chirp approximates the SRN generated from a propeller whose rotation speed is increasing. In the latter two cases, no change-point exists in the signal. The three types of synthetic signals are shown in Figure 3. Synthetic signals generated for performance evaluation for single change-point detection (SCPD). Figure 4 shows the results of SCPD on a random realization of y mix , which has a change-point at j = 2345. Though the locations of the estimated change-points vary from each other, all three of the methods achieve a satisfactory accuracy. Furthermore, to lower the bias due to the random realization of y mix , we generate 50 different realizations of y mix with change-points randomly located in the range (500, 4500). The performance of the three methods for SCPD is measured by j −ĵ, the distance of the estimated change-point from the true change-point. The distributions of j −ĵ are shown by the box-and-whisker plots in Figure 5.  As shown in Figure 5, the mean of j −ĵ computed by our method is approximately 0, while that of the BSoE and the BSoZ are about five and 15, respectively. The comparison of the mean of j −ĵ indicates that our algorithm detected change-point with more precise locations. One explanation for this is that both the BSoE and the BSoZ require a short-term window W short to calculate the acoustic features. Instead, the proposed method computes ordinal patterns from only m data points on the waveform. As m W short , our method has a higher temporal resolution. In addition, Figure 5 also shows that the proposed method has the smallest standard deviation of j −ĵ, which indicates the robustness of the proposed method.
As there is no change-point in y N or y chirp , we investigate the log-likelihood ratio (LR proposed , LR BSoE , LR BSoZ ) of the hypothesis test. The log-likelihood ratio at each data point in y N and y chirp is shown in Figures 6 and 7, respectively. The existence of change-point is tested by whether the log-likelihood ratio exceeds the corresponding critical threshold. According to Equation (21), the critical threshold of the proposed method relies on the significance level α. Three critical thresholds are shown in Figures 6 and 7, corresponding to α = 0.01, α = 0.02, and α = 0.05. According to Equation (28), the thresholds of the BSoE and the BSoZ are both 0. Figure 6 shows that the BSoE detects a change-point in the middle of y N , whereas the BSoZ reports no change-point because the frequency of y N does not change over time. The log-likelihood ratio of the BSoZ on y N is generally below zero, except that a false change-point detected at the end of y N , where LR BSoZ > 0. As shown in Figure 7, the BSoZ reports a change-point in the middle of y chirp , where the BSoE detects no change-point as the log-likelihood ratio is below zero. In addition, the BSoE overestimates the log-likelihood ratio at the beginning of y chirp .
Overall, the BSoE estimates a change-point in the middle of y N , while the BSoZ locates a change-point in the middle of y chirp . Both the BSoE and the BSoZ overestimate the log-likelihood ratio at the beginning or end of the signal because the length of the two tentative segments is significantly different. Since the mean and the variance among neighboring data points are not considered in ordinal pattern analysis [23], the proposed method detects no change-point on both y N and y chirp , which is favorable in SRN segmentation. Additionally, compared with the BSoE and the BSoZ, the log-likelihood ratio of our method exhibits smaller deviations, which implies its robustness to random noise.
We establish three metrics, e num ,ē and e max , to measure the performance of the three methods for MCPD. As false change-points are inevitable in the presence of noise [39], the estimated change-points do not correspond one-to-one with the true change-points. In this experiment, for a true change-point p t , the estimated change-point closest to it is selected as its estimation, denoted as p r . The performance metrics are calculated according to where N p is the number of true change-points, N e is the number of estimated change-points, and e num is the difference between N e and N p .ē and e max are the mean and maximum of the bias |p t − p r |, respectively. e num reflects the robustness of MCPD, andē and e max measure the accuracy of the estimated change-points collectively. The three methods are tested multiple times with different initial window length W init and window growth length W grow , in order to evaluate their performance for MCPD. In addition, in the test on a realization of y multi , the three methods share the same W init and W grow . According to the assumption in SCPD, W init is chosen to avoid the two nearest change-points (j = 4500 and j = 5000) included in one analysis window. Specifically, for y multi , the critical value of W init is 2000. Five different W init are used in the tests, i.e., 500, 1000, 1500, 2000, 2500, and the corresponding W grow are set as 1/4 or 1/2 of the W init . For each combination of W init and W grow , we apply the three methods on 50 random realizations of y multi and list the performance metrics in Table 1.
The e num in Table 1 indicates the accuracy of the estimated change-point number. When W init is below the critical value, the larger the value it takes, the smaller the e num . In the case W init equals 2000, the e num of all the three methods reach the minimum. In addition, the proposed method achieves the least e num , about 1/10 of the other two. With W init below 2000, the e num of the proposed method is significantly smaller than that of the BSoE and the BSoZ. If W init violates the assumption in SCPD (W init = 2500), the e num of the proposed method is larger than that of the other two methods. Additionally, W grow also affects the results of MCPD, but only as a fine-tuning of W init .
Theē and e max in Table 1 measure the accuracy of the estimated change-point location collectively. Theē of the BSoE and the BSoZ are small in the case that W init = 500 or W init = 1000. This is because there are more extra change-points in the results, which may decreaseē and e max , according to Equation (32). Theē and e max of all the three methods become significantly large when W init is above the critical value (W init = 2500). When W init takes other values, it has no noticeable effect onē and e max . In the cases that W init follows the assumption in SCPD,ē and e max of the proposed method are the smallest.
Overall, the results in Table 1 verify the effectiveness of our method on detect multiple change-points, even with the not well-tuned parameters.  Figure 9. Primary stages of segmentation and classification of ship-radiated noise (SRN).
As the spectra of the SRN are concentrated in a range of low frequencies, we downsample the original signal to 6000 Hz in the preprocessing stage. With a lower sample rate, the computation cost for subsequent audio segmentation and classification is reduced while the informative characteristics of the SRN are preserved. Then, the original ship types are combined into four classes based on their tonnage [37], as the record number of some ship class is far fewer. For instance, there are only one and two records for the Trawler and the Tugboat, respectively.
In the segmentation on elementary features, we apply the BSoE, the BSoZ, and the proposed method on the SRN. Both the short-term window length and hop length in the BSoE and the BSoZ are 0.1 s. There is no overlap between adjacent windows as the window length equals the hop length. Both the energy and ZCR are computed from a window with 600 data points. In the proposed method, ordinal patterns are computed with order m = 3 and time delay τ = 1, according to [24]. In addition, all the parameters for MCPD in the three methods are set to the same value. Specifically, W init is 10 s, W step is 2.5 s, W margin is 2 s, and W max = ∞. W max = ∞ means that, by continuously increasing the length of the analysis window, the search for the next change-point restarts only when a valid change-point is found. Figures 10 and 11, respectively, show the results of MCPD conducted on the two typical SRN records. The first record comes from a passenger ship entering the port, and the second comes from a pilot boat passing by the hydrophone. The spectrums are calculated from the amplitudes of the short-time Fourier transform, with a short-term window length 512 sampling intervals and a hop length 256 sampling intervals. Figures 10 and 11 show both the waveform and the corresponding spectrum of the two records, describing the signal from aspects of the time domain and frequency domain. The dotted lines show the locations of the estimated change-points. Shifts in both the waveforms and the corresponding spectrums of the two records are evident, but no apparent change-point exists. The three methods obtain different change-points using distinct elementary features and criteria. The BSoE tends to detect a change-point where the amplitude changes while the BSoZ inclines to report a change-point where the spectrum changes. Although the proposed method does not make use of the amplitude and frequency of the signal, it also obtains satisfactory segmentation results according to the OPD. In the detailed feature vectors' extraction, we extract feature vectors from the obtained audio segments using a two-stage feature extraction approach [40]. Firstly, we split each audio segment into short-term parts with equal length, and calculate acoustic features from each part. Then, we compute the statistics of the acoustic features and combine them into a detailed feature vector. Specifically, with non-overlap short-term windows of length 50 ms, we extracted a set of acoustic features using the Librosa toolkit [41], including the shape characteristics of the spectrum (centroid, bandwidth, contrast, flatness, and roll-off), the second-order polynomial coefficients of the spectrum, MFCCs, and the chroma features. There is a total of 39 elements in the acoustic features of an audio segment. For every element of the acoustic features, we calculate its mean and standard deviation and combine them into a detailed feature vector of dimension 78. Finally, these detailed feature vectors are constructed as a dataset for SRN classification.
Because the audio segmentation bases only on several primary metrics, there may exist some false change-points. Therefore, we refine the result of audio segmentation according to the similarity among detailed feature vectors. In this experiment, the similarity between adjacent samples is measured by the Euclidean distance between corresponding feature vectors. If the Euclidean distance below a specified threshold, we regard the change-point between them as a false change-point and merge the two audio segments into one. In this way, we gradually increase the threshold until we obtain 1500 samples from the result of each method. Figure 12 shows the time durations of the obtained refined segments. The width of each bin in the histogram is two seconds. The segments longer than 18 s are not included in the histograms because the corresponding counts are relatively few. The time duration of the segments obtained by the proposed method mostly range from three seconds to nine seconds. In addition, compared with the other two approaches, the proposed method obtains fewer irregular audio segments. We use two classifiers in the classification: the support vector classifier (SVC) and the random forest (RF) classifier. The two classifiers generally achieve high performance on most types of data, which is favorable to assess the quality of the obtained segments. For the SVC, we choose radial basis function as the kernel function, with regularization parameter C = 1 and kernel coefficient γ = 1/n f , where n f = 78 is the dimension of the sample, as mentioned before. For the RF, we use Gini impurity as the indicator in the partitioning, and each RF contains 100 decision trees.
Because the time duration of the samples varies in a wide range, the classification accuracy is weighted by the time duration of the segments, which presents the time proportion of correctly classified segments in the total signal. In addition, we use stratified 10-fold cross-validation to assess the classification accuracy. Table 2 shows the weighted classification accuracy of the segments obtained by the three methods. The proposed method achieves the highest classification accuracy, either with the SVC or the RF classifier. The comparison of the classification accuracy implies that the proposed method obtained higher-quality segments of SRN. Additionally, the standard deviations of the classification accuracies obtained by the proposed method are small, which means the proposed method generates few irregular segments from SRN.

Conclusions
In this paper, we propose an audio segmentation method for SRN to improve the identification performance of ship statuses/categories. Based on the OPD, we establish a criterion for change-point detection and apply it to the SCPD and the MCPD. By comparison with the BSoE and the BSoZ, we evaluate the performance of the proposed method on both synthetic signals and real-world SRN. For the synthetic signals, the evaluation results show that the proposed method estimates the number and location of the change-points more accurately. For the real-world SRN, according to the classification results obtained by SVC and RF, the proposed method achieves the highest mean classification accuracy with a small standard deviation, which verifies the effectiveness of the proposed segmentation method.
The advantages of the proposed segmentation method for SRN are summarized as follows: