Unsupervised Clustering of Argo Temperature and Salinity Proﬁles in the Mid-latitude Northwest Paciﬁc Ocean and Revealed Inﬂuence of the Kuroshio Extension Variability on the Vertical Structure Distribution

In the mid-latitude Northwest Paciﬁc Ocean, subtropical and subarctic waters meet, intermingle, and mix to form multiple ocean domains characterized by a variety of unique oceanographic structures. The majority of previous studies on the characteristics and distribution of these ocean structures were based on existing deﬁnitions of the structure and water mass of interest derived from spatiotemporally limited data. With the massive amount of data accumulated by Argo and a small spatiotemporal bias, data-driven scientiﬁc methods may be able to objectively recapture the ocean structure. In this study, we used unsupervised clustering to analyze the temperature and salinity proﬁles from Argo ﬂoat in the mid-latitude Northwest Paciﬁc Ocean, and the results were compared to previous knowledge of ocean structure. The results showed that classes were distributed to form speciﬁc regions and each class has diﬀerent oceanographic features that generally correspond to the previously described regional divisions. A striking advantage of this new method is that it quantiﬁes the relative abundance of each class of proﬁles on a 1 ° grid. Furthermore, we discovered that the dynamic state of Kuroshio Extension aﬀects the distribution of some classes and the percentage of proﬁles that are robust to clustering. This suggests that the stability of the Kuroshio Extension ﬂow path has an eﬀect on the distribution of ocean structure. These ﬁndings suggest that unsupervised clustering is useful in the analysis of oceanographic structures, allowing us to investigate oceanographic structures in areas that have not previously been extensively studied.

global ocean, a large amount of data on the vertical structure of the ocean is now available.We classified this data into five types of vertical structures using unsupervised clustering, a machine learning technique.The characteristics of the vertical structures and the geographic distribution of the data with a specific structure type were consistent with previous research.We demonstrated how the path of the Kuroshio Extension, a strong ocean current flowing in the midlatitude Northwest Pacific Ocean, affects the geographic distribution of data with a specific vertical structure and the mixing of seawater as an example of how this clustering can be used to analyze the spatiotemporal variability of ocean structure.Using this method, it will be possible to clarify the vertical structure of the South Pacific and Indian Oceans, which have received little attention thus far.

Introduction
In the mid-latitude Northwest Pacific Ocean, subtropical and subarctic waters meet, mingle, and mix.This process creates a vertical structure with characteristics that are halfway between the subtropical and subarctic oceans, resulting in a variety of vertical structures in different areas of this ocean region.Because the vertical structure of the oceans has an impact on ecosystems and the exchange of heat and fresh water with the atmosphere, numerous studies have been conducted to define areas with distinctive structures and their boundaries.For example, the north of about 45°N is called the subarctic region, which has a temperature inversion layer (Uda, 1963;Favorite et al., 1976).Belkin et al. (2002) suggested that this region can be divided into a northern part with a clear cold temperature minimum in the subsurface layer and a southern part with a weaker temperature minimum in a slightly deeper part.The east of Honshu up to 155°E is called the Mixed Water region or Kuroshio-Oyashio Interfluence region, where the Kuroshio and the Oyashio approach each other.Both currents change their course to the east while interacting intensively (Favorite et al., 1976).Mesoscale eddies such as Kuroshio warm-core rings and cold-core rings, formed by the detachment of northward and southward meanders, respectively(e.g., Yasuda et al., 1992;Nakano et al., 2013), are distributed in this region.It has been observed that these eddies interact and move (Yasuda, 2003).Offshore of the Mixed Water region is the Transition Domain, which is where subtropical and subarctic waters mix and diffuse (Favorite et al., 1976;Yasuda, 2003;Kida et al., 2015).Favorite et al. (1976) defined the Subarctic Front (SAF) as the position where the temperature reaches 4°C at 100 m depth, and the Subarctic Boundary (SAB) as the position where salinity reaches 34.0‰ from the surface to the subsurface, which is the northern and southern limits of the Transition Domain, respectively.
Many fronts corresponding to the boundaries of these regions and showing steep horizontal gradients in temperature and salinity are also known, and sometimes the same front has been called by different names.Roden et al. (1982) proposed three fronts from north to south: the Polar Front, the Subarctic Front, and the Kuroshio Front.Note that the Subarctic Front in Roden et al. (1982) refers to different things from the Subarctic Front (SAF) defined by Favorite et al. (1976).The Polar Front corresponds to the southern boundary of the subsurface water temperature minimum, the Subarctic Front to the southern boundary of the subarctic low-salt water mass, and the Kuroshio Front to the northern boundary of the Kuroshio Extension(KE).Favorite et al. (1976) define the Polar Front and theSubarctic Fronts as the SAF and Subarctic Boundary (SAB), respectively.Belkin et al. (2002) proposed that the Polar Front is the boundary that separates the northern and southern parts of the subarctic region.Some fronts corresponding to ocean structure boundaries have also been interpreted as boundaries of water mass distribution areas.Among water masses, those with uniform characteristics in the horizontal and vertical directions formed inside or near the top of the permanent pycnocline are called mode waters (Hanawa & Talley, 2001).The typical mode waters in the mid-latitude Northwest Pacific are the Subtropical Mode Water (STMW: Masuzawa, 1969) and the Central Mode Water (CMW: Nakamura, 1996;Suga et al., 1997).Such mode water reflects the temporal variability of oceanic and climatic conditions in the formation area (Hanawa & Talley, 2001).For the STMW in the North Pacific, Suga et al. (2004) found it to exist in a density layer with = 24.8-25.7 kg m −3 in the northwestern part of the subtropical circulation.Oka et al. (2011) also found that the range where most of the data show mixed layer depths deeper than 150 dbar in winter, i.e., = 16.0-21.5°C,S = 34.65-34.95,= 24.2-25.6kg m −3 was assumed to represent the STMW in the North Pacific.Nakamura (1996) showed that in the east of the dateline, CMW is formed north of the 9°C isotherms at a depth of 300 m (T 300 = 9°C) and south of the SAF.On the other hand, Suga et al. (1997) found that an isothermal layer corresponding to the CMW was distributed between the KE (T 300 = 12°C) and the Kuroshio Bifurcation Front (KBF) (T 300 = 8°C) centered at 160°W and that the core temperature was 10°C-13°C.Mecking & Warner (2001) showed that the CMW is divided into a light part ( = 26.0-26.2kg m −3 ) and a dense part ( = 26.2-26.4 kg m −3 ) at 38°N from the CFC-12, potential vorticity, and oxygen cross-section at 179°E.Tsujino & Yasuda (2004) used numerical experiments and showed that the lighter variety of CMW (L-CMW) is formed between KE and KBF, while the denser variety of CMW (D-CMW) is formed between KBF and SAF.The CMWs in Nakamura (1996) and Suga et al. (1997) may refer to D-CMW and L-CMW, respectively.The observations also support the existence of L-CMW and D-CMW by Oka & Suga (2005).In their study, L-CMW was characterized by 12.9 ± 0.9°C and = 25.8-26.2kg m −3 , while D-CMW was characterized by 8.8 ± 1.2°C and = 26.3-26.4 Most of the above studies on the ocean structure in the mid-latitude Northwest Pacific have been conducted somewhat subjectively, with targets and areas set based on the existing knowledge of the specific ocean structure of interest to the researcher, have not been carried out in a way that provides a comprehensive view of the ocean structure without being strongly influenced by specific interests.This is partly because the vertical data needed to capture the ocean structure used to be obtained almost entirely from shipping observations, and thus the number of data was minimal and its spatial and temporal coverage was limited.With the addition of Argo floats to the observation network, a large amount of data on the vertical structure of the ocean is now available.Nonetheless, ocean structure analysis is still carried out in the same manner as before, based on existing knowledge.Sugimoto and Hanawa (2007), for example, defined the mode water formation area as a rectangle based on previous study results and analyzed it within that rectangle.Guo et al. (2018) and Oka (2009) established criteria based on existing knowledge for specific parameters such as potential vorticity and mixed layer depth and then used them to identify the mode water.However, given the availability of more than 2 million Argo float data points globally (Wong et al., 2020), it could be possible to capture the ocean structure using data-driven science, which is only possible with massive and geographically unbiased data.
As a method of applying data-driven science concepts to Argo data, Maze et al. (2017) developed the Profile Classification Model (PCM), which applies unsupervised clustering to vertical profiles of the ocean to classify profile shapes.They found it difficult to see the structure and variability of the heat storage pattern using a simple traditional method such as dividing the region into rectangles, so they applied unsupervised clustering to the Argo data to identify the heat storage pattern.They applied PCM to Argo data for the North Atlantic Ocean and discovered that the distribution of each class is concentrated in a specific area on the map.In other words, it is expected that the PCM will be able to extract the characteristics of ocean structure that differ from region to region.Another useful feature of PCM is that it can use a labeling metric, which measures the robustness of the clustering.Several previous studies have used PCM in the Southern Ocean (Jones et al., 2019;Rosso et al., 2020) or the Amundsen Sea Embayment (Boehme & Rosso, 2021).Houghton & Wilson (2020) clustered Pacific Ocean temperature data by k-means clustering, which is the other unsupervised clustering than PCM.They discovered that the clusters' spatiotemporal variability was strongly associated with El Niño events.However, no studies on the ocean structure of the mid-latitude Northwest Pacific Ocean have been conducted, nor have labeling metrics been used effectively.
The purpose of this study is to apply unsupervised clustering by PCM to temperature and salinity profiles by Argo float in the mid-latitude Northwest Pacific Ocean to compare the characteristics found in the geographic distribution and the vertical structure of temperature and salinity of each class with previous findings and to identify correspondences.We discuss the relationship between the variability of the dynamic state of KE, a significant current in the mid-latitude Northwest Pacific, and the distribution and vertical structure of each class, as well as the labeling metric, a measure of mixing between different classes, based on the spatiotemporal variability of the clustering results.This is an attempt to understand the spatiotemporal variability of ocean structure from PCM results.Section 2 describes the data used in this study.Section 3 describes the specific analysis required to perform PCM and the method used to compare the clustering results with the dynamic state of KE.Section 4 presents the results of the clustering and a discussion of each class.The clustering results are then grouped based on the dynamic state of the KE to examine the temporal variability of the distribution of classes and profiles with low labeling metrics.Chapter 5 provides a summary and describes future challenges and perspectives.

Data
This study used Argo data of potential temperature and salinity obtained in the mid-latitude Northwest Pacific Ocean.The Argo data is based on a Snapshot of Argo as of June 8th, 2019, as provided by the Global Data Assembly Center (Argo, 2019).The temperature and salinity ranges from 30°N to 60°N and from 140°E to 180°.Because the depths at which temperature and salinity were measured differed for each profile, the vertical grid count of the profiles was unified by interpolating the range of depths from 5 m to 1000 m every 5 m using the Akima spline (Akima, 1970).To avoid overshoot during interpolation, only profiles with vertical intervals of 50 dbar or less at 400 dbar or less and 100 dbar or less at 1000 dbar or less were used.This condition is the same as that set by the MOAA GPV (Hosoda et al., 2008), which is also complemented by the Akima spline.Both Real-time and Delayed data modes were used.Since the purpose of this study is to capture the characteristics of the average ocean structure, it was considered acceptable to use the Real-time mode as long as the QC was passed.Only QC flags of 1 (Good data) for pressure, temperature, and salinity between 5 m and 1000 m depth were used for Real-time, and 1 (Good data), 2 (Probably good data), 5 (Value changed), and 8 (Estimated value) were used for Delayed.As a result, the number of data used was 56414.The annual distribution of the number of data used is shown in Fig. 1.In this study, clustering was used to classify all the data without separating them by season or year.Fig. 2 depicts the geographic distribution of the data as the number of data points in a 1°grid.While there are more than 200 data points in a 1 °grid in the ocean around Japan, some areas in the northern part of the ocean have less than 50 data points.In this study, the machine learns from the data to determine how to configure the class characteristics to best classify the data.As a result, it is easy to predict that if model learning is performed while accounting for geographic bias in the data, the learning results may significantly reflect information from a specific region.In other words, the dataset used for model learning must have a uniform geographic distribution.
In this study, we randomly extracted 5 data points per 1°grid because most regions where data exist within the analysis range have at least 5 data points per 1°grid.If the number of data points in a degree grid was less than 4, we extracted all of them.As a result, we created a training dataset from a total of 4544 data points.
3 Materials and Methods

Gaussian Mixture Model
In this study, we use PCM, a method developed by Maze et al. (2017) described in Section 1. PCM uses an unsupervised learning method based on the Gaussian Mixture Model(GMM) as a method for clustering ocean profiles.In this section, we briefly describe the theory of the GMM.
GMM is a model that reproduces the probability density function by adding weighted normal distributions together.Here one normal distribution is described as follows: where x ∈ ℝ ×1 is a vector representing the profiles in the dataset X ∈ ℝ × to be analyzed. is the number of profile dimension, and N is the number of profiles in the dataset.∈ ℝ ×1 and Σ ∈ ℝ × are the mean (vector) and covariance (matrix), respectively.Eq. ( 1) is given as a scalar for a single profile, and the entire dataset,  (x; , Σ) has the ℝ 1× value.Using this normal distribution, the probability density function (X) by GMM can be expressed as follows: where  denotes the number of classes for clustering, for this K, we give an arbitrary value before training the model.  represents the weight for each normal distribution and ∑  =1   = 1.  can also be taken as the prior probability that the class of a profile is k, i.e., (c = ).The weights, mean, and covariance of the normal distribution have unique values for each class.In other words, in GMM, a class is represented as a normal distribution with specific three parameters: a weight   , a mean  , and covariance Σ  .The learning process of GMM is to learn these parameters so that the probability density function of the actual data set is approximated by Eq. ( 2) and to determine the shape of the normal distribution that represents the class.The Expectation-Maximization algorithm (Dempster et al., 1977) is used to learn the parameters, but the details are not described here.
When clustering each profile from the trained model, we use the posterior probability (c = |x).By calculating the posterior probability for a profile x for all classes, we can determine the probability that the profile belongs to each class.The posterior probability has a value of 0-1 and adds up to 1 for all k.So, for example, if the profile has purely Class 2 features, then  (c = 2|x) = 1 and  (c ≠ 2|x) = 0. Conversely, if the profile has features from all classes equally, then  (c =|x) = 1/ for all classes.We label the profile as belonging to the class where the posterior probability is the largest.
Since we can obtain the posterior probability for each class, we can calculate the labeling metric, which is a measure of the robustness of the classification of each profile, as follows: where   is the highest posterior probability used for labeling.The labeling metric takes values from 0 to 1, and since   = 1 for only certain classes of features, as in the previous example, the labeling metric is also 1.If the profile has features of all classes equally, the labeling metric will be 0 because   = 1/.Maze et al. (2017) corresponded the labeling metric and clustering robustness as Virtually certain for 0.99-1, Very Likely for 0.9-0.99,Likely for 0.66-0.9,As likely as not for 0.33-0.66,and Unlikely for 0-0.33.

Principal Component Analysis
The profiles used in the clustering analysis of this study have a dimension of 400 layers because they are interpolated vertically from 5 m to 1000 m at 5 m intervals for temperature and salinity.As a result, the number of combinations required to represent the covariance matrix with dimensions of D × D is far greater than the number of profiles in the training dataset.This means that the number of profiles is insufficient to learn the covariance matrix, resulting in a decrease in learning accuracy.Thus, in machine learning, including GMM, it is known that the accuracy of learning becomes worse if the number of dimensions of the data used for learning is too large.This problem is known as the "curseof-dimensionality." To avoid this problem, it is known that reducing the number of dimensions by Principal Component Analysis(PCA) is an effective method (Verleysen & François, 2005;Wang & Sloan, 2007).PCA transforms a D-dimensional dataset X into a d-dimensional dataset y ∈ ℝ × as follows: where P(, ) is a D × d-dimensional matrix, and each column of P is an eigenvector of X ⊤ X.Through this P, the data is transformed from a space of D dimensions to a space of d principal components.The principal components are, in order of increasing variance of the data, the first principal component, followed by the second principal component, and so on.This corresponds to the order of increasing eigenvalues for the eigenvectors of X ⊤ X.
As in Maze et al. (2017), to determine how many principal components to use, we used the centered root mean squared difference (CRMSD) between the data before dimensionality reduction and the data restored to the original number of dimensions after dimensionality reduction.CRMSD is calculated as follows: where x  () is the data before dimensionality reduction and   () is the data restored after dimensionality reduction.2017) adopted d = 11 as the number of principal components for which the CRMSD of temperature is less than 0.5°C over all layers and less than 0.1°C below 600 m depth.In this study, salinity is used in addition to temperature, and the variation of salinity is about 1/10 of that of temperature.Using the same scale criteria as Maze et al. (2017), the CRMSD for all layers would be less than 0.5°C for temperature and less than 0.05 for salinity, and the CRMSD for depths below 600 m would be less than 0.1°C for temperature and less than 0.01 for salinity.However, even when the number of principal components was increased to d = 13, the CRMSD of salinity remained greater than 0.01 below 600 m depth, as shown in Fig. 3. On the other hand, the cumulative contribution rate is high enough, more than 99.5%, and we can conclude that there is little advantage in increasing the number of principal components further.In this study, we judged that the condition below 600 m depth is unnecessary.Therefore, the optimal number of principal components was considered to be d = 9, where the CRMSD of all layers was less than 0.5°C for water temperature and less than 0.05 for salinity, for a total of 18 principal components.

Deciding the number of class
As described in 3.1.1,GMM allows the number of classes, K, to be given as desired by the calculator.In other words, the dataset does not constrain the number of classes, and clustering can be performed with an arbitrary number of classes.As a result, it is necessary to determine the most appropriate number of classes for the dataset to be clustered.In this study, as in previous studies using PCM, including Maze et al. (2017), the Bayesian Information Criterion (BIC: Schwarz, 1978) is used to derive the statistically appropriate number of classes.The BIC is given as follows: where ℒ() is the log-likelihood when trained on K classes, and  is the number of profiles to be trained.  () = ( − 1) +  + ( − 1)/2 is the number of independent variables corresponding to , , and Σ in order from the first term.The number of classes K for which the BIC is minimized is considered the optimal number of classes.
To obtain the number of Argo float profiles independent of each other, we refer to the scale presented in Ninove et al. (2016), where the profiles are horizontally independent.According to Ninove et al. (2016), for the Argo float profile, the scale at which the profiles are horizontally independent for both temperature and salinity is 110-130 km in the Northwest Pacific.On the other hand, the area of the present analysis, excluding the area where there is no data, is approximately 9,000,000 km 2 .Assuming that the horizontal correlation scale is 120 km, i.e., there is one independent profile every 120 km × 120 km = 14400 km 2 , the number of independent profiles in the analysis area is estimated to be 625.Therefore, we randomly selected 625 profiles from the training dataset and performed the BIC calculation 50 times.The BIC calculated in the above manner is shown in Fig. 4. Since K can be minimized in the range of 3 to 6, clustering trials were conducted for K = 3, 4, 5, and 6, respectively.However, because the EM algorithm's initial values are random, clustering by GMM may produce results with an initial value dependency.We compared the clustering of 100 initial values for each class number to assess the dependence on the initial value.As a result, for K = 5, almost all of the 100 clustering results were identical.Therefore, we judged that we could obtain results without initial value dependency if the number of classes was 5 and analyzed the clustering using the model with K = 5.

Grouping by KE index
As described in the introduction, the location of ocean structure boundaries and fronts are known to change spatially with time (Belkin et al., 2002;Suga et al., 2003;Qiu & Chen, 2005).Therefore, it is easy to imagine that the spatial distribution of the vertical profiles classified into each class will show spatiotemporal variations.A typical phenomenon that is considered to affect the ocean structure mainly in the southern part of the study area is the stability and instability of the dynamic state of KE (Qiu & Chen, 2005, 2010)In the mean-field, the KE flows eastward around 35°N.It flows with meandering that changes with time at times.In such cases, the dynamic state of KE is considered unstable.On the contrary, KE's dynamic state is considered stable when it does not meander significantly and follows a stable path.The flow velocity of KE increases and recirculation is enhanced in the stable period, while the regional eddy kinetic energy increases in the unstable period (Qiu & Chen, 2005, 2010;Taguchi et al., 2007).Qiu et al. (2014) defined the Kuroshio Extension index (KE index) as the average of four anomalies: the flow length upstream of the KE (141°E to 153°E), the variation in sea level across the KE, the mean latitude of the KE location, and the strength of the KE recirculation.They proposed that the dynamic state of KE can be quantitatively determined by using the KE index, which is stable when the KE index is positive and unstable when it is negative.2006,2007,2008 In this study, we investigate the change of clustering results with the change of the dynamic state of KE.The KE index may take positive or negative values for a substantial period continuously or switch between positive and negative values in a short period (see Fig. 4(a) in Qiu et al., 2020).On the other hand, some period is needed to describe the characteristics of the clustering results.
To see how the KE index variation corresponded to the clustering results, we average the KE index per year for the period 2003-2018, when the number of profiles in the analysis range was more than 1000, and classify each year into five groups according to its value.Years with a mean KE index of 0.6 or higher were classified as "High," 0.2-0.6 as "Middle-high," −0.2-0.2 as "Middle," −0.6-0.2 as "Middle-low" and −0.6 or lower as "Low."Fig. 5 depicts the KE index calculated by Qiu et al. (2020) and its grouping from 2003 to 2018, and Table 1 shows the correspondence between each group and year.In almost all periods, the KE index is positive or negative in the High and Low groups.The geographic distribution of the Argo profiles and the classes to which they belong are shown in Fig. 6.The distribution of the percentage of profiles of the most abundant class in each 1°grid is shown in Fig. 7.We can see from these figures that each of the five classes is clustered in a specific area, and the regions characterized by each class are emerging.We extracted the profiles by month or season to check the geographic distribution and discovered that the geographic locations of the five classes were generally the same, with no significant difference in the distribution.As a result, this clustering is unaffected by seasonal changes and instead reflects primarily geographical characteristics.The characteristics of each class were then compared to previous findings on the characteristics of ocean structure in specific regions.Here, it is noteworthy that the robustness of the classification is quantitatively demonstrated at the boundaries of the regions classified by the PCM results (Fig. 7).This is an advantage of the PCM compared to the conventional classification of regions.The vertical structures of temperature and salinity of the profiles belonging to each class are shown in Fig. 8, and this information is plotted on the TS diagram in Fig. 9.Because this study's clustering is based on a GMM, the profiles in each class are expected to be distributed in a manner similar to a normal distribution.As a result, the mean and two standard deviations for each class are expected to be roughly the same as the median and the 5th and 95th percentile.Indeed, we found that the mean and two standard deviations were nearly identical to the median and the 5th and 95th percentiles.As a result, the temperature and salinity structure are discussed in this study using the median, the 5th and 95th percentiles, which can be safely considered to represent the mean and two standard deviations, respectively.The median and percentile are calculated for each of the 200 layers, and these are plotted in the figure by connecting the values for each layer.
According to previous knowledge (e.g., Favorite et al., 1976;Yasuda 2003), water with three characteristics, i.e., low-temperature and low-salinity subarctic water, high-temperature and high-salinity subtropical water, and in-between, are distributed in the analysis area of this study.In the present result, Class 1, Class 4 and 5, and Class 2 and 3 correspond to these three characteristics, respectively.The overall trend is for the percentile range to increase closer to sea level, which may reflect the seasonal variation of each class.
From here on, we will describe the characteristics of each class in detail.Class 1 is mainly distributed north of 45°N.It is colder and fresher than the other classes and has characteristics of subarctic water.In addition, a feature not seen in other classes is the temperature minimum below 4°C found in the subsurface layer around 100 m depth.This is a characteristic of pure subarctic water located in the northern part of the subarctic region, as Belkin et al. (2002) discussed, i.e., the northern part of the Polar Front or SAF.
Class 2 is mainly distributed between 35°and 45°N and west of 160°E.Although the distribution of Class 2 per 1°grid is the highest west of 150°E, its abundance ratio is not very high due to the presence of Class 3 in the mix (Fig. 7).The percentile range is wider than in the subtropical and subarctic classes, indicating higher intraclass variability.According to the TS diagram, the structure is subtropical from the sea surface to a certain depth in the subsurface layer and subarctic in deeper layers.As a result, Class 2 water has a structure in which subtropical water from the south and subarctic water from the north are stacked on top of one another while retaining their respective characteristics.These are consistent with the Mixed Water region's characteristics.The Mixed Water region is thought to be west of 155°E, relatively close to Japan among the intermediate zones between the subtropical and subarctic zones (e.g., Yasuda 2003).This is consistent with the geographic distribution of Class 2. The distribution of Class 2 would display the spatial distribution of vertical profiles characteristic of the Mixed Water region, along with their relative abundance.
Class 3 is distributed zonally from 35°to 45°N, with Class 2 mixed in the west of 160°E.In addition, as in Class 2, the percentile range is relatively large.On the TS diagram, the upper layer structure is subtropical, and the lower layer is subarctic, which is more moderate than in Class 2. Focusing on the vertical structure of temperature, the percentile on the low-temperature side has a weaker minimum than Class 1, around 100 m depth.This is consistent with the features found within the Transition Domain.On the other hand, the median of salinity in Class 3 shows a salinity minimum of around 200 m depth.The salinity minimum is a feature found south of the Transition Domain and south of the SAB (Favorite et al., 1976;Yasuda, 2003).Within the percentile range, the temperature at = 26.3kg m −3 is around 9°C, which is generally consistent with the characteristics of D-CMW.Class 3 is mainly distributed in the range from 40°N to 45°N in the east of 160°E, consistent with the distribution of D-CMW between KBF and SAF (Tsujino & Yasuda, 2004;Oka & Suga, 2005).As a result, Class 3 implies that at least two features, Transition Domain and D-CMW, are classified in the same class.These features are expected to be separated by extracting only the profiles classified as Class 3 and clustering them by learning a new model.There are minor differences in the vertical structure of temperature and salinity between Class 2 and Class 3, but they are not as noticeable as in the other classes.It is difficult to distinguish them based solely on temperature and salinity data.However, their distribution on the map differs.It is fascinating to note that the differences between the classes are most clearly visible in the geographic distribution, despite the fact that the clustering was done using only temperature and salinity without any information on latitude and longitude.More research is required to determine the reasons for this disparity.
Class 4 is mainly distributed south of 35°N and is mixed with Class 5 between 160°E and 170°E.Compared to Class 5, the upper layer is warmer and saltier and has characteristics of subtropical water more strongly.The temperature is 17°C-18°C from 150 m to 300 m depth, which is consistent with the STMW characteristics.It is also known that STMW is primarily distributed in the northwestern part of the subtropical circulation (Bingham, 1992;Suga et al., 2004).As a result, in terms of characteristics and geographic distribution, Class 4 is a class of profiles with STMW characteristics.
Class 5 is mainly distributed south of 40°N, with increasing distribution offshore and mixed with Class 4 from 160°E to 170°E.This distribution is downstream of KE.The percentile range is wider than that of Class 4 and remains constant to about 600 m depth.Class 5 reflects the complex structure of the water transported by the KE as it mixes with the surrounding water.There is also a part in the percentile where the temperature is about 13°C at around = 26.0kg m −3 , consistent with a feature known as L-CMW (Oka & Suga, 2005;Oka et al., 2011).In addition, it is known that L-CMW is mainly distributed between KE and KBF (Tsujino & Yasuda, 2004;Oka & Suga, 2005), and class 5 represents L-CMW in terms of its distribution area.
These results indicate that each of the classes classified by our clustering represents a geographically unique ocean structure.As a result, each class can be associated with an already known geographic region.Classes 1, 2, 4, and 5 correspond to the subarctic, Mixed Water, northwest of the subtropical gyre, and downstream of the KE, respectively.Because the distribution and characteristics of the Class 3 region differ from those of the previously known regions and cannot be simply associated with them, we refer to the Class 3 distribution region as the subtropical-subarctic intermediate region.

Comparing clustering results with KE index
In this section, we discuss the variability of the clustering results through the grouping by KE index described in Section 3.2.A. Class 4

Change of distribution at Subtropical region
Focusing on the area around 35°N and 145°E, i.e., the northern part of the northwest of the subtropical gyre, we can see that the High group is dominated by Class 4 and has a large proportion of Class 4 in the 1°grid.Class 4 is also distributed in the Middle-high and Middle groups, but its proportion in the 1°grid is decreasing.The distribution of Class 4 almost vanished in the Low group, while the distribution of Class 2 increased more than in the other groups.
In other words, as the KE index decreases, the distribution of Class 4 in the northern part of the northwest of the subtropical gyre decreases.Class 4 is also distributed in the Middle-low group, and there is a grid with a high ratio of Class 4, but the number of profiles in this group is small, and the distribution is sparse, so we cannot compare them.
We now turn our attention to the area between 30°and 35°N and 150°and 165°E, i.e., the boundary between the northwest of the subtropical gyre and the downstream of the KE.In the High group, the distribution area of Class 4 becomes narrower as it moves eastward, being replaced by Class 5 in the south and Class 3 in the north, and Class 5 is more abundant than Class 4 at 165°E.Grids with a high proportion of Class 4 are mostly found between 33°a nd 34°N.There are several grids in the Middle-high group where class 5 is the most abundant, surrounded by grids where Class 4 is the most abundant.Furthermore, the distribution of Class 4 is more southerly than that of the High Group.The Middle group has a southerly distribution of Class 4, and grids with a high proportion of Class 4 are mostly found at 32°and 33°N.The Middle-low group is not discussed because the profile is not distributed evenly across the analysis area.The northern portion of the Low group is almost entirely occupied by Class 3 and Class 5, while Class 4 is concentrated in the southern part.In other words, the shape of the Class 4 distribution area becomes more southerly as the KE index decreases.
In summary, it can be said that the distribution of Class 4 becomes longer in the east-west direction and moves southward as the KE index becomes small.It is well known that the mean latitude of the KE is more northerly when the dynamic state is stable and more southerly when it is unstable (Qiu & Chen, 2005).Because the KE limits the northern limit of the STMW formation area (Hanawa, 1987;Bingham, 1992), the dynamic state of KE influences the northern limit of the Class 4 distribution area.Therefore, it is reasonable to assume that the change in the distribution of Class 4 reflects the relationship between the flow path of KE and the distribution of STMW.Next, to verify whether this future of the Class 4 distribution can be seen on a yearly scale.We divided each group's profile for each year and compared its appearance with that of the whole group.Fig. 11 depicts the distribution of 10145 profiles from the High group, as well as the distributions for 2003, 2004, 2010, 2011, and 2018.Class 4 was indeed distributed in the northern part of the northwest of the subtropical gyre in 2004 and 2011, as was the entire High group.Although 2003 and 2018 do not form a distinct region due to the small number of profiles in the northern part of the subtropical gyre's northwest, some Class 4 profiles are present.In 2010, the profiles were distributed as well as in 2011, but most of the profiles in the northern part of the northwest of the subtropical gyre were Class 2. For the boundary between the northwest of the subtropical gyre and the downstream of the KE, Class 5 was slightly more abundant in the southwestern part in 2010.However, in 2018, Class 4 was more abundant in the same position, indicating that Class 4 has become the majority in the entire group.The characteristics of the High group remained consistent year after year, but differences were observed, such as in the northern part of the northwest subtropical gyre in 2010.Each year's distributions for the Middlehigh, Middle, and Low groups, for which a sufficient number of profiles were obtained, were similar to those of the group characteristics.

B. Class 5
In the downstream of KE, where Class 5 is mainly distributed, the distribution of Class 3 and Class 4 varies depending on the group, as shown in Fig. 10.In the High group, there are many light orange grids, and there are also green and purple grids indicating that Class 3 and 4 are the most common.This shows that many classes other than Class 5 are mixed in the distribution.In the Middle-high group, light orange and purple grids are also observed, but less in number than in the High group.In the Middle and Low groups, most of the profiles in the downstream area of the KE are Class 5. Qiu & Chen (2011) showed that the eddy kinetic energy downstream of the KE (152-165°E) is large when the KE flow path is stable and small when unstable.It is proposed that the mixed class distribution in the stable period and the almost entirely Class 5 distribution in the unstable period correspond to the eddy kinetic energy.In other words, the higher the KE index, the more Class 4 and Class 3 are mixed in with Class 5. Furthermore, we investigated whether the distribution was the same when divided into each year and discovered that it was similar.However, it should be noted that many areas cannot be discussed due to the lack of profiles for the downstream of KE (see Fig. 11).

Relation between labeling metric and KE index
We computed the labeling metric, a measure that quantifies the robustness of the classification, for all profiles as described at the end of Section 3.1.1and examined the distribution of profiles with low labeling metrics for each group.The Low Labeling metric Profile (LLP) was defined in this study as a profile with a labeling metric of less than 0.9, which Maze et al. (2017) determined to be less than Likely.The strength of mesoscale eddy activity across the KE is also related to the KE's stability and instability.When the KE is unstable, mesoscale eddy activity increases and the exchange and mixing of water across the KE becomes active.In other words, water with different oceanic structures will mix to create new waters.Profiles with new characteristics cannot be assigned to a particular class on a one-to-one basis, and the labeling metric will become low.Therefore, as the KE index becomes low and the KE flow becomes unstable, the LLP is expected to increase.
Figure 12.Vertical profiles with a labeling metric less than 0.66 are shown as thin lines.The color of the thin line is the color of the class with the secondhighest posterior probability.The thick line corresponding to the class color is the average of the profiles with a labeling metric higher than 0.99.
Before investigating the relationship between LLP and KE, we will look at the vertical profile to see what characteristics the LLP has.Fig. 12 shows a vertical profile with a labeling metric less than 0.66, which is particularly low among LLPs.In this figure, using Class 5 as an example, the profiles with higher posterior probability in Class 3 are cooler and have lower salinity, whereas the profiles with higher posterior probability in Class 4 are warmer and have higher salinity, compared to the average of the profiles with higher labeling metrics, which is shown by the bold orange line.In other words, the majority of LLPs in Class 5 have a profile that appears to be a hybrid of Class 5 and Class 3 and Class 4 characteristics.Similarly, for the other classes, the vertical profile is closer to the class with the second-highest posterior probability.However, the LLPs of Class 2 with high posterior probability in Class 3 and the LLPs of Class 3 with high posterior probability in Class 2 did not show a significant difference from the profiles with high labeling metrics.As discussed in Section 4.1, both Class 2 and Class 3 have a wide range of 5,95th percentile, and there is no significant difference between them on the vertical profile in Fig. 8. Therefore, even if Class 2 and Class 3 are mixed, and the labeling metric becomes low, it may not be clearly seen on the vertical profile of temperature and salinity.Next, we examined the relationship between the KE index and the proportion of LLPs.Fig. 13 shows the KE index and the LLP percentage deviation from the mean.The correlation coefficient between these values was r = −0.597,p < 0.1 with 8 degrees of freedom.Therefore, we can say that there is a significant negative correlation between the KE index and the percentage of LLP.This result is consistent with what would be expected from the characteristics of the KE dynamic state.However, in some years, such as 2010 and 2011, no correspondence was observed.The first reason for this is the geographical scope of KE's influence on ocean structure.The KE is in the southern part of the analysis area and is not expected to have an impact on the subarctic region.In addition, several phenomena other than KE are thought to influence ocean structure.Because all of the profiles within the analysis range were examined in this study, changes in the labeling metric caused by phenomena other than KE are likely to appear in the results.In addition, considering the time scale of mixing, it is expected to take a certain amount of time for changes in the dynamic state of the KE to affect the vertical structure of the ocean.Therefore, it is necessary to take this point into account in the analysis.The distribution of LLPs in each group of KE index is shown in Fig. 14.The labeling metric tended to be lower near the boundary of the class distribution area.The LLPs of all groups are concentrated in a zonal distribution in a southeasterly direction from off the southeastern coast of Hokkaido to around 160°E in this figure.This distribution roughly corresponds to the boundary between the Mixed Water region and the subtropical-subarctic intermediate region, as well as the northwest of the subtropical gyre and downstream of the KE.The percentage of LLPs per 1°grid is higher at the region's boundary than elsewhere.In Section 4.2.1, it was discovered that the high KE index group contained many classes other than Class 5 downstream of the KE, and similarly, grids with a high percentage of LLPs per 1°grid were more frequently found in the high KE index group.On the contrary, in the Low group, the grids with more than 10 profiles but no LLPs were often found downstream of the KE.

Conclusions
In this study, unsupervised clustering using PCM was applied to vertical profiles of temperature and salinity from Argo floats to identify the ocean structure in the mid-latitude Northwest Pacific Ocean and investigate its distribution and characteristics.Clustering using a model trained with the number of classes set to 5 revealed that the distribution of each class formed a distinct region.Each class's ocean structure is consistent with the ocean structure that is characteristic of each region, and each class can be associated with known ocean regions.Classes 1, 2, 4, and 5 correspond to the subarctic region, the Mixed Water region, the northwest of the subtropical gyre, and the downstream of the KE, respectively, while Class 3 is not simply related to a single known region and is termed as the subtropical-subarctic intermediate region.In addition, PCM provides richer information than conventional simple domain classification because it can show the relative distributional abundance in a region where profiles classified into each class are mixed.Therefore, we can say that this study shows the usefulness of machine learning in the field of physical oceanography.One point to note about the present analysis is that the classification of Class 2 and Class 3 requires further examination to determine the differences in their characteristics.In particular, Class 3 may contain ocean structures with several different characteristics, which will require further analysis and consideration.
To examine the temporal variation of the clustering results associated with the KE pathway variability, each year was grouped based on the value of the KE index, which is a measure of the stability or instability of the KE pathway.When we described the characteristics of the distribution of the classes and the labeling metric for each group, various characteristics were identified for each group.In terms of class distribution, the shape of the distribution area of Class 4 and the distribution of Classes 3 and 4 downstream of the KE changed with the value of the KE index.These variations are related to the distribution of STMW and the degree of eddy kinetic energy, respectively.A significant negative correlation was found for the LLP, where the percentage of LLP was lower when the KE index was high, and higher when the KE index was low.It indicates that mixing occurs actively when the KE is unstable.The distributions were found to extend southeastward from off the southeastern coast of Hokkaido, overlapping the boundaries of the class distribution areas.This reflects the active mixing that occurs at the geographic boundary of different ocean structures.These results indicate that machine learning is one of the most valuable methods to understand the response of ocean structure to variability phenomena.
For future work, the analysis conducted in this study needs to be interpreted from a more statistical perspective.In this study, we performed PCA to reduce the number of dimensions in clustering and found that nine dimensions for temperature and salinity, respectively, were sufficient to represent the ocean structure in the mid-latitude Northwest Pacific.The results show that the nine temperature and salinity dimensions are adequate to describe the ocean structure in the mid-latitude Northwest Pacific Ocean.It is beneficial to investigate how each of these nine dimensions explains changes in the vertical structure, not only for clustering but also for describing the vertical structure of this region in greater detail.Each class was represented in this paper as a vertical structure of temperature and salinity, with the differences in this structure discussed.However, as stated in 3.1.1,in GMM clustering, each class is represented in the field of the probability density function by a weighted normal distribution.By looking at the results of this study in the field of the probability density function, it is expected that we can see differences among the classes that are not seen in the fields of temperature and salinity.This would be also useful to understand what differentiates Class 2 and Class 3, which was not clarified in this study.
The grouping by KE index, which was done to verify the spatiotemporal variability of clustering, also can be improved.In this study, the grouping was done based on the one-year average of the KE index.However, a grouping method that takes the physical state into account more is required.Separation can be made, for example, between the period just after the KE state changes from stable to unstable and the period after that when the instability persists.Furthermore, while the correlation coefficients of simultaneous correlations were calculated in this study, lag correlations should be investigated in light of the time scale of mixing.It would be also helpful to discuss the extent to which changes in the KE index affect LLP geographically, based on correlation coefficients by location.
The unsupervised clustering analysis of ocean structure can be extended to other physical phenomena and ocean areas.In this study, we discussed the relationship between the dynamic state of KE and the clustering results.However, as mentioned in Section 4.2.2,KE is located to the south of the study's analysis area, and its impact on the subarctic region where Class 1 is distributed is thought to be very limited.In this study, there was no statistically significant change in the distribution of class 1 by KE index.In the future, we would like to investigate the relationship between the clustering results and other phenomena such as the Oyashio and Aleutian Low, which are expected to have a significant impact on the subarctic region.Many studies have been conducted on the mid-latitude Northwest Pacific Ocean, and this study took the form of a comparison with existing knowledge.Applying the same method to areas that have not been sufficiently studied, such as the South Pacific Ocean, may provide new insights into ocean structure.
In this study, PCM was used to classify the ocean structure, but PCM is still a new method, and there is still much potential for the development of this method.In this study, for example, a profile with a low labeling metric is considered to have been altered by mixing with other classes.A high labeling metric profile, on the other hand, can be considered to have not been mixed with other classes and to have retained its original characteristics.Using this concept, we can consider that a profile with a high labeling metric, even if it is distributed in a region dominated by other classes, represents water that has been transported from its original distribution area by eddies or other means while retaining its characteristics.Thus, the application of PCM is expected to provide a new way of viewing various ocean-related problems in the future.

Figure 1 .
Figure 1.Annual distribution of the number of Argo float profiles classified in this study.

Figure 2 .
Figure 2. The number of distributions of Argo float profiles per 1°grid in the study area.

Figure 3 .
Figure 3. CRMSD between the original data and the restored data.d = 0 represents the standard deviation of each layer.The cumulative contribution ratio is shown in brackets.The thick vertical lines indicate the positions of 0.5 and 0.1 for temperature and 0.05 and 0.01 for salinity.The results of the CRMSD calculation for the dataset used in this study are shown in Fig. 3. Maze et al. (2017) adopted d = 11 as the number of principal components for which the CRMSD of temperature is less than 0.5°C over all layers and less than 0.1°C below 600 m depth.In this study, salinity is used in addition to temperature, and the variation of salinity is about 1/10 of that of temperature.Using the same scale criteria asMaze et al. (2017), the CRMSD for all layers would be less than 0.5°C for temperature and less than 0.05 for salinity, and the CRMSD for depths below 600 m would be less than 0.1°C for temperature and less than 0.01 for salinity.However, even when the number of principal components was increased to d = 13, the CRMSD of salinity remained

Figure 4 .
Figure 4. Results of 50 trials of the BIC calculation.The blue line indicates the mean, and the error bars indicate one standard deviation.

Figure 5 .
Figure 5. KE index from 2003 to 2018 calculated by Qiu et al. (2020).The rectangles represent the average of the KE index values per year, and the colors represent the groups.Dark red is High, light red is Middle-high, purple is Middle, light blue is Middle-low, dark blue is Low.The dotted lines indicate 0.6, 0.2, 0, −0.2, and −0.6 from top to bottom.

Figure 6 .
Figure 6.Distribution of classified profiles.The color indicates the class to which each profile belongs.

Figure 7 .
Figure 7. Classification by the class most abundantly distributed in a 1°grid.The shade of the color indicates the percentage of the class in the grid.The darkest color represents 100%, and white represents 0%.

Figure 8 .
Figure 8. Vertical structure of temperature (upper) and salinity (lower) of the profiles classified into each class.The bold line indicates the median, and the dotted lines indicate the 5th and 95th percentiles.The number of profiles classified into each class is shown in brackets.

Figure 9 .
Figure 9. TS diagram of the median of the profiles classified into each class.

Figure 10 .
Figure 10.The distribution of the classes in each group is shown as a 1°grid.The number in the title indicates the number of profiles in each group.Other details are the same as in Fig. 7. First, we focused on distributing the profiles belonging to each class and examined the changes associated with the KE index.Fig. 10 shows the distribution of profiles per 1°grid in each group.From this figure, we can see an evident change between the groups, particularly in Class 4 and Class 5.In the following, we will examine the changes in Class 4 and Class 5 in detail.

Figure 11 .
Figure 11.Distribution of profiles for the entire High group and the years 2003, 2004, 2010, 2011, and 2018 belonging to the High group.The number

Figure 13 .
Figure 13.The time series of the percentage of LLP (green) is superimposed on Fig. 5.The percentage of LLP is shown in the form of deviation from the mean from 2003 to 2018.

Figure 14 .
Figure14.Distribution of profiles with labeling metric less than 0.9.The gray grid represents the grid with more than 10 profiles in 1°grid, and the gray shading indicates the percentage of LLPs.

Table 1 .
Groups by KE Index.