Marine Dissolved Oxygen Prediction With Tree Tuned Deep Neural Network

Changes in the dissolved oxygen concentration of the ocean have important implications for marine ecosystems and global climate change. However, limited by measurement techniques, hydrological data are not always complete. Accurate prediction of the marine dissolved oxygen concentration (MDOC) is therefore a powerful supplement to current observation data. A deep neural network is a powerful model for such prediction, but tuning its structure is usually difficult and time-consuming. Meanwhile, the deep jointly informed neural network (DJINN) provides a user-friendly method to tune the structure of neural networks. In this article, a deep learning-based model called the marine deep jointly informed neural network (M-DJINN) is proposed to predict MDOC. M-DJINN improves on DJINN by initializing the weights of the neural network with a zero-mean Gaussian distribution whose variance is related to the number of neurons in the neighboring layer. In M-DJINN, to obtain a well-structured deep neural network, users only need to tune the tree number and the maximum tree depth. When predicting MDOC on World Ocean Database 2013 (WOD13) data, the novel model proves better than DJINN in both accuracy and convergence. When the maximum tree depth is set to 10, the mean squared error (MSE) is reduced by 17.6% compared with DJINN. The experiments suggest that with enough training data, the performance of M-DJINN may keep improving as the neural networks are deepened.


I. INTRODUCTION
Marine dissolved oxygen concentration (MDOC) is an important parameter in marine chemistry, and at the same time one of the most important indicators of seawater quality. The oxygen distribution in the world's oceans reflects the condition of marine organisms and reveals phenomena such as dead zones [1]. Covering three-quarters of the earth, the ocean plays an important role in regulating the global climate, so MDOC is also considered an indicator of global warming events [2]. Moreover, oxygen is an important tracer of ocean ventilation and interior ocean circulation. Therefore, the study of MDOC is of great value in the field of oceanography and can help in tracing changes of the geophysical and geochemical systems.
Comprehensive research on MDOC needs substantial data. Many sensors and floats, the instruments used for making subsurface measurements in the ocean, have been deployed to observe marine dissolved oxygen and other variables in several large-scale global observation programs. For example, the World's Oceans Real-time Network Plan (ARGO) operates and manages more than 3,000 floats distributed across all oceans [3]. Aside from floats, employing seagoing research ships to take bottled water samples is an alternative way to obtain data. To reorganize the data from different sources, the World Ocean Database (WOD), another ocean-observation program, merges thousands of originators' data sets from many different countries and organizations [4]. However, not all the floats in these programs measure MDOC; early floats in particular only measure temperature and salinity. This leads to fewer samples of dissolved oxygen concentration than of temperature and salinity. Fig. 1 shows the proportion of total samples for several variables in WOD. New floats with additional oxygen sensors still suffer from problems such as poor factory calibration [5], and may drift when exposed to light or heat [6]. The problems above also exist for bottled water samples. Hence, it is meaningful to find the relationship between dissolved oxygen and other variables in the ocean and to predict the dissolved oxygen concentration from existing hydrological data. (The associate editor coordinating the review of this manuscript and approving it for publication was Xiaochun Cheng.)
Traditionally, climate system models and low-order oceanic biogeochemical models are employed to predict the dissolved oxygen concentration [7]. Mathematical modeling is another common approach [8]. As the scale of ocean data increases, modern machine learning algorithms, with high computation speeds and fewer assumptions about the data, show an edge over traditional methods in robustness [9] and have great potential for modeling highly nonlinear relationships between variables. Some machine learning algorithms have already been applied to hydrological data. For example, an artificial neural network (ANN) was employed to predict dissolved oxygen (DO) concentrations in the canals of Bangkok [10], the random forest algorithm was used to predict DO in the Southern Ocean [11], and support vector regression was used to predict DO in the Wen-Rui Tang river [12]. However, deep learning methods have not been applied widely to predict MDOC. A possible obstacle to promoting deep learning in this area is the tuning of model structures, which requires experience and skill in this field. For researchers who are not familiar with deep learning models, this tuning process may be both difficult and time-consuming.
Among common machine learning algorithms, decision tree-based models stand out for their robustness and the relatively few hyperparameters they require [13]. Nevertheless, this convenience comes at the cost of limited accuracy and high memory demands. In contrast, deep neural networks (DNNs), another frequently used family of models, are scalable to large volumes of high-dimensional data [14]. However, it is difficult to design the structure of a DNN. To obtain a good structure, hyperparameters including the number of layers and the number of nodes per layer must be tuned well. The performance of a DNN is closely related to this tuning, which is time-consuming for those unfamiliar with DNNs. Humbird et al. proposed a model referred to as the ''deep jointly informed neural network'' (DJINN), which combines the advantages of tree-based models and neural networks by initializing the structure of the neural network according to decision trees [15]. However, the Xavier initialization used in DJINN drags down the prediction accuracy as the maximum tree depth increases.
In this article, a novel model called the marine deep jointly informed neural network (M-DJINN) is proposed. The model has a well-tuned DNN structure and is applied to predict the dissolved oxygen concentration in the ocean. To improve the performance of DJINN in deep learning, we change the method used to initialize the weights of the neurons, so that M-DJINN is expected to achieve better accuracy and convergence than DJINN when there are more hidden layers. To accelerate convergence, we also preprocess the data with standardization. Then, we explore the relationship between two hyperparameters of M-DJINN and the performance of the model. Finally, we compare the prediction accuracy of M-DJINN with that of several other algorithms, and validate M-DJINN on the oceanic WOD13 datasets.
The potential application of the proposed method is promising. For a long time, the tuning of hyperparameters has been an obstacle to researchers in the use of deep learning models. The proposed method would encourage researchers who are inexperienced in tuning hyperparameters to use deep learning models. Especially for oceanic researchers, with the explosive growth of marine data and future improvements in computing, a deeper neural network model is expected to perform well.
The rest of the paper is organized as follows: Section II reviews related work on DJINN and the motivation to improve the algorithm. Section III puts forward a novel model that combines the advantages of random forests and neural networks in deep learning. Section IV describes the details of the experiment and Section V analyzes the experimental results. Section VI gives conclusions and a vision for further study.

II. RELATED WORK
To introduce the proposed method, a discussion on related work is given in this section.

A. AN INTRODUCTION TO DJINN
DJINN is a model that combines tree-based models with neural networks, which makes it flexible and scalable with few hyperparameters. The model is applicable to both classification and regression tasks; to predict the dissolved oxygen concentration, DJINN for regression is mainly discussed in this article. The core of DJINN is a mapping from decision trees to neural networks. As in random forest, users only need to tune two hyperparameters: the tree number and the maximum depth of the trees. Usually, both correlate positively with prediction accuracy, yet they should not be excessively large because of training time and memory demands, and they should be tuned specifically for each dataset. After the mapping, DJINN obtains initialized neural networks whose structure has already been tuned. Table 1 describes the symbols used in DJINN. For level l of the decision tree, DJINN uses arrays W_l of dimension n(l) × n(l − 1), and W_{D_b + 1} of dimension n(D_b) × N_out, to store the initial weights of the neural networks. The weights of most neurons are initialized with Xavier initialization, except that, for i from 0 to N_in − 1, the diagonal element (W_l)_{i,i} is initialized to 1 for l < L_i^max, to ensure input values are passed through the hidden layers until the decision tree no longer splits on them. When building the neural networks, DJINN uses the rectified linear unit in each hidden layer. During optimization, the cost function is the mean squared error (MSE), and the Adam optimizer [26] is used to minimize it.
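The pass-through part of this initialization can be sketched as follows. This is a minimal illustration, not DJINN's implementation: the entries that DJINN would draw from Xavier initialization are left at zero for clarity, and the function name and argument layout are ours.

```python
import numpy as np

def djinn_passthrough_weights(n, L_max, N_in):
    """Build initial weight matrices W[l] of shape (n[l], n[l-1]).

    Sketch only: entries DJINN would fill with Xavier draws stay zero;
    only the pass-through diagonal entries are set to 1.
    """
    W = []
    for l in range(1, len(n)):
        Wl = np.zeros((n[l], n[l - 1]))
        for i in range(N_in):
            # keep passing input i through the hidden layers until the
            # deepest tree level that still splits on it (l < L_max[i])
            if l < L_max[i]:
                Wl[i, i] = 1.0
        W.append(Wl)
    return W
```

For example, with three inputs whose maximum split depths are (2, 2, 1), inputs 0 and 1 are carried into the first hidden layer, while input 2 is not.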

B. REFLECTION ON DJINN
DJINN employs the rectified linear unit to build the neural network. A rectified linear unit is a neuron that uses the rectifier function as its activation function [16]:

f(x) = max(0, x).

The rectifier function allows a network to obtain a sparse representation easily, bringing benefits such as information disentangling, efficient variable-size representation, and linear separability.
At the same time, DJINN employs Xavier initialization to initialize the weights of the neural networks. The algorithm initializes the weights of the neurons in level i from a zero-mean normal distribution with variance 3/(n_in + n_out), where n_in and n_out represent the number of neurons in level i − 1 and in level i + 1, respectively.
However, to allow gradients to flow backwards, the expected performance of Xavier initialization relies on two assumptions [17]:
• The activation function of the neuron, f(x), is linear and symmetric about the y-axis.
• The variances of the input features follow the same distribution.
Obviously, the first assumption fails for the rectifier function, which is neither linear nor symmetric about the y-axis. This contradiction may cause the gradient to diminish, and the learning process of the model would probably stall. This finding motivates us to improve DJINN; our improvements are presented in Section III.
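The effect of this mismatch can be seen in a small simulation (a sketch of the phenomenon, not an experiment from this article): propagating a random signal through a deep stack of rectified linear layers with zero-mean Gaussian weights of variance 1/n makes the activation magnitude collapse, while variance 2/n, the choice adopted in Section III, preserves it. All names and parameter values here are ours.

```python
import numpy as np

def propagate(var_scale, n=256, layers=20, seed=0):
    """Push a random signal through `layers` ReLU layers whose weights
    are drawn from N(0, var_scale / n); return the final activation std."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(layers):
        W = rng.normal(0.0, np.sqrt(var_scale / n), size=(n, n))
        x = np.maximum(0.0, W @ x)  # rectified linear unit
    return x.std()

# variance 1/n: the ReLU halves the signal variance at every layer,
# so the activations shrink geometrically with depth
shrunk = propagate(1.0)
# variance 2/n: the factor of 1/2 is compensated and the signal survives
preserved = propagate(2.0)
```

With 20 layers, the first run returns a value orders of magnitude smaller than the second, which is exactly the diminishing-signal problem described above.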

III. METHODOLOGY
With only a few easy-to-understand hyperparameters, DJINN with Xavier initialization provides relatively high prediction accuracy and brings much convenience to users. However, the first assumption of Xavier initialization holds only when the activation function is linear; for nonlinear activations, Xavier initialization may drag down both the prediction error and the convergence [18]. For deep neural networks trained on large ocean datasets, this disadvantage may be even worse. In our improvements, we retain the second assumption of Xavier initialization.

A. INITIALIZATION OF THE NEURAL NETWORK
In a neural network that uses f(x) as the activation function, for level i, write W_i for the weight matrix, Z_i for the activation vector, b_i for the bias vector, and S_i for the pre-activation (argument) vector. Then

S_i = W_i Z_{i−1} + b_i,   Z_i = f(S_i).

With Var(x) denoting the variance of a variable x, and b_i initialized to zero, there exists

Var(S_i) = Var(W_i Z_{i−1}).

If we write w_i, z_i, and s_i for the random variables behind the elements of W_i, Z_i, and S_i respectively, and n_i for the number of connected neurons in the parent level, then we get

Var(s_i) = n_i Var(w_i z_{i−1}).   (6)

Note that the following equation holds when w_i and z_{i−1} are independent and identically distributed:

Var(w_i z_{i−1}) = Var(w_i) Var(z_{i−1}) + E(w_i)^2 Var(z_{i−1}) + E(z_{i−1})^2 Var(w_i),   (7)

where E(x) represents the expectation of variable x. Substituting Equation (7) into Equation (6), we have

Var(s_i) = n_i [Var(w_i) Var(z_{i−1}) + E(w_i)^2 Var(z_{i−1}) + E(z_{i−1})^2 Var(w_i)].   (8)

Here is the key point: under the first assumption, Xavier initialization simplifies Equation (8) by letting E(z_{i−1}) equal zero, and obtains its conclusion. To further improve the initialization, we substitute the Xavier initialization in DJINN with another initialization method, designed specifically for nonlinear activation functions such as the rectifier function [18].
As discussed in Section II-B, the rectifier function is nonlinear and not symmetric about the y-axis, so E(z_{i−1}) is no longer zero. We assume instead that E(w_i) is zero, and from Equation (8) we get

Var(s_i) = n_i Var(w_i) E(z_{i−1}^2).   (9)

Note that if we let w_{i−1} have a symmetric distribution around zero and initialize b_{i−1} to zero, then s_{i−1} has zero mean and a symmetric distribution around zero [17]. This leads to

Var(s_{i−1}) = E(s_{i−1}^2).   (11)

If we write g(s) as the probability density function of s_{i−1}, then we have

E(s_{i−1}^2) = ∫_{−∞}^{+∞} s^2 g(s) ds.   (12)

Meanwhile, note that E(z_{i−1}^2) can be written as

E(z_{i−1}^2) = ∫_{−∞}^{+∞} f(s)^2 g(s) ds.

For the rectified linear unit, f(s) = max(0, s), so this leads to

E(z_{i−1}^2) = ∫_{0}^{+∞} s^2 g(s) ds.   (14)

With the distribution of s_{i−1} being symmetric around zero, s^2 g(s) is an even function. Therefore, combining Equation (12) and Equation (14), we have

E(z_{i−1}^2) = (1/2) E(s_{i−1}^2).   (15)

Combining Equation (11) and Equation (15), we have

E(z_{i−1}^2) = (1/2) Var(s_{i−1}).   (16)

Then Equation (9) transforms to

Var(s_i) = (n_i / 2) Var(w_i) Var(s_{i−1}).   (17)

In DJINN there are D_n hidden layers. Setting L equal to D_n and chaining Equation (17) across the layers, we get

Var(s_L) = Var(s_1) ∏_{i=2}^{L} (n_i / 2) Var(w_i).   (18)

To keep the information flowing, we expect Var(s_L) to equal Var(s_1) [18]. Thus, we reach the conclusion of the initialization method:

(n_i / 2) Var(w_i) = 1,  i.e.,  Var(w_i) = 2 / n_i.   (19)

From Equation (19) and the assumption that the weights have zero mean, the initialization method lets the weight w_i follow a zero-mean Gaussian distribution and sets the variance of the distribution to 2/n_i. Two points deserve mention. Firstly, for the first layer, n_1 · Var(w_1) should be 1, because no rectified linear unit is applied to the input signal; for simplicity, however, n_1 · Var(w_1) can just be set to 2. Secondly, Equation (17) applies to both forward propagation and backward propagation and keeps information flowing under both circumstances [18].
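The conclusion of Equation (19), weights drawn from a zero-mean Gaussian with variance 2/n_i where n_i is the fan-in [18], can be written as a one-line initializer. This is a minimal sketch; the function name is ours.

```python
import numpy as np

def he_normal(fan_in, shape, rng=None):
    """Draw weights from N(0, 2 / fan_in), as in Equation (19)."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

# a 256-neuron layer fed by 512 neurons: each weight has variance 2/512
W = he_normal(fan_in=512, shape=(256, 512))
```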

B. M-DJINN ALGORITHM AND OPTIMIZATION
The inputs of M-DJINN are the maximum tree depth and the tree number. The output of M-DJINN is an ensemble of finely tuned and initialized neural networks, whose number is decided by the tree number. To begin with, decision tree-based models are constructed according to the inputs. Then, to perform the mapping from a tree to a neural network, M-DJINN traverses the decision paths of the tree-based models twice: first to determine the structure, and then to initialize the weights.
Algorithm 1 demonstrates the mapping from a single decision tree to a neural network. Define the primary branch of the tree as the l = 0 level, and the maximum tree depth as D_m; thereby the maximum branch depth is D_m − 1, denoted by D_n. After the first traversal, the maximum branch depth (D_n), the maximum depth at which each input occurs as a branch (L_i^max), and the number of branches at each level (N_b(l)) are determined. The mapped neural network consists of an input layer corresponding to the l = 0 level of the tree, D_n hidden layers, and an output layer. For an M-DJINN with N_in inputs and N_out outputs, there are N_in neurons in the input layer and N_out neurons in the output layer. For the hidden layers with n(l) neurons, n(l) can be defined as follows:

n(0) = N_in,   n(l) = n(l − 1) + N_b(l)   for l from 1 to D_n.   (20)

For level l from 1 to D_m, denote the current tree node as t, the mapped decisive path of t as c, and the neuron created by the parent branch as p. If t is a branch, add a new neuron to this layer according to Equation (20), and connect p to the new neuron and c to the new neuron. If t is a leaf, connect c to the output directly. M-DJINN uses the initialization method in Section III-A to initialize the weights of the neurons, and the biases of the neurons are initialized to 0. Repeating the steps above for each tree node in level l builds the connected neural networks.
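The first traversal can be sketched with a scikit-learn tree. This is a sketch under our reading of the mapping (the cumulative-width rule of Equation (20)); the function name and the use of scikit-learn's internal tree arrays are our assumptions, not part of the original algorithm.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def hidden_layer_widths(tree, n_inputs):
    """First traversal of the mapping: count branch nodes per tree level
    (N_b(l)) and derive hidden-layer widths n(l) = n(l-1) + N_b(l)."""
    t = tree.tree_
    depth = {0: 0}            # node index -> tree level, root is level 0
    n_branch = Counter()      # level -> number of branch nodes N_b(l)
    stack = [0]
    while stack:
        node = stack.pop()
        if t.children_left[node] != -1:  # internal (branch) node
            n_branch[depth[node]] += 1
            for child in (t.children_left[node], t.children_right[node]):
                depth[child] = depth[node] + 1
                stack.append(child)
    widths = [n_inputs]       # n(0) = N_in, the input layer
    for l in range(1, max(n_branch) + 1):
        widths.append(widths[-1] + n_branch[l])
    return widths
```

On the example of Fig. 2, with three inputs and N_b(l) = (1, 2, 3), this rule yields widths (3, 5, 8), matching the neurons n_11 through n_15 at level 1.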
For M-DJINN with multiple decision trees, the mapping is supposed to be repeated for each tree and an ensemble of neural networks is created in this way.
To sum up, after the pre-training of the decision trees, we construct the neural networks according to Algorithm 1. Subsequently, we tune the weights of the neural networks by backpropagation. For regression tasks, such as predicting MDOC, we choose MSE as the cost function of M-DJINN. To minimize the cost function, the Adam optimizer is employed for its low memory demands and robustness. The M-DJINN model was implemented in TensorFlow [19].

[Algorithm 1 (only its final steps survive in this extraction): initialize the pass-through weights W^l_{p,p} for l from i + 1 to D_n − 1, and the output weights W^{D_n}_{p,out}, from N(0, 2/γ), with γ the fan-in; then output the initialized neural network.]

Fig. 2 gives a vivid view of the relationship between a decision tree and the mapped neural network. According to M-DJINN, D_n of the decision tree in Fig. 2(a) is 3, with N_b(l) = (1, 2, 3, 0) for l = 0, 1, 2, 3, and L^max = (2, 2, 2) for z_1, z_2, and z_3. The weights of the neural network in Fig. 2(b) are initialized to 0 at the beginning, and the elements on the diagonal of W_l whose levels are less than L_i^max are initialized to 1 afterwards. To construct the structure of the neural network, we traverse every decision path of the decision tree.
To begin with, consider decision path 1, shown by the yellow lines. For level 1, add neuron n_14 and connect z_1 to n_14 and z_3 to n_14. For level 2, add neuron n_26; the neuron created by the parent branch of n_26 is n_14, so we first connect n_14 and n_26. Note that the mapped decisive path c of node z_1 ≤ d_4 in level 2 is z_1 to n_11, because the path transmits the value of z_1; thus, we connect n_11 and n_26. Finally, decision path 1 leads to output A, so we connect n_26 and A. The corresponding elements of W are updated according to M-DJINN during the process.
As for decision path 5, in level 1, add neuron n_15 and connect z_1 to n_15 and z_2 to n_15. In level 2, connect n_15 and n_25; because node z_2 ≤ d_2 connects to leaf A directly on decision path 5, to transmit the value we also connect n_25 directly to output A in level 3.
Decision paths 2, 3, 4, 6 and 7 can be traversed similarly to decision path 1. One detail is that some neurons in the neural network, painted gray, are initially unconnected. These neurons cannot learn, so they are not included in the final architecture of the initialized neural networks.

IV. EXPERIMENTS
In this section, we use deep neural networks tuned by M-DJINN to predict the dissolved oxygen concentration in the ocean. The performance is evaluated on the World Ocean Database 2013 (WOD13) provided by the National Science & Technology Resource Sharing Service Platform of China [20]. We also compare M-DJINN with several baseline methods. The results of the experiment are discussed in Section V.
We choose M-DJINN for this problem area because the proposed method combines the advantages of both decision trees and neural networks. For researchers who are inexperienced in tuning deep learning models, the proposed method reduces the workload. Moreover, the proposed method is also observed to outperform similar methods in accuracy and convergence.
A. DATASETS
The WOD13 datasets [21] collect several important oceanographic variables annually, including temperature, salinity, phosphate, and dissolved oxygen. The ocean data in WOD13 are collected globally. The red points in Fig. 3 show the distribution of the ocean stations that provide bottle data for WOD13. The WOD13 datasets cover many kinds of ocean areas, including but not limited to coastal areas, gulf areas, the Arctic Ocean, and the equatorial ocean. It is quite difficult to build traditional or mathematical models that hold good in all those areas; however, with plentiful training data, well-tuned deep learning models are expected to be capable of building a universal model for them. We randomly choose observed data from the years 2001 to 2010. The observation points are randomly chosen from the global ocean, with depth ranging from zero to 4200 meters. The predictors include temperature and salinity, since warmer water holds less oxygen and salinity also affects oxygen solubility [10]. Other factors, including biological activity [22], longitude and latitude, ocean-atmosphere interactions, and ocean circulation, are considered to influence the oxygen concentration too. So, we use phosphate, salinity, temperature, depth, silicate, longitude, latitude, and chlorophyll as the predictors in the experiments.

B. PREPROCESSING DATA
This section presents the preparation before we apply the algorithms to the datasets. After these steps, we randomly select 80 percent of the samples as the training data for each model, and the remaining 20 percent as validation data.
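The random 80/20 split can be sketched as follows; the function name and the fixed seed are our assumptions for reproducibility, not details given in the article.

```python
import numpy as np

def train_val_split(X, y, train_frac=0.8, seed=42):
    """Randomly assign train_frac of the samples to training,
    the remainder to validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle sample indices
    cut = int(train_frac * len(X))
    tr, va = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[va], y[va]
```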
In the original WOD13 datasets, due to the different sources, some observed data have quality problems such as excessive gradients and depth inversion. The WOD program has checked the datasets and labeled erroneous samples with flags. Hence, we filter the data by removing samples whose flags are set to error in WOD13. Besides, some samples lack values for some variables, so we also filter these samples out of the datasets.
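The two filtering steps can be sketched with pandas. The column name `oxygen_flag` and the convention that 0 means "good" are hypothetical placeholders; WOD13 has its own flag encoding, which the reader should substitute.

```python
import pandas as pd

GOOD_FLAG = 0  # hypothetical: assume 0 marks a sample that passed QC

def clean_samples(df, flag_col="oxygen_flag"):
    """Drop samples flagged as erroneous by the WOD quality control,
    then drop samples with missing values in any variable."""
    df = df[df[flag_col] == GOOD_FLAG]
    return df.dropna()
```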
After randomly choosing 10,000 samples from the cleaned datasets, we find that the ranges of the features differ greatly. For example, depth ranges from 0 to 4200 meters, while the oxygen concentration ranges from 0 to 16 µmol/kg.
In training and fitting, the value of the cost function may be dominated by the features with large numeric ranges, which leads to long running times and slow convergence of the gradient descent. Considering that M-DJINN uses MSE as the cost function, it is necessary to scale the features before training and fitting.
Here, we choose the z-score model to standardize the data. The z-score model gives all features zero mean and unit variance with the following transformation:

X' = (X − µ) / σ,

where X denotes the feature, µ denotes the mean value of X, σ denotes the standard deviation of X, and X' denotes the result.
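The z-score transformation above is a few lines of numpy; this sketch standardizes each feature (column) independently.

```python
import numpy as np

def z_score(X):
    """Standardize each feature (column) to zero mean and unit variance:
    X' = (X - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```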
The comparisons between original data and transformed data on depth and temperature are shown in Table 2.

C. BASELINE METHODS
We compare the prediction performance of M-DJINN with the following models:
1) DJINN: DJINN is a neural network model whose structure is tuned by decision trees. This method shows the performance of finely tuned neural networks with the state-of-the-art technique. The model initializes the weights of the neurons with Xavier initialization.
2) Random Forest: Random forest is an ensemble model based on decision trees. To get a good result, two hyperparameters must be tuned: the tree number and the maximum tree depth. To make the comparison fair, we set them equal to the tree number and the maximum tree depth in M-DJINN, respectively.
3) Support Vector Regression: Support vector regression (SVR) [23] is a statistical machine learning model for regression tasks. To use it, a kernel function should be selected carefully, and hyperparameters, including the penalty parameter C and the kernel coefficient gamma, must also be tuned. To make the comparison fair, we use the grid-search technique to tune these hyperparameters, with C ranging in {0.001, 0.01, 0.1, 1} and gamma ranging in {0.5, 1, 1.5, 2, 2.5}. The kernel function of the SVR is the radial basis function (RBF).

4) Multilayer Perceptron: The multilayer perceptron (MLP) is a neural network model with multiple hidden layers, generally considered a subset of DNNs. In deep learning, the structure of an MLP is supposed to be designed elaborately, which is difficult for researchers unfamiliar with deep learning. In the experiment, this method shows the performance of trivially tuned neural networks. We set the number of layers of the MLP equal to the maximum tree depth in M-DJINN, and optimize the MLP via stochastic gradient descent (SGD).

D. EVALUATION METRICS
To quantify the quality of a model's predictions, we measure the models via three metrics: mean squared error, accuracy, and explained variance score [24]. Mean squared error (MSE) is defined as follows:

MSE = (1/n) Σ_{i=1}^{n} (y_i − y_i')^2,   (21)

where n is the number of samples, y_i represents the true value of sample i, and y_i' represents the predicted value of sample i. The smaller the MSE, the more accurate the prediction. From the MSE, we also define an accuracy score for each model. The explained variance score (Expl_Var) is defined as follows:

Expl_Var = 1 − Var(Y' − Y) / Var(Y'),   (22)

where Y is the estimated output, Y' is the true output with respect to Y, and Var denotes the variance. As shown in Equation (22), the best score is 1, and the lower the score, the worse the prediction is explained by the variance of the dependent variables.
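The two formulas above translate directly into numpy; in this sketch `y_true` plays the role of Y' and `y_pred` the role of Y.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over n samples, Equation (21)."""
    return np.mean((y_true - y_pred) ** 2)

def explained_variance(y_true, y_pred):
    """Explained variance score, Equation (22): 1 - Var(residual)/Var(true).
    A perfect prediction scores 1; lower is worse."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)
```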

V. RESULTS AND DISCUSSION
In this section, we evaluate M-DJINN model under different hyperparameters and compare the results with other common methods. We also analyze the convergence of M-DJINN. For the regression tasks in the following, we choose the predictors discussed in Section IV-A.

A. INFLUENCE OF TREE NUMBER ON M-DJINN
In this section, we discuss the influence of the number of decision trees on M-DJINN. The experiment is conducted on the training set and the validation set. We first set the limit on the maximum tree depth to 5, and let the number of decision trees range in {4, 6, 8, 10, 15, 20}. The evaluated metrics are shown in Table 3. It can be concluded from the results that, in general, the more decision trees, the more accurate M-DJINN is. Nevertheless, when the tree number increases to a certain level, such as 15 or 20 in Table 3, the model may become too complicated and overfit [25]. A suggested method to prevent this phenomenon is to increase the scale of the dataset.

B. CONVERGENCE OF M-DJINN
Fig. 4 compares the training cost of M-DJINN and DJINN; in Fig. 4(b), we set the maximum tree depth to 10. The curves of the cost function show that the convergence of M-DJINN is relatively fast, and M-DJINN is observed to start at, and often converge to, a lower cost than DJINN under this condition.
The results support our idea in Section III-A. Because of the different initialization methods, M-DJINN converges better than DJINN. As the number of layers grows, M-DJINN allows the forward and backward signals to keep flowing, while DJINN fails to do so.
We also observe that the tuning of the maximum tree depth influences the cost effectiveness of M-DJINN. In the experiment, we set the number of trees to 5, and let the maximum tree depth range over {6, 8, 10, 12, 14}; the resulting cost curves are shown in Fig. 5. The performance of M-DJINN improves when the maximum tree depth increases from 6 to 8, and is best at 10. When the maximum tree depth increases to 12 or 14, the convergence of M-DJINN drops. The result reveals that the maximum tree depth should be carefully optimized for each dataset.
Because of the cost of pre-training decision trees, the computational complexity of M-DJINN is higher than that of a simply tuned MLP. However, to reach a performance comparable with M-DJINN, the hyperparameters of the MLP must be fine-tuned, which costs much more than pre-training the tree models [15]. Overall, M-DJINN is still a robust model with competitive cost effectiveness.

C. COMPARISON ON ACCURACY
In this section, we apply M-DJINN to predict the marine dissolved oxygen concentration and validate it by comparing its performance with DJINN and the other baseline methods. We use the training set to train the models and the validation set to validate them. We set the number of decision trees to 5, and let the depth range in {9, 10, 11, 12} for random forest, MLP, DJINN, and M-DJINN. We also compare M-DJINN with support vector regression: the MSE of SVR is 0.332, and its explained variance score is 0.904. The final results of M-DJINN, DJINN, random forest, SVR, and MLP are evaluated by MSE and explained variance score. The comparisons between the baselines and M-DJINN are shown in Fig. 6. Table 4 shows the improvement rate of M-DJINN on MSE for different maximum tree depths.
We also conduct a t-test on the MSE between M-DJINN and the other baseline methods to verify the significance of the results, under the null hypothesis that the methods have identical average (expected) performance. The results of the t-test are shown in Table 5, where all the p-values are less than 0.05. The hypothesis of equal averages is therefore rejected, which supports that M-DJINN performs better than the other baseline methods.
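Such a test can be run with scipy's two-sample t-test; the per-run MSE values below are illustrative placeholders, not the article's measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run MSE values for two methods (illustration only).
mse_a = np.array([0.27, 0.28, 0.26, 0.27, 0.28])
mse_b = np.array([0.33, 0.34, 0.32, 0.33, 0.35])

# two-sample t-test of the null hypothesis of equal mean MSE
t_stat, p_value = stats.ttest_ind(mse_a, mse_b)
# a p-value below 0.05 rejects the hypothesis of equal averages
```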
From Fig. 6, we can see that deep neural networks tuned by M-DJINN visibly outperform roughly tuned deep neural networks, such as the MLP in the experiment. With our adaptation, M-DJINN also outperforms DJINN at every depth, and it performs better than the other baselines as well. Besides, as the maximum tree depth increases, the MSE of M-DJINN first drops to its smallest value and then increases slightly. On the explained variance score, M-DJINN also performs slightly better than the other algorithms. Since the maximum tree depth determines the depth of the neural network, the experimental results indicate that for deep learning, M-DJINN shows its advantage over the other baseline methods.
The results support our idea in Section III-A from the aspect of accuracy. For deeper neural networks, the gradients of DJINN diminish during training, which drags down its accuracy. Meanwhile, the decision trees in random forest are traditionally confined to on-axis splits, which limits their accuracy. It is also observed that when the maximum depth of the decision trees reaches a certain level, such as 12, the performance of M-DJINN may drop, which is probably caused by overfitting. However, as the scale of oceanic data keeps increasing, the influence of overfitting may shrink with enough training data.
Considering the time cost and the benefits, we did not use the grid-search technique to tune the hyperparameters of the proposed method. In the experiments, although the performance of M-DJINN improves with more decision trees, we find that when the number of decision trees exceeds 6, the performance stalls. Besides, we also observe that, for the same number of trees, M-DJINN reaches its best performance when the maximum depth is 10. So, in the trade-off between time cost and accuracy, we set the number of decision trees to 5 and the maximum tree depth to 10. Table 5 shows the comparison between the algorithms under these two hyperparameters. Compared with DJINN, random forest, SVR, and MLP, the MSE of M-DJINN is reduced by 17.6%, 20.5%, 37.9%, and 29.0%, respectively.
In Table 4, we also calculate the accuracy of the algorithms, and M-DJINN still performs better than the other baselines. After training the M-DJINN model with 10,000 processed samples, we randomly choose 1,000 positions around the ocean and predict the MDOC at these positions. The curves of the true values and the predicted values are drawn in Fig. 7; the predicted values fit the true values quite well.

VI. CONCLUSION AND VISION FOR FUTURE WORK
In this article, we put forward an adapted model, M-DJINN, to apply DJINN in deep learning. The novel model keeps the user-friendly features of DJINN while achieving a reduced prediction error and better convergence. We evaluate our algorithm on the WOD13 datasets, and the experiments show that with enough data to train on and tune hyperparameters with, M-DJINN performs better than DJINN, random forest, SVR, and MLP on both MSE and explained variance score. A potential limitation of the proposed method is that M-DJINN is formulated for fully connected feedforward neural networks. However, for more complex neural networks, such as those with convolution layers, M-DJINN can still be useful after the features are extracted by the convolution layers.
Deep learning models with well-tuned structures can help us learn the potential laws of nature more easily. For researchers who are not familiar with tuning deep neural networks, our adapted model provides a user-friendly solution with better accuracy and convergence. We will continue to improve the deep learning models, predict various hydrological features, and apply the results toward a deeper knowledge of the complex dynamics that govern the geophysical and geochemical systems of our planet.