Regression analysis for container ships in the early design stage

The seaway trade market has expanded in recent years and container ship dimensions are constantly increasing for higher cargo capacity. In the early design stage, main dimensions are usually determined based on an existing ship database from which regression formulas are derived. In the present paper, a database of 260 non-sister container ships built from 1979 to 2022, representing 20% of the world fleet, has been considered to derive and compare different types of regressions. Simple regressions have been developed and compared with equivalent formulations available in the literature, and they provide better approximations of the trends. The study has been further extended by multivariable regressions and forest tree algorithms, which allow the use of more than one independent variable and provide a better fit than simple regressions. Forest tree regressions return the highest values of the fit coefficients, but the technique is not easy to apply because it does not provide mathematical expressions. The main contribution is the updated set of simple and multivariable regression formulas, which have a higher goodness of fit than previous works and can be easily employed by designers in the early design stage and in multi-attribute design procedures.


Introduction
Within today's shipping, container ships are the trendsetters which have revolutionized the transport of goods. Even though the first "container ship" dates back to 1956, the massive "containerization" started only in January 1968 with the introduction of ISO 668 (1968), which defined terminology, dimensions and ratings for freight containers. In the last two decades, the shipping market witnessed the trend of a continuous increase of goods transported by containers, leading to an enlargement of the container ship fleet and the entrance of the Very-Large Container Ships, VLCS, and the Ultra-Large Container Ships, ULCS, with up to 23000 TEU. In parallel, two aspects related to container ships are drawing the attention of experts in the transport and logistic chain and of naval architects: ultimate limit ship dimensions and the yearly number of containers lost at sea.
Economics and logistics experts (Malchow, 2017; Saxon and Stone, 2017; Garrido et al., 2020) analyzed the trends of container ship growth based on economies of scale, port infrastructure, demand, and environmental tendencies, to predict the ship size limits. According to Malchow (2017), a 30000 TEU container ship with approximately 20 m draught should be the ultimate limit because of the depth constraints in the Malacca Strait and the Suez Canal. Saxon and Stone (2017) envisaged even 50000 TEU container ships in the next 50 years. Garrido et al. (2020) analyzed the design restrictions and stability regulations, the geographic and port restrictions, the economies of scale of container shipping lines, the CO2 emissions of vessels, and the world economy, demand, and global trends. The authors concluded that all trends indicate 30000 TEU container ships on the market by 2030 and reported that, according to the capital cost, the optimal ship size has 418 m length, 69 m breadth, 35 m moulded depth, and 17 m draught. They calculated the metacentric height and stated that its reasonable value should be between 4% and 5% of the breadth to avoid stability problems. Finally, the authors compared the "ideal" dimensions predicted by different authors, namely Park and Suh (2019), Kristensen (2012) and Korea Maritime Institute (2012). Under the assumption of a 30000 TEU ship, the predicted ship length, breadth and draught are equal to 453 m, 72 m and 17.3 m (Park and Suh, 2019); 483 m, 71.5 m and 18.7 m (Kristensen, 2012); and 517 m, 65 m and 19.4 m (Korea Maritime Institute, 2012). As reported, the differences in draught and breadth are relatively small as they have upper limits dictated by water depth in ports and canals and by the arm-length of port cranes. The decreased "optimal" length clearly indicates the change in design trends of the last decade.
Another relevant issue is the number of containers lost at sea each year. In 2011 the World Shipping Council (WSC) started a survey among its members (covering more than 90% of the global container ship capacity) to accurately estimate the number of containers lost at sea each year. Reviewing the results of the total surveyed period 2008-2022 (WSC, 2023), the WSC estimated that there was, on average, a total of 1566 containers lost at sea each year. Average losses for the last three years (2020-2022) were 2301 containers per year. Up-to-date results indicated the parametric roll as one of the main reasons for container losses. Actions to preclude further accidents for the existing ships are the training of mariners to recognize and prevent the parametric roll (Galeazzi et al., 2013) and the application of operational guidance (Begovic et al., 2023), which clearly identifies speeds and headings where the ship may be vulnerable. For the new vessels, operators should consider from the design stage the vulnerability to stability failure modes, such as parametric roll (France et al., 2003), excessive acceleration (IMO SLF 54/INF.6, 2011) or pure loss of stability.
It is evident that, for such a competitive and demanding market, ship design must assure maximum performance in all these aspects using up-to-date knowledge, software and technology. A reliable preliminary design assessing the main ship's characteristics is essential. As underlined by Papanikolaou (2014), the values of these characteristics mainly depend on the four main basic demands (cargo capacity, top speed, range/autonomy and class), but are dominated by constraints such as the maximum value of breadth or draught to pass through the canals or enter ports. The "traditional" approach to determine the main ship characteristics employs regression formulas obtained from a database of similar ships. When following the design spiral, a few iterations are needed to retrieve the "optimal" main particulars. Kristensen (2012) published a set of linear and nonlinear regressions to estimate main dimensions, deadweight and various design parameters as a function of the number of TEU. For the development of these regressions, container ships built between 1990 and 2010 were classified into three groups (Small, Panamax and Post Panamax).
In multi-attribute and multi-objective ship design (Zanic et al., 1992; Trincas et al., 1994; Grubisic and Begovic, 2001, 2011; Mauro et al., 2019; Ljulj et al., 2020), regression formulas or "low-fidelity" codes are the basis of different modules which calculate ship attributes (power, deadweight, structural weight, total cost, etc.) based only on ship main parameters. It is evident that the accuracy of the obtained results will be strongly affected by the accuracy of the input regression formula.
In the last two decades, the application of artificial intelligence (AI) techniques, such as genetic algorithms, neural networks, and machine learning, has been increasing in all phases of ship design. AI techniques are in continuous development, but the main idea is to find the optimal design starting from a few macroscopic parameters provided by the owner (such as cargo capacity, maximum speed, and range/autonomy). All other parameters are then estimated through a regression analysis of an existing ship database and/or by applying some AI techniques to perform the optimization. One example is the automatic hull form generation, as shown by Islam et al. (2001). The authors performed a three-step procedure starting from 104 ships' half-breadths, then used neural networks to adjust parameter values and finally used the genetic algorithm to design the body plan. Their work is one of the first examples where the advantages of both techniques have been combined: neural networks are used to identify the data pattern and to predict the required parameters, and genetic algorithms are used for search-based optimization problems (i.e. maximization or minimization of the objective function), which are difficult and time-intensive to solve by other general algorithms. Clausen et al. (2001) acquired a database of 87000 ships and applied regression analysis, Bayesian and neural networks to encode the relations between ship main characteristics and loading capacity for different ship types. They concluded that neural networks are easier to implement and yield smaller estimate errors than Bayesian networks. This work remains, up to now, the one with the most extensive database used for the neural network.
The applications of AI as a predictive tool for seakeeping (Romero-Tello et al., 2022), fuel consumption (Uyanık et al., 2020), and corrosion damage detection (Yao et al., 2019) are numerous, showing their great potential when the model is developed from a big data sample, as for example from the data monitored on board during voyages.
Recent works on the application of AI in preliminary ship design, such as Ekinci et al. (2011), Gurgen et al. (2018), Cepowski and Chorab (2021), and Majnaric et al. (2022), used artificial neural networks, machine learning and multiple regression analysis to develop design formulas considering databases of a few hundred ships. Concerning the size of the database, the efficiency of AI techniques can be questioned because, as a "rule of thumb", the minimum sample should be at least 30 times the number of weights (Alwosheel et al., 2018).
The present work reports the statistical and regression analyses of the database obtained from the Hyundai catalogue (HHI shipbuilding group performance record, 2022) of container ships built from 1979 to 2022. The objective of this work is to present an overview of the possible methodologies implementable for the determination of main design parameters for container ships. The application of an AI technique (forest tree) has been explored to obtain the estimations with the highest coefficient of determination R². Even though it is a black-box method and does not return any mathematical expressions, it provides better results than simple and multivariable regression methodologies. Therefore, the order of the multivariable regression models has been further varied to improve the R² values.
The different regression methodologies are described in Section 2. In Section 3, a complete database of about 1000 ships is presented, but for all regression analyses a reduced database of 260 vessels without sisterships has been considered. In Section 4, simple regression formulas, of the same form as published in Papanikolaou (2014), have been developed to investigate the trend of new container ships. In Section 5, the simple regression formulas obtained from the present database have been compared with the ones developed by Cepowski and Chorab (2021) and the ones presented in Papanikolaou (2014). In Section 6, a new nonlinear multiple variable regression is performed, and the accuracy of the predictive models is discussed according to their mathematical formulation and different fit coefficients. In Section 7, a forest tree algorithm has been used to identify non-correlated parameters for the multiple regression analysis. In Section 8, an example of the design parameters prediction with all the regression methodologies is provided for a container ship of 20000 TEU at a speed of 23 knots. Discussion and conclusions are presented in Section 9.

Regression analysis methodologies
The regression formulas can be used to predict the value of one parameter based on the value of one or more variables and in this section the applied methodologies are described.
At first, simple regressions have been developed in the forms of Equation (1) (polynomial, power or logarithmic), where one parameter y is a function of only one variable x with the coefficients a and b. To increase the accuracy of the predicted values, multivariable regressions have been developed, where one parameter can be estimated based on two or more variables. The general model for multiple linear regression is given by the following matrix formulation:
Y = X c + d    (2)
where Y is the matrix of measured values, X is the matrix of independent variables, c is the matrix of coefficients and d is the matrix of errors.
Using the matrix formulation, the unknown of the problem is the matrix c, obtained as follows:
c = (X′X)⁻¹ X′Y    (3)
where X′ is the transpose of matrix X and (X′X)⁻¹ is the inverse of (X′X). Finally, forest tree algorithms have been developed with different sets of input variables to explore another technique and compare the various regression methods.
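As a minimal sketch of the matrix formulation above, the coefficients c can be obtained by solving the normal equations (X′X) c = X′Y. The helper names (`transpose`, `matmul`, `solve`, `ols`) and the numbers are illustrative assumptions, not taken from the paper or its database; the linear system is solved by elimination rather than by forming the inverse explicitly, which gives the same c.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b (b is a column) by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i][0]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(X, Y):
    """Least-squares coefficients from the normal equations (X'X) c = X'Y."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))

# Made-up illustrative data generated from y = 2 + 3*x1 - 1*x2 (no noise).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
X = [[1.0, x1, x2] for x1, x2 in rows]   # leading 1 provides the intercept
Y = [[2.0 + 3.0 * x1 - 1.0 * x2] for x1, x2 in rows]
coeffs = ols(X, Y)                        # approximately [2, 3, -1]
print(coeffs)
```

With noisy data the same call returns the least-squares estimates, and the residuals Y − Xc play the role of the error matrix d.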
Forest tree, also referred to as random forest, is a supervised machine learning algorithm widely used for classification and regression problems. It builds an ensemble of decision trees on different samples of the data and combines their outputs: the majority vote in the case of classification and the average in the case of regression. Combining the outputs of many trees delivers a more accurate result than a single tree and mitigates overfitting, which makes the algorithm suitable for complex datasets. One of its most relevant features is that it can handle data sets containing both continuous variables, as in the case of regression, and categorical variables, as in the case of classification.
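The averaging of trees described above can be illustrated with a small, self-contained sketch. It is not the MATLAB implementation used in this work: the function names, the tree depth, the number of trees, the `max_features` setting and the data are all assumptions chosen only to show bootstrap sampling, random feature selection at each split and averaging of the tree outputs.

```python
import random

def mean(values):
    return sum(values) / len(values)

def build_tree(X, y, depth, min_leaf, max_features, rng):
    """Grow one regression tree; every split tests a random subset of features."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return mean(y)
    best = None
    for f in rng.sample(range(len(X[0])), max_features):
        for t in sorted(set(row[f] for row in X))[1:]:
            left = [i for i, row in enumerate(X) if row[f] < t]
            right = [i for i, row in enumerate(X) if row[f] >= t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml = mean([y[i] for i in left])
            mr = mean([y[i] for i in right])
            sse = (sum((y[i] - ml) ** 2 for i in left)
                   + sum((y[i] - mr) ** 2 for i in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, left, right)
    if best is None:
        return mean(y)
    _, f, t, left, right = best
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  depth - 1, min_leaf, max_features, rng)
    return (f, t, grow(left), grow(right))

def predict_tree(node, row):
    while isinstance(node, tuple):          # leaves are plain floats
        f, t, left, right = node
        node = left if row[f] < t else right
    return node

def fit_forest(X, y, n_trees=50, depth=4, min_leaf=2, max_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]   # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 depth, min_leaf, max_features, rng))
    return forest

def predict_forest(forest, row):
    """Regression output: average of the individual tree predictions."""
    return mean([predict_tree(tree, row) for tree in forest])

# Made-up illustrative data with two inputs (not from the ship database).
X_demo = [[float(i), 20.0 + (i % 5)] for i in range(1, 21)]
y_demo = [10.0 * a + b for a, b in X_demo]
forest = fit_forest(X_demo, y_demo)
print(predict_forest(forest, [2.0, 22.0]), predict_forest(forest, [19.0, 22.0]))
```

The sketch returns predictions only, with no closed-form expression, which is the black-box character discussed in this section.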
For all the different regression methodologies the following fit coefficients have been determined and compared to assess the accuracy of the regressions: coefficient of determination (R²), Pearson coefficient, MAPE (Mean Absolute Percentage Error), RMSE (Root Mean Square Error), and RRMSE (Relative Root Mean Square Error). In particular, for multivariable regressions the additional values of adjusted R² (R²adj), SE (Standard Error), t-stud, and p-value have been evaluated. In the formulations of the fit coefficients, y_i are the n observations, y_i* the predicted values, ȳ is the mean value of the observations, ȳ* is the mean value of the predicted variable and n_p is the number of parameters used in the regression model. The choice of regression depends on the available input parameters: if only the number of TEU is available as an input, simple regressions or forest trees have to be used; if the speed is also known, multiple regressions or forest trees as a function of speed and TEU have to be adopted. Moreover, if the main dimensions are also available, the corresponding multiple regressions or forest trees can be employed. In any case, since the forest tree algorithm is a black-box methodology that does not give the equation coefficients, the reproducibility of the results may be achieved only through trials of simple and multiple regressions.
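The fit coefficients listed above can be computed as in the following sketch. It uses common textbook definitions, which are assumptions here and may differ in detail from the formulations adopted in this work; in particular, RRMSE is taken as RMSE divided by the mean of the observations, and the adjusted R² uses n − n_p − 1 degrees of freedom. The function name and the sample values are hypothetical.

```python
import math

def fit_metrics(obs, pred, n_params=1):
    """Common fit coefficients for observations y_i and predictions y_i*."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_o) ** 2 for o in obs)
    ss_pred = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return {
        "R2": r2,
        # assumed form of the adjusted coefficient of determination
        "R2_adj": 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1),
        "Pearson": cov / math.sqrt(ss_tot * ss_pred),
        "MAPE": 100.0 / n * sum(abs((o - p) / o) for o, p in zip(obs, pred)),
        "RMSE": rmse,
        # assumption: RMSE normalised by the mean observation
        "RRMSE": rmse / mean_o,
    }

# Made-up values for illustration only.
metrics = fit_metrics([100.0, 200.0, 300.0, 400.0],
                      [110.0, 190.0, 310.0, 390.0])
print(metrics)
```

MAPE is undefined when an observation equals zero, which is not a concern for ship dimensions, speed or power.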

Ship database and statistics
The database used for this study has been generated from the "Hyundai Heavy Industries Shipbuilding Group" catalogue (HHI shipbuilding group performance record, 2022), which includes 971 ships, 260 of which are not sisterships, and represents about 20% of the container ships of the world fleet. The ships, built from 1979 to 2022, have been classified by main characteristics: length, breadth, moulded depth, draught, maximum speed, TEU, engine power, and delivery date, as summarized in Table 1. Only 13 ships are twin-screw; all the others are single-screw.
This database has been chosen since Hyundai can be considered the world's largest builder of container ships and no other reliable data were available for this study. The range of data is very wide, ensuring a considerable variability, as shown in Table 1. Furthermore, the database ensures the use of the same definitions for all dimensions and variables among all the vessels.
The ships of both the complete and the non-sistership database have been divided into classes, as shown in Fig. 1. The ships have been grouped based on the main dimensions. Based on ship length, the samples have been sorted in intervals of 10 m, for both the total number of ships and the non-sisterships, as shown in Fig. 2. It can be noticed that almost one-third of the complete database is composed of ships having a length between 275 and 295 m (223 ships) or between 345 and 355 m (95 ships).
From now on, the statistics and all data analysis will consider only the 260 non-sisterships. The container ships' main dimensions are not continuous variables due to the container dimensions; therefore, the interval for the ship breadth and depth has been chosen as 2.6 m (2.54 m for container breadth/height and 0.06 m for the spacing). As shown in Fig. 3, most of the ships have a breadth between 29.7 and 32.3 m. The ship length and number of TEU have been reported as a function of the delivery date and are shown in Fig. 6, while the relation between ship maximum speed, number of TEU and years is shown in Fig. 7. The upper limit of the length and the number of TEU can be easily described by linear and exponential trendlines, respectively. On the other hand, the speed is not correlated with the number of TEU, and it is not relatable to a known simple distribution. Before the 21st century, ships reached a maximum length of 300 m and a capacity of 7500 TEU.
Ships of 350 m and 10000 TEU started to be built around 2008 and ships of 400 m and 20000 TEU appeared in 2015. The fluctuating tendency of ship speed and its decrease in recent years, despite the increase in the number of TEU and dimensions, are probably related to the limit of marine engines of 80 MW and to the optimization of the route for greenhouse gas emissions and for scheduled travelling days. Regardless of the number of TEU, the speed lies in a range between 21 and 26 kn.
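The 2.6 m breadth classes can be sketched as below. The class width and the 29.7 m boundary come from the text; assuming that every class boundary is aligned to 29.7 m is an illustrative choice, and the function name and sample breadths are hypothetical.

```python
import math
from collections import Counter

BIN_WIDTH = 2.6    # m: 2.54 m container breadth/height + 0.06 m spacing
BIN_ORIGIN = 29.7  # m: class boundary quoted in the text (alignment assumed)

def breadth_class(breadth_m):
    """Return the (lower, upper) bounds in metres of the class containing breadth_m."""
    k = math.floor((breadth_m - BIN_ORIGIN) / BIN_WIDTH)
    lower = BIN_ORIGIN + k * BIN_WIDTH
    return round(lower, 1), round(lower + BIN_WIDTH, 1)

# Hypothetical breadths, for illustration only.
sample_breadths = [32.2, 32.2, 30.5, 48.2, 25.0]
histogram = Counter(breadth_class(b) for b in sample_breadths)
print(histogram)
```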

Simple regression analysis
In this section, simple regressions have been developed to estimate the different parameters. In the early design stage of a container vessel, the main requested parameters are the number of TEU and/or DWT and the speed. Therefore, these were chosen as independent variables for the regression analysis. Different types of simple regressions (power, logarithmic and polynomial), as reported in Appendix A, have been performed and their coefficients of determination (R²) have been compared in Appendix C, Table C1. Moreover, for each ship dimension as a function of TEU, the fit coefficients MAPE, RMSE, RRMSE and Pearson have been determined to identify the best type of regression and are reported in Table C2.
The whole set of best regressions for ship main characteristics and the related formulas are reported in Equations (10)-(31) and have been presented in Figs. 8-18. Particular attention should be given to the regressions of the engine power, P_ENG, when 40 MW are exceeded: as can be seen in Fig. 16, there is a large scatter of data above this value, and Equations (20) and (21) are valid only up to this power limit (a more detailed explanation is given below).
The relations between main dimensions are well approximated by a power function and the data points are close to the regression curve, as shown in Fig. 8. It can be noted that in many cases the B value is constant for an increasing length and the increase in B has a discrete step strictly due to the container dimensions. For the nondimensional ratios B/T and B/D as a function of L/B shown in Fig. 9, the spread of data indicates no correlation between the variables, which is also confirmed by the low R² of the obtained linear trendline.
Figs. 10-12 show the variation of each dimension as a function of the number of TEU. All dimensions are approximated with a power function except the length, which is better estimated by a logarithmic function. All data are well gathered around the tendency line, with values of R² around 0.95-0.98, except for the draught and depth, which are more scattered, with R² values of about 0.85 and 0.89, respectively.
In the present database, the value of the deadweight (DWT) was not available; therefore, the correlation between TEU and DWT, reported in Equation (32), from Abramowski et al. (2018), has been adopted to generate the necessary data to obtain the regressions as a function of DWT, as presented in Figs. 13-15.
DWT regressions were best fitted by linear equations for the product of the main dimensions, as in Fig. 13, and by power formulas for the single dimensions (Figs. 14 and 15), except for the length, fitted by a logarithmic function. In all cases, the data samples are close to the regression curves; a few points are more scattered when analyzing the depth values, especially for D = f(DWT), where the R² value decreases to 0.846. This may be due to the change of the D value when keeping the B and T values constant.
It can be noted that the increasing step of the B and T variables is discrete for both TEU and DWT regressions, due to container sizes and limiting channel dimensions.
In the analysis of the installed power as a function of ship length or speed, two regions have been identified in Fig. 16: a regression curve and a rectangular area. When ships have an engine power lower than 40 MW (data represented by black dots), the regression curve has been approximated with a power function, and the formula is reported in the graphs. For ships with higher engine power, the dependency of engine power on ship length or speed can no longer be represented by a regression curve. The data are very scattered and a rectangular region can be identified. It is clearly visible that for the VLCS and ULCS the ship speed varies from 21 to 26 knots, probably set as the design requirement to serve some specific route in a certain number of days.
The engine power and the ship speed as a function of TEU or DWT are presented in Figs. 17 and 18, respectively. The data are scattered and spread uniformly across the graph. Although the best regression is represented by a power function, the R² values are lower than 0.6 for engine power and lower than 0.2 for speed, highlighting a low direct dependency of these two parameters on cargo capacity.
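The comparison of simple regression types by R² can be sketched as follows. The power form is fitted here by least squares on log-transformed data, which minimises the error in log space and is therefore an assumption that may differ from the fitting procedure actually used; the function names and the data (generated from an exact power law) are illustrative only.

```python
import math

def linfit(u, v):
    """Least-squares slope and intercept of v = slope*u + intercept."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    slope = (sum((a - mu) * (b - mv) for a, b in zip(u, v))
             / sum((a - mu) ** 2 for a in u))
    return slope, mv - slope * mu

def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def fit_power(x, y):
    """y = a * x**b, fitted as ln(y) = ln(a) + b*ln(x)."""
    b, ln_a = linfit([math.log(v) for v in x], [math.log(v) for v in y])
    a = math.exp(ln_a)
    return a, b, lambda t: a * t ** b

def fit_log(x, y):
    """y = a * ln(x) + b."""
    a, b = linfit([math.log(v) for v in x], y)
    return a, b, lambda t: a * math.log(t) + b

# Made-up data following y = 5 * x**0.4 exactly (not from the ship database).
x_demo = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]
y_demo = [5.0 * v ** 0.4 for v in x_demo]

a_pow, b_pow, power_model = fit_power(x_demo, y_demo)
a_log, b_log, log_model = fit_log(x_demo, y_demo)
r2_power = r_squared(y_demo, [power_model(v) for v in x_demo])
r2_log = r_squared(y_demo, [log_model(v) for v in x_demo])
print(a_pow, b_pow, r2_power, r2_log)
```

Because R² is evaluated in the original units for both forms, the two candidate regressions are compared on the same basis.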

Comparison with previous studies
The present regression equations have been compared with the ones reported by Cepowski and Chorab (2021) and in Papanikolaou (2014), who recalled formulations and regressions developed by Kolakarinos et al. (2000-2005).
Since in Papanikolaou (2014) all ship dimensions are estimated only as a function of the DWT, the correlation between DWT and TEU, obtained by the regression analysis in Papanikolaou (2014), is reported in Equation (33) and has been used for the comparison in terms of TEU.
As already mentioned in Section 4, the value of DWT for the present database has been obtained using Equation (32) (Abramowski et al., 2018).
Table 2 and Figs. 19-23 illustrate the comparison between the formulas of the regression analysis for the present database and the ones reported in Papanikolaou (2014) and Cepowski and Chorab (2021).
Fig. 19 compares the results of the present analysis and the ones from Papanikolaou (2014) in terms of ship dimensions, velocity, and number of TEU. While the dependency of breadth on length has a similar tendency for both regressions over the whole range of dimensions, the estimations of the product LBD follow two different tendencies for values greater than 5000 TEU. The database used in Papanikolaou (2014) included ships up to 300 m and 80000 DWT (about 6000 TEU), so extrapolating the original regression formula to bigger ships leads to overprediction. The present regression gives a better overall approximation of the database for the velocity dependency on TEU, even though the higher speed values are not captured by a power regression.
Figs. 20 and 21 show the comparison between the three regressions for the estimation of ship dimensions as functions of TEU. While breadth and draught approximations are in good agreement in all three cases, two observations can be made for length and depth: when the regression formulas in Papanikolaou are extrapolated outside the limits of the referenced database (about 1000 TEU), the values are overestimated and do not follow the tendencies of newly built ships; and the formulas in Cepowski and Chorab (2021) have been reported with some typographical errors.
The obtained regression formulas for the main ship dimensions have been applied to a hypothetical 30000 TEU ship and compared with Garrido et al. (2020). The estimated length for 30000 TEU is equal to 421 m for the present database regression, 433 m with Cepowski's formulation, and 531 m following the Papanikolaou regression formula. Since the length evaluated by Garrido et al. (2020) is about 420 m, the curve that estimates the closest value is the one obtained from the present database.
Ship dimensions as functions of the DWT are reported in Figs. 22 and 23; in this case, the estimations of Papanikolaou are in good agreement with the other two analyses. This difference from the tendency found for the TEU dependencies may be attributed to the different correlation formulas adopted for DWT and TEU.
In the works of Papanikolaou (2014) and Cepowski and Chorab (2021), R² values were not available for all formulas. Therefore, to evaluate the goodness of the formulas with the present database, the R² values have been calculated using Equation (4), taking the present database as the observed values and the values predicted by the formulas reported in Table 2 as the estimated values. The calculated R² values for the ship dimensions as a function of the number of TEU are presented in Table 3. For Cepowski and Chorab (2021), the R² values have been calculated only for the DWT formulation, due to the typos in the formulas as a function of TEU. It can be seen that, except for the B value, all others are better predicted by the formulas of Cepowski and Chorab (2021). The comparison for ship speed has not been reported since the correlation with TEU is very low.

Multivariable regressions
In this section, different multivariable regressions (MR) are presented in the following forms.
• MR1 for ship dimensions and engine power as a function of V and TEU;
• MR2 for ship speed and engine power as a function of L, D and TEU;
• MR3 for ship speed and engine power as a function of L, T and TEU;
• MR4 for ship speed and engine power as a function of L, B, D and TEU.
For each regression, the values of the estimates, SE, t-stud, and p-value have been evaluated and reported in Appendix B. The whole set of fit coefficients (R², R²adj, MAPE, RMSE, RRMSE and Pearson) has been reported in Table C3. An extended analysis of the multiple linear regressions has been conducted. Multicollinearity has been checked according to the VIF (Variance Inflation Factor), which highlights no collinearity for any of the regressions as a function of V and TEU, as shown in Table C5 in Appendix C. Multicollinearity is present in the regressions provided as a function of L, B, D and TEU. Power and speed regressions as a function of L, D, TEU and of L, T, TEU show a moderate collinearity compared to the previous regressions. For all the regressions, the normality of the residuals has been checked with the Kolmogorov-Smirnov test, giving positive results for all tested cases. Furthermore, heteroscedasticity has been evaluated according to the Breusch-Pagan test, as shown in Table C6 in Appendix C, detecting homoscedasticity only for the regression of B as a function of V and TEU and for V as a function of L, T and TEU. All other cases are affected by heteroscedasticity of the data, which decreases the reliability of the final regression. In the MR4 regressions, with four independent variables, heteroscedasticity is associated with the presence of multicollinearity; in the other cases it is related only to the nature of the data. In any case, since the objective of the regressions is the estimation of the dependent variable and not the influence of each parameter on the final regression, the detection of non-constant variance does not require the manipulation of the input data to eliminate the problem. The normality of the multiple linear regression residuals can be found in Appendix D.
Each ship characteristic estimated by the different multivariable regressions is compared with the corresponding original value of the database, as shown in Figs. 24-29. In each figure, the spread of data around the bisector indicates the goodness of fit of each regression: the closer the points are to the bisector line, the better they are estimated by the regression formulas.
In Figs. 24 and 25, the estimated ship dimensions (L, B, D and T) are functions of V and TEU. For length and breadth the values are well gathered around the bisector line, highlighting the goodness of the regression formulas, while for the depth and the draught the values are more spread. The fit coefficients in Table C3 confirm these tendencies, with R², R²adj, RMSE and Pearson coefficients higher and MAPE and RRMSE coefficients lower for the L and B dimensions. In the graph reporting the breadth, the discrete step linked to container sizes recurs.
Figs. 26-29 present the comparison of different regression methods for engine power and ship speed. As also confirmed by the fit coefficients in Table C3, the MR4 regressions give the best estimates, even though the small improvement may not be worth the increase in input variables. In particular, the R², R²adj and Pearson coefficients are higher for MR4 and the MAPE, RMSE and RRMSE coefficients are lower. It is worth noting the high scatter of data around the ship speeds of 21 and 22 knots, where for a constant value in the present database the estimated values vary in a range of about 2 knots. A similar tendency can be seen for the engine power around 68000 kW. This may be due to the limitations in actual engine performance and to fixed ship speeds on voyages for different ships.
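For a regression with exactly two independent variables, such as MR1 (V and TEU), the VIF reduces to 1/(1 − r²), where r is the Pearson correlation between the two variables. The sketch below shows only this special case; the function names and the data are illustrative assumptions, and the thresholds often quoted as rules of thumb (VIF above about 5 or 10 suggesting problematic collinearity) are not taken from this work.

```python
import math

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u)
    sv = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(su * sv)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Made-up predictor values, for illustration only.
x1_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_demo = [2.0, 1.0, 4.0, 3.0, 5.0]
vif_demo = vif_two_predictors(x1_demo, x2_demo)
print(vif_demo)
```

With three or more independent variables (MR2 to MR4), each VIF instead requires regressing one variable on all the others and using the R² of that auxiliary regression.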

Forest trees
Besides multiple linear regressions, forest tree regressions are a suitable advanced technique to investigate the dependencies of the main dimensions of container ships on one or more parameters. The forest tree algorithm estimates the output by averaging the predictions of several individual trees (Ho, 1998), thus reducing the overfitting problem of individual trees. Here, the MATLAB application for the determination of forest trees is applied to the database, providing regressions for the quantities of interest. The forest trees for simple regressions (parameters as a function of TEU) and multivariable regressions (MR1, MR2, MR3 and MR4) have been computed to estimate ship main dimensions, ship speed and engine power. The values estimated by the forest tree algorithm have been compared with the original ones of the present database, as described for the multivariable regressions, and are reported in Figs. 30-38 with the corresponding coefficient of determination. In the graphs, the subscript SR denotes the simple regression and MRi the multivariable regression approximations.

Table 2
Comparison of the obtained regression formulas from the literature.

Table 3
Comparison of coefficients of determination with Papanikolaou (2014) and Cepowski and Chorab (2021).As expected, for all parameters, except for the depth, data is less scattered for multivariable regressions than for simple ones and it is confirmed by the high value of R 2 .In particular, Figs. 30 and 31, presenting L and B values, have well-gathered data around the bisector, while T and D values,shown in Figs. 32 and 33, are more scattered, as already observed for multivariable regressions in Section 6. Figs.34 and 35 present the different regressions for the ship speed and Figs.36-38 the regressions for engine power.The best fitting, with higher R 2 , is estimated by MR2, differently than in the case of multivariable regressions of Section 6, where the best fitting was found for MR4.Overall, the forest tree algorithm better fits the database compared to the linear multivariable regression in Section 6, as can be seen from the graphs and the values of all fitting coefficients in Table C4 of Appendix C.
The disadvantage of a forest tree is the absence of a simple regression formula for determining the desired variables. However, having a database at one's disposal from which to build the forest tree makes it possible to apply this method in the early design stage as well. In particular, knowing the maximum coefficient of determination R² from the forest tree, it is possible to investigate which regression formula, simple or multiple, linear or nonlinear, reaches this value.

Application example
To have an insight into the possible results when estimating ship parameters in the early design stage, a container ship of 20000 TEU designed to sail at a speed of 23 kn has been considered as a test case.
The different regression methods have been used to estimate ship dimensions, speed and engine power, and the results are shown in Table 4. For the simple regressions the parameters are functions of TEU only, and Equations (16)-(19), (28) and (30) have been applied. For the multivariable regressions, ship dimensions and engine power have been estimated from the equations in Appendix B in the form Y = f(TEU, V).
The generic equation for a specific parameter can be written as: Y = where c_i are the estimated coefficients of the variables V^k and TEU^t and d is the intercept. Forest tree regressions have also been applied.
Table 5 presents the percentage difference between the values estimated using forest trees and multivariable regressions and the values obtained by the simple regressions. A large difference appears in the estimation of engine power, highlighting the better fit obtained by multivariable formulations and forest tree algorithms.
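The percentage difference described above can be computed as in this short sketch. Taking the simple-regression value as the reference and keeping the sign are assumptions, and the numbers are placeholders that are not taken from Table 4 or Table 5.

```python
def percentage_difference(estimate, reference):
    """Signed difference of `estimate` from `reference`, in percent."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return 100.0 * (estimate - reference) / reference


# Placeholder values only (hypothetical, NOT the paper's results):
# L is a length in m, P an engine power in kW.
simple_regression = {"L": 400.0, "P": 60000.0}
other_method = {"L": 396.0, "P": 69000.0}

differences = {
    name: percentage_difference(other_method[name], simple_regression[name])
    for name in simple_regression
}
```

With these placeholders the length differs by -1% and the power by +15%, the kind of contrast between dimensions and engine power that the comparison is meant to expose.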

Discussion and conclusions
In the present work, a database of container ships representing 20% of the world fleet, with vessels built up to 2022, has been analyzed. The database includes about 1000 ships, 260 of which are non-sisterships. A complete regression analysis, using methods of different

Appendix C
In this Appendix the coefficients of determination for simple and multivariable regressions are compared.

Table C1
Coefficients of determination R² for simple regressions.

Small Feeder (up to 1000 TEU); Feeder (1001-2000 TEU); Feedermax (2001-3000 TEU); Panamax (3001-5100 TEU); Post-Panamax (5101-10000 TEU); New Panamax (10000-14500 TEU); Ultra Large Container Vessel (14501 TEU and higher). This range division follows Kristensen (2012), but each grouped database would have been too small to perform the regression analysis. Most of the ships fall in the Panamax and Post-Panamax classes and only about 10% of ships (102 out of the complete database) have less than 2000 TEU. Fig. 1 also presents the number of ships delivered each year: although the first ship was built in 1978, only 33 non-sisterships (111 out of the complete database) were built before 2000, 227 non-sisterships were built from 2000 to 2022, and 55 of them were built from 2015 to 2022.
draught between 11 and 13 m, which are the maximum allowable dimensions to cross the Panama Canal. The statistics on nondimensional values reported in Figs. 4 and 5 highlight that a considerable part of the ships (138 out of 260) has L/B in the range 6.5-7.5; 73 ships have B/T between 3.2 and 3.4; and 183 have T/D between 0.48 and 0.58. The most frequent depth values go from 21 to 25 m, as shown in Fig. 5.
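The size classes listed above can be encoded as a simple lookup, sketched below. The function and constant names are hypothetical. The value of 10000 TEU appears in both the Post-Panamax and New Panamax ranges; assigning it to Post-Panamax here is an assumption.

```python
from bisect import bisect_left

# Inclusive upper TEU bound of every class except the last, open-ended one.
UPPER_BOUNDS = [1000, 2000, 3000, 5100, 10000, 14500]
CLASS_NAMES = [
    "Small Feeder",
    "Feeder",
    "Feedermax",
    "Panamax",
    "Post-Panamax",
    "New Panamax",
    "Ultra Large Container Vessel",
]


def size_class(teu):
    """Return the size class for a given capacity in TEU."""
    if teu <= 0:
        raise ValueError("TEU capacity must be positive")
    # bisect_left gives the index of the first bound that is >= teu.
    return CLASS_NAMES[bisect_left(UPPER_BOUNDS, teu)]


test_case_class = size_class(20000)
```

A table of upper bounds keeps the class limits in one place, so a different subdivision only requires editing the two lists.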

Fig. 1. Database statistics by class dimensions and delivery date.

Fig. 16. Regression analysis of engine power on ship length and speed.

Fig. 17. Regression analysis of engine power on TEU and DWT.

B. Rinauro et al.

Fig. 24. Multivariable regression of ship dimensions as a function of V and TEU.

Fig. 27. Multivariable regression of ship speed and engine power as a function of L, D and TEU.

Fig. 28. Multivariable regression of ship speed and engine power as a function of L, T and TEU.

Fig. 29. Multivariable regression of ship speed and engine power as a function of L, B, D and TEU.

Fig. 38. Engine power estimation by forest tree multivariable MR4 regression and all regressions together.

Table 1
Container ship database characteristics.

Table C2
Comparison of goodness of fit coefficients for the simple regressions

Table C4
Comparison of goodness of fit coefficients for forest tree

Table C6
Breusch-Pagan test results for multiple regressions.