Enhancing Movie Recommendations: An Ensemble-Based Deep Collaborative Filtering Approach Utilizing AdaMVRGO Optimization

ABSTRACT


INTRODUCTION
Over several decades, recommendation systems have emerged as a promising research field, harnessing user preferences to curate effective recommendations [1][2][3]. The algorithms underpinning these systems are widely employed across diverse domains ranging from movie and music recommendations to product suggestions, crop selection, and e-book recommendations [4]. The primary objective of a recommendation system is to predict users' preferences for movies, music, products, crops, books, and news based on their historical preferences [5][6][7][8].
Broadly, recommendation systems can be categorised into three primary types: collaborative filtering, content-based filtering, and hybrid filtering. Among these, collaborative filtering has been acknowledged as the most effective and widely utilised method for generating recommendations [9]. This technique predicts user ratings by leveraging historical ratings and ratings from similar users [10][11][12]. Collaborative filtering can be further divided into two subcategories: memory-based and model-based methods [13][14][15]. The former, also termed a "neighbourhood-based approach," employs statistical techniques to identify similar users and items, with unknown ratings predicted based on these similarities. The latter, on the other hand, uses machine learning techniques to learn the rating matrix and predict unknown ratings. However, collaborative filtering techniques face significant challenges, including data sparsity, scalability, cold start, and shilling attacks [16], all of which can negatively impact the accuracy and performance of recommendations. Data sparsity, referring to the paucity of data in the rating matrix due to unknown ratings, is a particular challenge. Matrix factorization, an efficient model-based collaborative filtering strategy, has been used to tackle this problem [17]. It decomposes the rating matrix into a pair of low-dimensional rectangular matrices and predicts unknown ratings using a simple dot product [11,15,18]. However, this linear dot product may not fully encapsulate the complex interactions between users and movies. Hence, current research efforts are directed towards the application of deep learning algorithms to address the limitations of traditional recommendation systems [19]. Deep learning-based latent factor models are adept at capturing the hidden, non-linear, and crucial interactions between users and movies, thereby enhancing prediction accuracy [17].
Ensemble learning is a powerful strategy renowned for enhancing the accuracy, resilience, and diversity of learning algorithms. Recently, recommendation systems that leverage ensemble learning have gained considerable attention [20]. Deep ensemble models amalgamate the benefits of both deep and ensemble learning, thereby improving the overall performance of recommendation systems [20,21]. Ensemble-based deep collaborative filtering holds several advantages over deep collaborative filtering, such as: 1) superior performance over individual models, particularly in the presence of noisy data; 2) the ability to generate more diverse recommendations compared to individual models; and 3) robustness against overfitting by virtue of reduced model variance.
In this paper, we propose a deep collaborative filtering framework, termed Deep CF-AdaMVRGO, for movie recommendation systems to effectively handle sparsity issues [22]. This framework is underpinned by an optimizer known as the Adaptive Moment Variance Reduced Gradient Optimizer, as proposed in the study [11]. The deep collaborative filtering framework employs an ensemble of multilayer perceptrons to augment prediction performance. Each sub-model is fine-tuned using the proposed AdaMVRGO optimizer to minimize the reconstruction error. The final predicted rating is obtained by averaging the predicted values derived from the sub-models within the ensemble [11].
Typically, optimization algorithms such as SGD, Adagrad, RMSProp, ADAM, and SVRG are used for parameter tuning. While ADAM is often considered the preferred optimization algorithm in deep learning applications, it is known to suffer from slow convergence and generalization issues [23]. Convergence problems persist with ADAM due to high variance. The proposed AdaMVRGO algorithm is conceived by combining the strengths of the ADAM and SVRG optimizers [11]. One key advantage of the proposed algorithm is that it reduces variance during the gradient calculation itself. The first and second moments are computed based on these variance-reduced gradients. The proposed framework, when trained with AdaMVRGO, demonstrated superior performance compared to Adagrad, RMSProp, ADAM, and SVRG.
The remainder of this paper is organized as follows: Section 2 presents a literature review on movie recommendation systems. Section 3 discusses the prerequisite groundwork for the proposed framework. The proposed framework and its pseudocode are presented in Section 4. Section 5 provides a comprehensive discussion of the findings and interpretation of the results. Finally, Section 6 concludes the paper and outlines potential future enhancements.

LITERATURE STUDY
Huang et al. [31] introduced a technique known as Neural Embedding Collaborative Filtering (NECF) matrix factorization. This method utilizes a probabilistic autoencoder to generate neural embedding vectors from user-item input. These vectors are subsequently used to represent the user's hidden features via a regression equation employing single-point negative sampling [24,31]. The method was tested on the M-1M and Pinterest datasets, and the results revealed that this strategy outperformed its baseline counterparts. However, it still grapples with the issue of data sparsity.
Sun et al. [24] developed an innovative framework that amalgamates plot texts and movie ratings to enhance prediction accuracy. They designed and tested a deep plot-aware generalized matrix factorization on the MovieLens datasets [24,31]. Even though the integration of additional knowledge improved the model, it led to increased computational complexity.
Feng et al. [32] proposed a unique similarity approach that considers both linear and non-linear correlations. To bolster prediction accuracy and resilience, this method combined multifactor similarity with global rating information. However, the approach was found lacking in terms of scalability.
Taking into account the historical data of users and items, Fu et al. [25] devised a novel deep learning technique for movie recommendations. Initially, user and item vectors were trained to incorporate semantic information, reflecting the relationships between items and users [11]. A multi-view feedforward neural network was then employed to predict ratings from these embedded vectors. The proposed model was evaluated on the MovieLens 1M and MovieLens 10M datasets [25]. Despite its promising performance, the model was computationally intensive and encountered issues with generalization.
Xue et al. [33] proposed an item-based collaborative filtering model using deep learning, dubbed Deep-ICF, to decipher the nonlinear and intricate relationships between items. Deep-ICF employs a simple average to calculate the predicted rating. An extension to this model, termed Deep-ICF+a, employs adaptive pooling with attention to discern higher-order interactions between items. While promising, these models are computationally demanding and are more susceptible to overfitting.
It is noteworthy that previous works often necessitate extended computational time to converge and frequently encounter generalization issues. More recent research has concentrated on developing sophisticated models that incorporate deep learning techniques. These models utilize neural networks to capture complex user-item relationships and accommodate a better representation of intricate interactions. Moreover, optimization algorithms such as stochastic gradient descent have been deployed to diminish computational time and promote convergence. These advancements have significantly improved the accuracy and scalability of recommendation systems, thereby addressing the limitations of prior works.
The key contributions of this paper are as follows:
1. We have established a deep collaborative filtering framework to capture the hidden, nonlinear, and pivotal relationships between users and movies.
2. The framework employs an ensemble of multilayer perceptrons to train the rating matrix and predict unknown ratings. This ensemble improves generalization by training each MLP within the ensemble with a unique set of latent factors.
3. Each sub-model within the ensemble is trained using a novel optimizer known as AdaMVRGO, with the aim of reducing the reconstruction error.
4. Experiments were conducted on the M-100K, M-1M, and M-10M datasets. The results demonstrate that the proposed framework outperforms existing methods.

Data representation
Let $m$ and $n$ represent the number of users and movies in a movie recommendation system, respectively [34,35], so $m \times n$ is the size of the rating matrix, which is defined as $R = \{r_{ij},\ 1 \le i \le m,\ 1 \le j \le n\}$, where $r_{ij}$ is the rating given by user $i$ for movie $j$ [6]. For our work, we used explicit ratings, i.e., ratings given explicitly by the user. An explicit rating is quantifiable feedback given by the user for a specific movie. It determines the extent to which a user prefers the movie. Most of the time, users may not be interested in giving their feedback explicitly, resulting in a sparse rating matrix. This data sparsity hinders the recommendation system's precision and performance [35].

Matrix factorization
Matrix factorization is a latent factor model used to address the sparsity issue [36,37]. It predicts the unknown ratings by splitting the rating matrix $R$ into two latent matrices, the user matrix $P$ and the movie matrix $Q$, of order $m \times k$ and $n \times k$, respectively [38]. The user matrix $P$ depicts the associations between the users and the $k$ latent features. Similarly, the movie matrix $Q$ depicts the associations between the movies and the $k$ latent features. Eqs. (1) and (2) indicate the rating matrix $R$ as well as the latent matrices $P$ and $Q$ [39]. The predicted rating matrix is generated using Eq. (3).
where $\hat{r}_{ij}$ is the predicted rating of user $i$ for movie $j$, obtained from the user vector $p_i$ and the movie vector $q_j$ [24]. Eq. (5) shows the deviation between the actual and predicted ratings [40].
To avoid overfitting, the model is redefined in Eq. (6).
where $\lambda$ is a regularisation parameter, and $\|p_i\|^2$ and $\|q_j\|^2$ are the squared $l_2$-norms of the user vector $p_i$ and the movie vector $q_j$, respectively [41]. However, the problem with traditional matrix factorization is that the dot product is linear, and it cannot capture user-movie interactions completely. Many studies have shown that deep neural networks, rather than a simple dot product, can capture nonlinear user-item interactions well.
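The factorization described above can be illustrated with a minimal NumPy sketch. It assumes the common formulation in which each predicted rating is the dot product $p_i \cdot q_j$ and the regularised objective sums, over observed ratings, the squared error plus $\lambda(\|p_i\|^2 + \|q_j\|^2)$. The toy rating matrix, the function names, the learning rate, and the plain SGD update are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def predict_ratings(P, Q):
    """Predicted rating matrix: entry (i, j) is the dot product p_i . q_j."""
    return P @ Q.T

def regularised_loss(R, mask, P, Q, lam):
    """Sum over observed (i, j) of e_ij^2 + lam * (||p_i||^2 + ||q_j||^2)."""
    err = mask * (R - predict_ratings(P, Q))
    penalty = mask.sum(axis=1) @ (P ** 2).sum(axis=1) \
            + mask.sum(axis=0) @ (Q ** 2).sum(axis=1)
    return np.sum(err ** 2) + lam * penalty

def sgd_epoch(R, mask, P, Q, lam=0.02, lr=0.01):
    """One stochastic-gradient pass over the observed entries (factor 2 folded into lr)."""
    for i, j in zip(*np.nonzero(mask)):
        p_i = P[i].copy()
        e_ij = R[i, j] - p_i @ Q[j]
        P[i] += lr * (e_ij * Q[j] - lam * p_i)
        Q[j] += lr * (e_ij * p_i - lam * Q[j])

# Toy 4 x 4 rating matrix (hypothetical values); 0 marks an unknown rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = (R > 0).astype(float)

rng = np.random.default_rng(0)
k = 2  # number of latent features
P = 0.1 * rng.standard_normal((R.shape[0], k))
Q = 0.1 * rng.standard_normal((R.shape[1], k))

loss_before = regularised_loss(R, mask, P, Q, lam=0.02)
for _ in range(200):
    sgd_epoch(R, mask, P, Q)
loss_after = regularised_loss(R, mask, P, Q, lam=0.02)
print(f"loss before: {loss_before:.2f}, after: {loss_after:.2f}")
```

The entries of `predict_ratings(P, Q)` at positions where `mask` is zero serve as the predictions for the unknown ratings.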

Neural network ensemble
Ensemble learning is the process of combining the different models' predictions to enhance their accuracy [42,43]. Neural networks are non-linear and can learn complex patterns in the data, but suffer from high variance. Ensemble modelling of neural networks helps to reduce the variance by building different models as opposed to just one and combining their predictions to enhance the overall performance. The benefits of this ensemble are: First, it can help reduce the variance of the predictions, thereby improving the model's generalisation performance [44]. Second, it helps to increase the precision of the predictions by combining the strengths of multiple neural networks [45]. Third, it can help the model handle noise and errors in the data better. The various ways to combine the predictions in ensemble learning in movie recommendations include:
• Voting: This is the simplest method, and it simply takes the most common prediction among the individual models. For example, if three models predict that a user will like a movie and two models predict that the user will not, then the ensemble model will predict that the user will like the movie.
• Averaging: This method takes the average of the individual predictions. This helps to reduce the variance in predictions, thereby improving the model's generalisation performance [44].
• Weighted averaging: This method weights the individual predictions according to their accuracy. It helps enhance prediction accuracy by assigning more weight to the more accurate models.
• Stacking: This method uses a meta-model approach to make the final predictions. The meta-model uses the predictions of each model to make final predictions. This helps improve prediction accuracy by combining the strengths of each model [46].
The selection of ensemble methods is application-dependent. If the objective is to improve prediction accuracy, for instance, then a voting or weighted averaging scheme may be a better choice. However, if the objective is to develop a model more robust to noise, then an averaging scheme may be a better choice [6].
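The first three combination rules listed above can be sketched in a few lines of NumPy. The function names are illustrative, the weights are assumed to be supplied by the caller (for example, derived from validation accuracy), and ties in the vote are resolved as "dislike" purely by convention in this sketch.

```python
import numpy as np

def vote(labels):
    """Majority vote over binary predictions (1 = like, 0 = dislike); ties give 0."""
    labels = np.asarray(labels)
    return int(labels.sum() * 2 > labels.size)

def average(preds):
    """Plain mean of the individual rating predictions."""
    return float(np.mean(preds))

def weighted_average(preds, weights):
    """Mean of the predictions, weighted by caller-supplied model weights."""
    return float(np.average(preds, weights=weights))

# The voting example from the text: three models say "like", two say "dislike".
print(vote([1, 1, 1, 0, 0]))
print(average([3.0, 4.0, 5.0]))
print(weighted_average([3.0, 5.0], weights=[1, 3]))
```

Stacking is omitted here because it requires training a separate meta-model on the sub-models' predictions.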

Optimization
The optimization algorithm has become essential for training a deep learning architecture. It starts with defining the loss function and ends with minimizing it using any gradient optimizer. Gradient-based optimization algorithms are extensively employed in latent factor-based collaborative filtering algorithms to learn explicit rating-based recommendation models [18,47]. There are various optimizers to serve this purpose. The major problem with gradient-based optimisation algorithms is determining the learning rate, as model convergence depends on the learning rate. Lower learning rates need more time for convergence, while higher learning rates may skip the optimal solution.
The ADAM optimizer is a widely used stochastic optimization algorithm for deep learning models [23]. Even though it is considered the best adaptive learning optimization algorithm, it still suffers from convergence problems due to its inherent variance. Variance makes convergence harder, especially when parameters are close to their optimal values. According to conventional research, a declining learning rate can help reduce variance [48,49]. When the learning rate is low, however, the training loss converges slowly [50]. We use a stochastic variance-reduced gradient in this paper to reduce variance during the ADAM process. AdaMVRGO is obtained by reducing the variance of ADAM using the SVRG optimizer, and it is used to learn the recommendation model.
ADAM [51] is a variant of the stochastic gradient algorithm [52]. ADAM combines the benefits of the AdaGrad [53] and RMSProp [54] algorithms. ADAM works well with both sparse and noisy data. The first moment uses an exponentially weighted average of past gradients to converge to the minima faster, which is given in Eq. (7).
The second moment, which is given in Eq. (8), employs an exponential moving average of the squared gradients (i.e., the uncentered variance).
where $u_t$ is the first moment (moving average of the gradients) at time $t$, $u_{t-1}$ is the first moment at time $t-1$, $v_t$ is the second moment (moving average of the squared gradients) at time $t$, $v_{t-1}$ is the second moment at time $t-1$, $g_t$ is the gradient of the loss function $L$ with respect to $\omega_t$, $\delta_t$ is the learning rate at time $t$, $\beta_1$ and $\beta_2$ are the moving average parameters (set to 0.9 and 0.999, respectively, in Algorithm 1), $\omega_{t+1}$ represents the weights at time $t+1$, and $\omega_t$ represents the weights at time $t$ [55].
As the first and second moment vectors are initialized to zero in the algorithm [56], it is observed that the algorithm is biased towards zero. Hence, these vectors are corrected, as shown in Eqs. (9) and (10).
Finally, the ADAM update rule is given as shown in Eq. (11).
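Eqs. (7) to (11) are not reproduced in this text. For reference, the standard ADAM formulation, written in the notation defined above, reads as follows. The small constant $\epsilon$ that guards against division by zero is part of standard ADAM but is not defined in the surrounding text, so it should be read as an assumption here.

```latex
\begin{align}
u_t &= \beta_1 u_{t-1} + (1-\beta_1)\, g_t \tag{7}\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \tag{8}\\
\hat{u}_t &= \frac{u_t}{1-\beta_1^{\,t}} \tag{9}\\
\hat{v}_t &= \frac{v_t}{1-\beta_2^{\,t}} \tag{10}\\
\omega_{t+1} &= \omega_t - \delta_t\, \frac{\hat{u}_t}{\sqrt{\hat{v}_t}+\epsilon} \tag{11}
\end{align}
```

The bias correction matters most in the first few steps: at $t = 1$, $u_1 = (1-\beta_1) g_1$, and dividing by $1-\beta_1$ recovers $\hat{u}_1 = g_1$.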
Although ADAM is considered the best optimization algorithm for deep learning applications, it still suffers from slow convergence and generalization issues. Wilson et al. [57] showed that adaptive algorithms generalize less well than SGD [58]. Liu et al. [23] compared adaptive gradient methods with nonadaptive gradient methods and examined the issues of poor convergence for specific objective functions, no advantage from utilising moving averages, and the lowest generalisation performance. These issues may be due to the presence of large variance in the early epochs of training a model. Lower variance improves the convergence rate without any change to the learning rate [59]. Variance reduction methods are used to reduce the variance and achieve better generalization and a fast convergence rate.

Proposed framework
As matrix factorization cannot completely capture the latent features, we employed neural networks to predict the sparse ratings [60]. Neural networks are usually non-linear and can capture trivial or complex relationships between the user and movies. But neural networks suffer from high variance. So, in this paper, an ensemble of multilayer perceptrons is used for a movie recommendation system. An ensemble of neural networks is considered to reduce variance and generalization error. The proposed Deep-CF framework is presented in Figure 1.
Deep collaborative filtering (Deep CF) uses an ensemble of multilayer perceptrons to train the model by capturing the nonlinear interactions between the user and movies. The proposed framework uses three multilayer perceptrons (MLPs): M1, M2, and M3. M1 uses one hidden layer, M2 uses two hidden layers, and M3 uses three hidden layers with dropout. Each MLP is trained with a novel optimizer called AdaMVRGO to reduce the reconstruction error. The optimizer reduces noise by computing the first and second moments from the variance-reduced gradient of the SVRG algorithm. This leads to fast convergence. The final rating is obtained by averaging the predicted ratings of the three models (M1, M2, and M3). The structure of an MLP used in the ensemble framework is presented in Figure 2.
In the embedding layer, the rating matrix is split into user and movie embeddings, which are considered latent feature embeddings [11]. In the merge/concatenation layer, the latent features are given to a simple dot product [11]. This result is sent to an ensemble of MLPs to predict the sparse rating. The formula for predicting the rating [7] of each MLP is shown in Eq. (12).
where $P \in \mathbb{R}^{m \times k}$ and $Q \in \mathbb{R}^{n \times k}$ represent the latent matrices of users and movies, respectively [61]. $\hat{r}_{ij}$ represents the predicted score for user $i$ and movie $j$. $\theta$ indicates the parameters of the function $f$. The objective functions for M1, M2, and M3 are given in Eqs. (13), (14), and (15).
where $l(\cdot)$ represents the loss function, $\delta > 0$ represents the step size, and $C(\theta)$ is the regularizer. The dropout is determined by using the Bernoulli distribution. The final rating is computed using Eq. (19).
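The forward pass of the three-model ensemble can be sketched as follows. This is a simplified NumPy illustration under several assumptions that the text does not fix: the hidden-layer widths, the ReLU activation, the dropout rate of 0.2, the toy embedding sizes, and the use of an element-wise product as the merge step are all hypothetical choices. The weights are random and untrained, so the outputs are not meaningful ratings.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dims):
    """Random weights for an MLP with the given hidden layers and one output unit."""
    dims = [in_dim, *hidden_dims, 1]
    return [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x, drop_rate=0.0, training=False):
    """ReLU MLP forward pass; dropout draws a Bernoulli keep-mask during training."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
        if training and drop_rate > 0.0:
            keep = rng.binomial(1, 1.0 - drop_rate, size=h.shape)
            h = h * keep / (1.0 - drop_rate)  # inverted dropout keeps the expected scale
    W, b = params[-1]
    return (h @ W + b).ravel()

k = 8                        # latent factors (one of the values used in the experiments)
n_users, n_movies = 5, 7     # toy sizes, hypothetical
user_emb = 0.1 * rng.standard_normal((n_users, k))
movie_emb = 0.1 * rng.standard_normal((n_movies, k))

m1 = init_mlp(k, [16])            # M1: one hidden layer
m2 = init_mlp(k, [32, 16])        # M2: two hidden layers
m3 = init_mlp(k, [64, 32, 16])    # M3: three hidden layers, dropout when training

def ensemble_predict(users, movies):
    """Average the three sub-model predictions for the given (user, movie) pairs."""
    x = user_emb[users] * movie_emb[movies]   # assumed merge: element-wise product
    preds = np.stack([forward(m1, x),
                      forward(m2, x),
                      forward(m3, x, drop_rate=0.2)])  # dropout inactive at inference
    return preds.mean(axis=0), preds

users = np.array([0, 1, 2])
movies = np.array([3, 4, 5])
rating, per_model = ensemble_predict(users, movies)
```

Passing `training=True` to `forward` for M3 activates the Bernoulli mask; at prediction time the mask is disabled so that the averaged rating is deterministic.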

Learning using AdaMVRGO
Although ADAM is considered the best optimization algorithm for deep learning applications, it still suffers from slow convergence and generalization issues. Variance reduction methods are used to reduce the variance and achieve better generalization and a fast convergence rate. Variance reduction methods have recently become popular and are the best alternative to nonadaptive gradient methods such as SGD. These methods typically reduce the variance of stochastic gradients by taking a snapshot of the gradients every $m$ optimization steps. This snapshot's gradient information can be used to reduce the variance of subsequent smaller batch gradients [59]. SAG [64], SAGA [65], and SVRG [66] are the most standard variance reduction methods. SVRG cancels the stochastic gradient variance with a control variate (the average of gradients computed at different snapshots), which has zero expectation.
The proposed algorithm, AdaMVRGO, is developed using the ADAM and SVRG optimizers [11]. Algorithm 1 depicts the pseudocode for AdaMVRGO. Similar to SVRG, AdaMVRGO uses a nested looping construct. The outer loop consists of $K$ iterations, and the inner loop consists of $m$ iterations. The outer loop determines the complete gradient at a snapshot point. This snapshot is updated after every $m$ steps of parameter updating. In the inner loop, a random data point is selected uniformly and the gradient $g_t$ is found using the SVRG principle. This gradient is the variance-reduced gradient shown in Eq. (20), which assists in approaching the optimal value with a constant step size $\delta$ [67]. The weight update rule is shown in Eq. (21).
Typically, the next snapshot point is set to the inner loop's final iterate at the end of the inner loop [67], i.e., $\omega_{s+1} = x_m$. Adaptive optimization algorithms such as ADAM still suffer from convergence issues due to the presence of high variance. The change introduced by the proposed algorithm is that it reduces variance while calculating the gradient itself. The first and second moments are computed on these variance-reduced gradients.
The advantages of the AdaMVRGO algorithm compared to others are listed as follows:
• Experimentally, our proposed optimizer is robust compared to previous optimization approaches.
• Variance-reduced gradients are computed first, before they are smoothed.
• The first and second moments are computed based on the variance-reduced gradients.
• The overall variance is reduced after some epochs, which results in fast convergence.

Algorithm 1: Pseudocode of AdaMVRGO algorithm
Initialize $\omega_0$ (starting point), $K$ (outer loop), $m$ (inner loop)
for $s = 0$ to $K-1$ do
    Determine the total gradient of the objective function, $\nabla L(\omega_s)$
    Initialization: Initialize the first and second moment vectors $u_0 = 0$ and $v_0 = 0$, respectively; the exponential decay rates for the first and second moments are $\beta_1 = 0.9$ and $\beta_2 = 0.999$, respectively; decay = 0, learning rate $\delta = 0.001$, and $x_0 = \omega_s$
    for $t = 1$ to $m$ do
        Randomly select a point $i_t$
        $g_t = \nabla L_{i_t}(x_t) - \nabla L_{i_t}(\omega_s) + \nabla L(\omega_s)$

The significance of the proposed framework over existing ones is given as follows:
1) In the proposed method, the ensemble of MLPs reduces the variance by taking the average of the prediction results of each MLP. This makes the model more generalized. It helps generate diversified recommendations.
2) In the novel optimizer, AdaMVRGO, variance-reduced gradients are used to compute the first and second moments. This helps in reducing the variance, which helps in fast convergence.
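To make the nested-loop structure concrete, the following NumPy sketch combines the SVRG variance-reduced gradient from Algorithm 1 with ADAM-style moment updates. Algorithm 1 as printed ends at the gradient step, so the remaining lines (moment updates, bias correction, weight update, and setting the next snapshot to the final inner iterate) are assumptions based on the standard ADAM rule and on the description in the text. The constant `eps`, the function names, and the noise-free least-squares toy problem are also illustrative; the toy run uses a learning rate of 0.01 rather than the 0.001 of Algorithm 1 so that it converges within a short demonstration.

```python
import numpy as np

def adamvrgo(grad_i, full_grad, w0, n, K=40, m=100, lr=0.001,
             beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Sketch: ADAM moments computed on SVRG variance-reduced gradients."""
    rng = np.random.default_rng(seed)
    w_snap = np.asarray(w0, dtype=float).copy()
    for s in range(K):                       # outer loop over snapshots
        mu = full_grad(w_snap)               # full gradient at the snapshot
        u = np.zeros_like(w_snap)            # first moment, reset per snapshot
        v = np.zeros_like(w_snap)            # second moment, reset per snapshot
        x = w_snap.copy()
        for t in range(1, m + 1):            # inner loop
            i = rng.integers(n)              # uniformly chosen data point
            g = grad_i(x, i) - grad_i(w_snap, i) + mu   # variance-reduced gradient
            u = beta1 * u + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            u_hat = u / (1 - beta1 ** t)     # bias correction (assumed, as in ADAM)
            v_hat = v / (1 - beta2 ** t)
            x = x - lr * u_hat / (np.sqrt(v_hat) + eps)
        w_snap = x                           # next snapshot = final inner iterate
    return w_snap

# Toy noise-free least-squares problem (hypothetical data, for illustration only).
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 5))
w_true = np.array([1.0, -2.0, 3.0, -1.0, 2.0])
y = X @ w_true

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_i(w, i):
    """Gradient of the loss on the single sample i."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    """Gradient of the mean loss over all samples."""
    return X.T @ (X @ w - y) / len(y)

w0 = np.zeros(5)
w_hat = adamvrgo(grad_i, full_grad, w0, n=len(y), lr=0.01)
print(f"loss at start: {loss(w0):.4f}, after AdaMVRGO sketch: {loss(w_hat):.4f}")
```

At the first inner step the sampled terms cancel, so `g` equals the full gradient at the snapshot; the stochastic correction only grows as `x` drifts away from the snapshot, which is the source of the variance reduction.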

RESULTS AND DISCUSSION
The research questions addressed in this section are [68]:
• RQ1: Does the proposed framework perform better than the baseline algorithms?
• RQ2: Does the proposed optimizer perform well compared to the other existing optimizers?

Data description
This paper evaluated the proposed framework using three standard datasets: M-100K, M-1M, and M-10M, taken from the MovieLens website [11]. The M-100K dataset has 100,000 ratings from 943 users and 1682 movies [69,70]. The rating values range from 1 to 5. The information was gathered from 19th September 1997 to 22nd April 1998. This dataset contains basic demographic information about the user, like age, gender, occupation, etc. Every user in this collection has rated a minimum of 20 movies; users who had fewer than 20 ratings are not included in the dataset. Also, users whose demographic information is not present are excluded from this collection. The movies present in the dataset belong to any of the 19 genres: unknown, action, adventure, animation, children's, comedy, crime, documentary, drama, fantasy, film noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.
The M-1M dataset has 1,000,209 ratings for 3900 movies submitted by 6040 users, collected during the year 2000 [11,71]. Every user has rated at least 20 movies. The dataset consists of demographic information about the user, like gender, age, and occupation. The movies in the dataset fall under different genres like action, adventure, animation, children's, comedy, etc. [72].
The M-10M dataset has 10,000,054 ratings for 10,681 movies submitted by 71,567 users [11]. This dataset uses a 5-star scale with an increment of half a star. Unlike the previous datasets, no demographic information about the user is provided. The movies in the dataset fall under different genres like action, adventure, animation, children's, comedy, etc. [72].
Each user rates at least 20 movies [56]. Table 1 shows the statistical analysis of all three datasets.
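The degree of sparsity that motivates the framework follows directly from the counts quoted above. The short sketch below computes the fraction of empty cells in each rating matrix; the function name is illustrative, and the figures are derived only from the rating, user, and movie counts stated in this section.

```python
def sparsity(n_ratings, n_users, n_movies):
    """Fraction of the user-by-movie rating matrix that has no rating."""
    return 1.0 - n_ratings / (n_users * n_movies)

# (ratings, users, movies) as stated in the data description
datasets = {
    "M-100K": (100_000, 943, 1_682),
    "M-1M": (1_000_209, 6_040, 3_900),
    "M-10M": (10_000_054, 71_567, 10_681),
}

for name, (ratings, users, movies) in datasets.items():
    print(f"{name}: {100 * sparsity(ratings, users, movies):.2f}% of entries are unknown")
```

For M-100K, for example, the matrix has 943 × 1682 = 1,586,126 cells, of which 100,000 are filled, so roughly 93.7% of the ratings are unknown.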

Evaluation metrics
The metrics used to evaluate the proposed model are MSE, RMSE, and MAE. MSE is the mean squared error obtained from the difference between the predicted and actual ratings, and RMSE is the square root of MSE [11]. The formulae for MSE and RMSE are given in Eqs. (22) and (23).
MAE is the mean absolute difference between the predicted and actual ratings. The formula for MAE is shown in Eq. (24).
where $\hat{r}_i$ is the predicted rating, $r_i$ is the actual rating, and $N$ represents the number of observations. Smaller values of Eqs. (22), (23), and (24) mean higher accuracy.
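The three metrics follow directly from their verbal definitions above and can be written as a short NumPy sketch. The function names and the sample values are illustrative only.

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between actual and predicted ratings."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((actual - predicted) ** 2))

def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return float(np.sqrt(mse(actual, predicted)))

def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))

# Illustrative values: errors of 1, 0, and 2 stars on three ratings.
actual = [4, 3, 5]
predicted = [3, 3, 3]
print(mse(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

Because RMSE squares each error before averaging, it penalises the single 2-star miss more heavily than MAE does.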

Comparative analysis with the existing techniques (RQ1)
Using the M-1M and M-10M datasets, this section compares the proposed framework Deep CF-AdaMVRGO to six existing models [24][25][26][27][28][29] in terms of RMSE. The results showed that the proposed model performed better than the existing models under different hyperparameters [76], such as the number of epochs and the batch size. To test whether the improvement is statistically significant, a paired t-test with a 0.05 level of significance is used, as shown in Section 5.4.3. The quantitative comparison results are given in Tables 2 and 3. The existing models for comparison are:
• Multiviews NN [25] initially learned the low-dimensional user and item vectors separately. A feed-forward neural network (FFNN) is used to learn the nonlinear interactions between users and items for rating prediction [77].
• NCF [26] implemented a neural network-based matrix factorization model [78], and the loss was adjusted to a squared loss for rating prediction.
• SemRe-DCF [27] combines a rating matrix with movie plot text, and a denoising autoencoder is used for predicting the ratings.
• DPGMF [24] fused plot texts into ratings and proposed a deep-plot-aware generalised matrix factorization to predict missing ratings.
• DELCR [28] uses a DNN to train the user and item embeddings separately to make a latent factor model for collaborative filtering that is based on deep learning.
• DLFCF [29] uses a two-level deep learning model to learn the rating matrix's hidden features.
From Tables 2 and 3, it is evident that the proposed model gives a considerably lower RMSE than the existing methods on the M-1M and M-10M datasets. While experimenting, it was also noted that the proposed algorithm showed better results on the M-1M dataset with 100 epochs and a batch size of 128. Similarly, it showed better results on the M-10M dataset with 50 epochs and a batch size of 256.
The significant benefit of the proposed model over the existing models [24][25][26][27][28][29] is that it allows the model to make accurate predictions even for new movies that have limited or no information available. Additionally, the proposed framework outperforms the existing models in terms of reducing biases and variances in the ensemble, which results in more reliable and accurate predictions.

Comparative analysis with the existing optimizers (RQ2)
This section compares the proposed optimizer AdaMVRGO with existing optimizers (Adagrad, RMSProp, ADAM, and SVRG) on the ensemble framework, using the M-100K, M-1M, and M-10M datasets and the RMSE, MAE, and MSE metrics [11].
Analysis of the M-100K dataset: Using 10-fold cross-validation on the M-100K dataset, the proposed Deep CF-AdaMVRGO method is compared with Deep CF-Adagrad, Deep CF-RMSProp, Deep CF-Adam, and Deep CF-SVRG in terms of RMSE, MSE, and MAE [78,79]. Experiments were carried out with a learning rate of 0.001, a batch size of 256, 100 epochs, and latent factors of 8, 16, 32, and 64. Figure 3 shows the MAE, MSE, and RMSE of the compared optimizers for each number of latent factors. The total number of computations is expressed in floating point operations (FLOPs), and the proposed algorithm took fewer FLOPs to reach the optimum than the other algorithms, as shown graphically in Figure 4.
Analysis of the M-1M dataset: The M-1M dataset was evaluated under the same settings [78,79], i.e., a learning rate of 0.001, a batch size of 256, 100 epochs, and latent factors of 8, 16, 32, and 64. Figure 5 shows the results.
Analysis of the M-10M dataset: The M-10M dataset was likewise evaluated [78,79] with a learning rate of 0.001, a batch size of 256, 100 epochs, and latent factors of 8, 16, 32, and 64. Figure 7 shows the results. The total number of computations is expressed in FLOPs, and the proposed algorithm took fewer FLOPs to reach the optimum than the other algorithms, as shown graphically in Figure 8.
Figures 3, 5, and 7 show that the proposed optimizer, AdaMVRGO, outperformed the existing optimisation techniques on the ensemble framework in terms of all evaluation metrics [80]. From Figures 4, 6, and 8, it can be observed that AdaMVRGO converged significantly faster than the existing optimization techniques. This suggests that AdaMVRGO not only produces superior results in terms of accuracy metrics but also does so in a more efficient manner. These findings highlight the effectiveness of AdaMVRGO in improving the performance of the ensemble framework and make it a promising choice for future applications.

Statistical significance and discussions
To verify the results of the proposed model against the existing models, statistical analysis was performed on the experimental results using a paired t-test for the M-1M and M-10M datasets. The statistical analysis is performed at a 0.05 level of significance (95% confidence level), and the results are analysed in terms of the t-value. The improvement of the proposed model is considered significant when the p-value is less than the significance level [81]. From Table 4, we can conclude that the proposed model is significantly better than the existing models, with a low p-value at 9 degrees of freedom. These findings suggest that the proposed model is highly reliable and provides a significant improvement over the existing models in predicting the observed data. Moreover, the low p-value indicates that the results are unlikely to occur by chance alone, further validating the proposed model's robustness [75]. Overall, these findings offer compelling evidence of the proposed model's efficacy and demonstrate that it outperforms existing models in terms of statistical strength and accuracy [82].
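The paired t-test described above can be sketched with the Python standard library. Nine degrees of freedom correspond to ten paired observations, which would be consistent with ten cross-validation folds, although the text does not state how the pairs were formed. The RMSE lists below are toy placeholder numbers and are not results from this paper. The decision rule compares the t-value against 2.262, the two-tailed critical value of the t-distribution at the 0.05 level with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-tailed critical value, alpha = 0.05, 9 degrees of freedom

def paired_t(a, b):
    """Paired t statistic for two equally long lists of matched measurements."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Toy placeholder RMSE values for ten paired runs (NOT results from the paper).
baseline_rmse = [0.90 + 0.01 * i for i in range(10)]
proposed_rmse = [b - 0.02 - 0.001 * (i % 3) for i, b in enumerate(baseline_rmse)]

t_value = paired_t(baseline_rmse, proposed_rmse)
significant = abs(t_value) > T_CRIT_DF9
print(f"t = {t_value:.2f}, significant at 0.05 with 9 d.f.: {significant}")
```

A positive t-value here means the baseline RMSE is higher than the proposed model's RMSE on average across the pairs.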

CONCLUSION
Deep collaborative filtering has proven to be one of the best recommendation techniques over the other latent factor models to capture hidden, nonlinear, and complex relationships between users and movies.To handle the sparsity issue and enhance movie recommendation accuracy, we developed an ensemble framework called Deep CF-AdaMVRGO, which uses three sub-models: M1, M2, and M3.The first two models (M1 and M2) of the ensemble use one hidden layer and two hidden layers without dropout, and the third model, M3, uses three hidden layers and a dropout rate taken from the Bernoulli distribution to avoid overfitting.The output of each MLP is trained using a novel optimization technique called AdaMVRGO.Taking the average of the individual predictions made by the models M1, M2, and M3 yields the final rating.
Experiments were conducted using the M-100K, M-1M, and M-10M datasets in terms of evaluation metrics like MSE, RMSE, and MAE.Simulation results showed that the new optimizer AdaMVRGO had a low error value in recommending movies, compared to optimizers like Adagrad, RMSProp, ADAM, and SVRG on the proposed ensemble architecture.Also, the proposed optimizer achieved the optimum with fewer FLOPs.The experimental results showed that the Deep CF-AdaMVRGO performed well over the benchmark algorithms on both the M-1M and M-10M datasets in terms of the RMSE value.Experiments were done with different parameters, and the results showed that the proposed model is better than the existing models on the M-1M dataset when the number of epochs was 100 and the batch size was 128.Also, the proposed model showed a 2.040%, 13.006%, 3.432%, 0.609%, 0.244%, and 5.773% decrease in terms of RMSE value over the baseline models.In the same way, it did better on the M-10M dataset when the number of epochs was 50 and the batch size was 256.The proposed algorithm did better than the baseline models in terms of RMSE value by 1.933%, 15.35%, 1.552%, 0.522%, 1.679%, and 4.636%.
From the simulations, we infer that the ensemble-based model can generate more accurate recommendations than the individual methods. In this paper, the final prediction is the mean of the predictions obtained from the three MLPs, which may not always be the best choice. Future enhancements may include the use of stacking ensembles to construct models with low bias and variance; the addition of side information about the movies, such as genres, user reviews, and implicit feedback, to enhance recommendation accuracy; and the use of an attribute selection algorithm to improve movie recommendation performance even further.
Funding: This research received no external funding. This work was carried out as part of my doctoral committee.

Conflict of Interest:
The authors declare no conflicts of interest.

Figure 3. Performance of the Deep CF-AdaMVRGO algorithm using the M-100K dataset (A to D from top to bottom)

Figure 3 depicts the performance of the Deep CF-AdaMVRGO method over the other methods on the M-100K dataset with MAE, MSE, and RMSE for different latent factors. Panels 3(A), 3(B), 3(C), and 3(D) compare the evaluation metric values of the different optimizers when the number of latent factors is 8, 16, 32, and 64, respectively.

Figure 4. Comparison of MFLOPs using the M-100K dataset

Analysis of the M-1M dataset: Using 10-fold cross-validation on the M-1M dataset, the proposed Deep CF-AdaMVRGO method is compared with other methods, namely Deep CF-Adagrad, Deep CF-RMSProp, Deep CF-Adam, and Deep CF-SVRG, in terms of the RMSE, MSE, and MAE metrics [78,79]. Experiments were carried out with a learning rate of 0.001, a batch size of 256, 100 epochs, and latent factors of 8, 16, 32, and 64. Figure 5 depicts the performance of the Deep CF-AdaMVRGO method over the other methods on the M-1M dataset with MAE, MSE, and RMSE for different latent factors. Panels 5(A), 5(B), 5(C), and 5(D) compare the evaluation metric values of the different optimizers when the number of latent factors is 8, 16, 32, and 64, respectively. The total number of computations is expressed in FLOPs, and the proposed algorithm took fewer FLOPs to reach the optimum than the other algorithms, as shown graphically in Figure 6.
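The 10-fold cross-validation protocol over the latent-factor settings can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper `k_fold_indices`, the random seed, and the sample count of 100 are assumptions, while the learning rate, batch size, epoch count, and latent factors are the settings quoted in the text. Model training is left as a placeholder comment.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random but reproducible
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Settings quoted in the text for the M-1M experiments.
SETTINGS = {"learning_rate": 0.001, "batch_size": 256, "epochs": 100}
LATENT_FACTORS = [8, 16, 32, 64]

# Hypothetical number of rating records, kept small for illustration.
splits = list(k_fold_indices(n_samples=100, k=10))

for n_factors in LATENT_FACTORS:
    for train_idx, test_idx in splits:
        # Placeholder: train a model with n_factors latent factors on train_idx
        # and record MAE, MSE, and RMSE on test_idx.
        pass
```

Each rating record appears in exactly one test fold, so every metric is averaged over 10 disjoint held-out sets per latent-factor setting.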

Figure 5. Performance of the Deep CF-AdaMVRGO algorithm using M-1M dataset (A to D from top to bottom)

Figure 7. Performance of the Deep CF-AdaMVRGO algorithm using the M-10M dataset (A to D from top to bottom)

Figure 8. Comparison of MFLOPs using the M-10M dataset

Table 2. Quantitative comparison results in terms of RMSE using the M-1M dataset (a lower RMSE value is better)

Table 3. Quantitative comparison results in terms of RMSE using the M-10M dataset (a lower RMSE value is better)