A Novel Restricted Boltzmann Machine Training Algorithm with Fast Gibbs Sampling Policy

The restricted Boltzmann machine (RBM) is one of the widely used basic models in the field of deep learning. Although many indexes are available for evaluating RBM training algorithms, the classification accuracy is the most convincing index, as it most effectively reflects an algorithm's advantages. RBM training algorithms are essentially sampling algorithms based on Gibbs sampling, and improving their classification accuracy has been a persistent challenge. To address this problem, we propose a fast Gibbs sampling (FGS) algorithm that learns the RBM by adding accelerated weights and an adjustment coefficient. Based on Gibbs sampling theory, an important link was established between the update of the network weights and the mixing rate of the Gibbs sampling chain; the proposed FGS method accelerates this mixing rate through the added accelerated weights and adjustment coefficient. To validate the FGS method, numerous experiments were performed on standard datasets to facilitate comparisons with classical RBM algorithms. The results showed that the proposed FGS method outperformed the CD, PCD, PT5, PT10, and DGS algorithms, particularly on the handwriting database. These findings suggest the potential applications of FGS to real-world problems and demonstrate that the proposed method can build an improved RBM for classification.


Introduction
Deep learning (DL) is an important branch of machine learning that employs complex model structures or different nonlinear transformation methods to conduct high-dimensional abstract feature modeling of data [1]. DL structures such as the deep neural network, convolutional neural network, and generative adversarial network have been widely used in computer vision, automatic speech recognition, natural language processing, audio recognition, and bioinformatics [2]. The deep belief network (DBN) is a generative graphical model that can be regarded as a type of deep neural network [3] and performs classification tasks excellently [4]. DBNs are stacked from restricted Boltzmann machines (RBMs), and the whole network can be trained by a greedy layer-wise learning algorithm from the bottom to the top [5]. Therefore, the quality of RBM training directly affects the quality of the DBN.
The RBM is an important model employed in DBNs and is one of the widely used models of the Markov random field (MRF). It was proposed by Smolensky in 1986 based on the Boltzmann machine (BM) [6]. The RBM is a stochastic generative neural network that can learn a probability distribution from input datasets. RBMs are applied in dimension reduction [7], classification [8], collaborative filtering [9], feature learning [10], topic modeling [11], automatic radar target recognition [12], chip synthesis [13], and speech recognition [14]. RBMs can be trained by either supervised or unsupervised learning depending on the task. DL and RBMs have obvious advantages for massive unlabeled data, particularly in the context of big data. The contrastive divergence (CD) algorithm proposed by Hinton et al. in 2002 is one of the most popular standard algorithms for RBM training [15]. It uses a K-step sampling chain to approximate the target gradient.
The CD algorithm initializes the Gibbs chain samples from the training set samples. In general, one-step Gibbs sampling (sampling step K = 1) is sufficient to acquire impressive results. However, as the classification accuracy of the CD algorithm is not optimal, many refined learning algorithms for RBM training have been introduced, which adopt different sampling techniques and aim to reduce the bias in the gradient approximation [16]. The persistent contrastive divergence (PCD) algorithm proposed by Tieleman in 2008 is also based on Gibbs sampling; however, instead of using the training data to reinitialize the sampling chain in each iteration, PCD employs the results of the previous sampling for continuous operation [17]. To accelerate the PCD algorithm, the fast persistent contrastive divergence (FPCD) algorithm was proposed by Tieleman et al. in 2009 [18], with the introduction of additional fast parameters to improve the sampling speed. In 2010, Desjardins et al. [19] and Cho et al. [16] proposed the parallel tempering (PT) algorithm based on sampling technology. The PT algorithm was designed to overcome the limitations of traditional sampling algorithms on multimode target distributions. Experiments have shown that the PT algorithm can achieve better generative models for RBM training while effectively preventing possible divergence. The number of hidden units can also be changed for better performance [20]. In recent years, numerous improvements to RBM algorithms have been proposed. Li et al. proposed the dynamic Gibbs sampling (DGS) algorithm in 2016, which can effectively overcome the problem of sampling divergence and obtain a higher training accuracy at the cost of a short additional computation time [21].
At present, the training of RBMs is primarily based on Gibbs sampling, and effectively improving it has proved challenging, making it an active focus of algorithmic work. To address this problem, the fundamentals of RBMs and the classification accuracies of traditional algorithms are introduced in Section 2. In Section 3, a theoretical analysis is performed, and the important relationship between the update of the network weights and the mixing rate of the Gibbs sampling chain is determined based on Gibbs sampling theory; two important conclusions are deduced from the Gibbs sampling theorem. In Section 4, we first focus on the relationship between the weights of the classical algorithms and the mixing rate, based on Gibbs sampling theory, while attempting to improve the Gibbs mixing rate. The accelerated weights and adjustment coefficient are then added to the original weights, and the fast Gibbs sampling (FGS) algorithm is proposed, with the advantages of the improved algorithm proved theoretically. Finally, in Section 5, the results of numerous simulation experiments are analyzed in detail.
The experimental results indicate that the proposed fast Gibbs sampling algorithm achieves a good stationary distribution of the RBM model in a short time and accelerates the mixing rate of the Gibbs sampling chain in the sampling process, thus significantly improving the classification accuracy of the RBM. The results of this study provide theoretical guidance for improving RBM training algorithms in future studies.

RBM Model and Training Strategy. The RBM is a typical Markov random field model with a two-layer network structure. The network consists of visible units and hidden units, which correspond to visible variables (data samples) and hidden variables, respectively [22]. The binary RBM is a type of RBM wherein the variables of the visible and hidden layers are binary; in this paper, we introduce and employ the binary RBM as an example. The entire RBM network is a bipartite graph in which connections exist only between the visible and hidden units; there are no connections within the visible units or within the hidden units. Owing to this condition, the RBM has the following property: given the state of the visible layer neurons, the activations of the hidden layer neurons are conditionally independent, and vice versa. The structure of an RBM is shown in Figure 1.
The training process of RBMs [23] involves the following steps:

Step 1 (modeling). Modeling includes the energy function and the maximum probability hypothesis. RBMs are energy-based models, with maximum likelihood as the learning goal. The energy of the joint configuration of the visible variables $v$ and hidden variables $h$ is given by

$$E(v, h; \theta) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i}\sum_{j} v_i W_{ij} h_j,$$

where $\theta$ represents the parameters $\{W, a, b\}$. The joint probability of $v$ and $h$ can be obtained by

$$P(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z(\theta)}, \qquad Z(\theta) = \sum_{v, h} \exp(-E(v, h; \theta)).$$

Here, $Z(\theta)$ is called the partition function. The objective is to maximize the likelihood function $P(v)$, which can be calculated as the marginal distribution of $P(v, h)$:

$$P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp(-E(v, h; \theta)).$$

Step 2 (parameter estimation). The maximum likelihood estimation (MLE) method is used for parameter estimation: $P(v)$ is maximized to obtain the RBM parameters. To maximize $P(v)$, we maximize $\log P(v) = L(\theta)$.

Step 3 (extreme value solution). The gradient descent algorithm is mainly used to find the extreme value solution.
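Before elaborating on Step 3, the following minimal NumPy sketch illustrates the Step 1 definitions for a binary RBM, namely the energy function and the unnormalized joint probability. All names are illustrative, and the partition function is omitted because it sums over all $2^{|v|+|h|}$ configurations.

```python
import numpy as np

def energy(v, h, W, a, b):
    """Energy of a joint configuration (v, h) of a binary RBM:
    E(v, h) = -a^T v - b^T h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def unnormalized_prob(v, h, W, a, b):
    """exp(-E(v, h)); dividing by the partition function Z(theta),
    a sum over all 2^(|v|+|h|) joint configurations, gives P(v, h)."""
    return np.exp(-energy(v, h, W, a, b))
```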

First, L(θ) is maximized via stochastic gradient descent.
Subsequently, the derivative of $L(\theta)$ with respect to $W$ is estimated:

$$\frac{\partial L(\theta)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}.$$

The first half of the formula can be computed easily: the average values of $v_i$ and $h_j$ over the dataset are calculated. However, the latter half of the equation involves all $2^{|v|+|h|}$ combinations of $v$ and $h$, which is difficult to solve owing to the computational complexity; it is equal to

$$\langle v_i h_j \rangle_{\text{model}} = \sum_{v, h} P(v, h)\, v_i h_j.$$

As the analytical solution of the gradient cannot be computed, the existing training algorithms are mainly based on sampling. First, a Markov chain with stationary distribution $P(v)$ is constructed to obtain samples satisfying the distribution. The gradient is then approximated via Monte Carlo simulation:

$$\frac{\partial L(\theta)}{\partial W_{ij}} \approx v_i^{(0)} h_j^{(0)} - v_i^{(k)} h_j^{(k)},$$

where $v_i^{(0)}$ is the training sample value and $v_i^{(k)}$ is the sample obtained after $k$ Gibbs steps, which approximately satisfies the distribution $P(v)$. Finally, the parameter update equation is given by

$$W \leftarrow W + \lambda \left( \langle v h \rangle_{\text{data}} - \langle v h \rangle_{\text{model}} \right),$$

where $\lambda$ is the learning rate.

2.2. Classification Accuracy. Many indicators are available for evaluating the performance of RBM training algorithms, and different indicators are selected according to different requirements: indicators can represent (1) convergence error and convergence accuracy, based on the convergence characteristics of the Gibbs sampling chain; (2) computation time and algorithm efficiency, based on the algorithm complexity; and (3) classification accuracy, based on the actual application of the RBM. The convergence characteristics of the Gibbs sampling chain can represent the degree of convergence of the training algorithm to a certain extent; however, they can only assess the training algorithm qualitatively and are not an accurate indicator for comparing training algorithms [22]. The training duration is directly proportional to the complexity of the training algorithm: the simpler the training algorithm, the shorter the training duration. However, a short training time does not guarantee a high classification accuracy in the final application. Therefore, the training time can serve only as an auxiliary index for evaluating a training algorithm, and more convincing indicators are necessary. As a composite index, the algorithm efficiency depends on the negative value of the product of the algorithm running time and the algorithm training error [24]; decreasing either the running time or the training error improves the algorithm efficiency. The algorithm efficiency can also be used as an auxiliary indicator of the quality of the algorithm. The classification accuracy is the most intuitive and convincing indicator for evaluating an RBM training algorithm: the RBM trained by the algorithm is tested on an actual dataset, and the accuracy is the number of successfully classified samples divided by the total number of samples.
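Returning to the sampling-based update derived in Section 2.1, here is a minimal NumPy sketch of one CD-1 parameter update, using the standard RBM sigmoid conditionals (stated formally in Section 3). It is an illustration under these assumptions, not the authors' code, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step: approximate <v_i h_j>_data - <v_i h_j>_model
    with a single Gibbs transition started at the training sample v0."""
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visible layer, then recompute h.
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Monte Carlo gradient approximation and parameter update.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return W, a, b
```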
In this paper, four traditional RBM training algorithms were simulated using the standard MNIST handwritten dataset: CD, with classic single-Gibbs-chain sampling; PCD, which improves on CD by changing the initial value of each Gibbs chain; and PT (PT5 and PT10), which processes multiple Gibbs chains to handle multimode problems. The classification accuracy of the four algorithms is shown in Table 1.
As shown in Table 1, the CD algorithm has a relatively low classification accuracy, while that of PCD is higher by approximately 2%, leading to improved classification. However, the PT algorithm, which addresses multimode problems better, has approximately the same classification accuracy as the CD algorithm on the handwritten dataset. As the bias vectors of both the visible and hidden layers take very small values in the RBM network, their updates are negligible relative to the updates of the network weights; therefore, only the influence of the network weight W is studied when evaluating the network parameters. Figure 2 shows the updating of the network weights of the four traditional RBM algorithms. It clearly shows that the PCD algorithm exhibits a process of increasing and then attenuating weights after changing the initial value of each Gibbs sampling chain, finally maintaining a relatively large value. In contrast, the CD algorithm exhibits a relatively small change in the weight, which keeps increasing until the end of the algorithm. In this paper, any change in the weights generally refers to the absolute change in the weights. The PT algorithm shows a short increase in the weight before it decays and converges to a smaller value. As RBM training algorithms are essentially based on Gibbs sampling, the convergence of the Gibbs sampling chain, i.e., the sample mixing rate of the Gibbs sampling chain, is the primary factor affecting an RBM training algorithm. Based on the classification accuracies of each algorithm presented in Table 1 and the weight update trends shown in Figure 2, we primarily investigated the relationship between the update of the RBM network parameters and the mixing rate of the Gibbs sampling chain, to provide theoretical support for improving the classification accuracy of the RBM training algorithm.

Convergence Theorem for Gibbs Sampling
It is well established that the classification accuracy index can directly reveal the performance of an RBM training algorithm. The performance of an RBM training algorithm based on Gibbs chain sampling depends on the sampling property of the Gibbs chain, i.e., the mixing rate of the Gibbs sampling chain. In this paper, we analyze the relationship between the update of the network weights and the mixing rate of the Gibbs sampling chain. This indirectly analyzes the relationship between the update of the RBM network weights and the performance of the RBM training algorithm, as reflected in the final classification accuracy.

Theorem 1.
The mixing rate of the Gibbs sampling chain gradually decreases as the randomness of the transfer operator decreases [25].
The RBM network represents a typical two-layer structure with no connections within a layer and full connections between the layers. Owing to this special network structure, the conditional independence within the layers of the RBM implies that, when the input layer data are provided, the output layer node values satisfy the conditional probability

$$P\big(h_j = 1 \mid v\big) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big),$$

and correspondingly, when the output layer data are determined, the conditional probability of the input layer node values is

$$P\big(v_i = 1 \mid h\big) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big),$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. The RBM network updates its parameters via Gibbs iteration of the above two equations. As the transfer operator of Gibbs sampling, the randomness of the sigmoid function affects the mixing rate of the entire Gibbs sampling chain. To study the randomness of the transfer operator, sigmoid curves under different weights are shown in Figure 3. In this paper, any change in the weight generally refers to the absolute change in the weight. The sigmoid curves become steeper as the network weight increases. The following two conclusions can be drawn:

Corollary 1. When the variation range of X approaches infinity, the sigmoid function gradually becomes less random with an increase in the weights.
Derivation: the sigmoid function diagram shows that when the variation range of X is large and approaches infinity, a larger weight results in a steeper curve. If the curve is sampled in this interval, the value transferred by the transfer operator will be either 0 or 1 in most of the sampling area, and the modifiable region, i.e., the adjustable region of the curve in the plot, is very small. Thus, there is very little change in state before and after sampling, indicating that the randomness of the transfer operator is very small.

Corollary 2.
When the variation range of X approaches zero, the sigmoid function gradually becomes increasingly random with an increase in the weight. Derivation: the schematic of the sigmoid function shows that when the variation range of X is small and approaches 0, a smaller weight results in a flatter curve; the variable sampling area is then very small, which results in a very small change in state before and after sampling. The curve becomes increasingly steeper as the absolute value of the weight increases. Therefore, the variation range of the transition after sampling is larger, and there is a larger change in state before and after sampling, indicating that the randomness of the transfer operator is large.
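To make the two corollaries concrete, the short NumPy sketch below (illustrative, not from the paper) shows both regimes: far from zero, the transfer probability saturates toward 0 or 1 as the weight grows (Corollary 1), whereas near zero, the slope of the sigmoid, and hence the randomness of a sampled transition, grows with the weight (Corollary 2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for w in (0.5, 2.0, 10.0):
    # Corollary 1 regime: for inputs far from zero, a larger weight
    # saturates the transfer probability, leaving little room to flip.
    p_far = sigmoid(w * 3.0)
    # Corollary 2 regime: near zero the curve steepens with the weight;
    # d/dx sigmoid(w*x) at x = 0 is w * 0.25, so transitions become more random.
    slope_at_zero = w * 0.25
    print(f"w = {w:4.1f}: P(transfer | x = 3) = {p_far:.3f}, "
          f"slope at x = 0 = {slope_at_zero:.2f}")
```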

Analysis Based on Gibbs Sampling Theory.
In this section, we analyze the classical Gibbs sampling algorithm based on Gibbs sampling theory to identify the relationship between the mixing rate of the Gibbs sampling chain and the network weights. Unless otherwise specified, all weights referred to in this paper are the mean values of the network weights during each iteration. Figure 2 shows that the absolute value of the weight of the CD algorithm increases with the number of iterations. Weights corresponding to equally spaced iterations were selected, and the sigmoid function was drawn for the different weight parameters, as shown in Figure 4. To clearly show the transfer randomness of the sigmoid function under different weights, the image in Figure 4 was enlarged, as shown in Figure 5. Figure 5 and Corollary 2 indicate that the randomness of the sigmoid function continually increases as the CD algorithm trains the RBM, and the randomness of the transfer operator corresponds to the mixing rate of the Gibbs chain. However, at the end of training (i.e., when the number of iterations n = 1000), the transfer probability of the sigmoid function is only ∼0.72. Considering the classification accuracy of the CD algorithm presented in Table 1 (Section 2.2), the following conclusion can be drawn: although the absolute value of the weight of the CD algorithm increases continuously during network training, and the randomness of the sigmoid transfer operator increases with it, the maximum transfer probability of the transfer operator is only ∼0.72, which limits the mixing rate of the Gibbs sampling chain; consequently, the CD algorithm achieves a classification accuracy of only ∼85%.

Gibbs Chain Mixing Rate Analysis of the PCD Algorithm. The PCD algorithm changes the initial value of the Gibbs chain for each sample, based on the CD algorithm. Figure 2 shows that the weight behavior of the PCD algorithm differs significantly from that of the CD algorithm: the absolute value of the weight initially increases and then slowly decreases until the end. Weights corresponding to equally spaced iterations were selected, and the corresponding sigmoid function was plotted for the different weights, as shown in Figure 6. A part of Figure 6 was enlarged to clearly show the transition probability of the sigmoid function under different weights (Figure 7). Comparing the sigmoid transfer operators corresponding to different weights and iterations (Figure 7) reveals that the randomness of the transfer operator of the PCD algorithm increases rapidly during training, with the transfer probability rising rapidly from 0.79 to 0.94; the randomness then slowly decreases and remains at ∼0.92. Combined with the classification accuracy of the PCD algorithm (Table 1, Section 2.2), the following conclusion can be drawn: compared with the CD algorithm, the absolute value of the weight of the PCD algorithm increases rapidly at the beginning of training, then slowly decreases and remains stable in the later stages, with a correspondingly rapid initial increase in the transition probability that remains at ∼0.92 in the later stage. This indicates that the transfer operator of the PCD algorithm is more random than that of the CD algorithm, and the mixing rate of its Gibbs chain is higher. Therefore, the classification accuracy of the PCD algorithm (87.46%) is considerably higher than that of the CD algorithm (85.80%).

Gibbs Chain Mixing Rate Analysis of the PT Algorithm.
In addition to the single-chain Gibbs sampling algorithms, there are typical multichain Gibbs sampling algorithms, such as PT5 and PT10; here, the PT5 algorithm is introduced as an example. The change in the weight of a multichain Gibbs sampling algorithm has an important impact on the Gibbs mixing rate. Figure 2 shows that the absolute weights of the PT algorithms (PT5 and PT10) increase slowly with the number of iterations and then slowly decrease. Weights corresponding to equally spaced iterations were selected, and the sigmoid function under the corresponding weights was plotted, as shown in Figure 8. A part of Figure 8 was enlarged to clearly show the transition probability of the sigmoid function under different weights (Figure 9). Based on Figure 9 and Corollary 2, the randomness of the sigmoid function of the PT algorithm first increases and subsequently decreases; correspondingly, the mixing rate of the Gibbs chain first increases and then decreases. At the end of training, the transfer probability of the sigmoid function is stable at ∼0.34. Considering the classification accuracy of PT (Table 1, Section 2.2), the following conclusions can be drawn: although the PT algorithm can address multimode problems better than the CD algorithm, the randomness of its sigmoid transfer operator remains low because its weight changes little during network training, which strongly limits the mixing rate of the Gibbs sampling chain. According to Theorem 1, the classification accuracy of the PT5 algorithm should therefore be considerably lower than that of the CD algorithm (85.80%). However, as the PT algorithm exchanges samples between tempered chains to improve the mixing rate of the multichain Gibbs sampler [17], the classification accuracy of the PT5 algorithm is ∼84.10%, only slightly lower than that of the CD algorithm.

From the above analysis, it can be seen that the core factor of an RBM training algorithm based on Gibbs sampling is the convergence property of the Gibbs sampling chain, i.e., the sample mixing rate, and that the change in the network weights is an important factor affecting this mixing rate. From Corollary 2, the mixing rate of the PCD algorithm improves owing to the increase in its network weights; however, this improvement relies only on the weight increase described by Corollary 2, and the transfer probability of its transfer operator does not reach 1. In this paper, the aim was to further improve the mixing rate of the algorithm through Corollary 1, to improve the classification accuracy.

Accelerated Weight and Adjustment Coefficient.
Based on the theoretical deductions of the above analysis, the classical algorithm can be improved by improving the weight update in accordance with Corollary 2 of the Gibbs sampling convergence theorem. Although the PCD algorithm significantly improves the mixing rate of the Gibbs sampling chain and improves the classification accuracy by 2%, we found that the transfer probability of its transfer operator was only 0.94. Therefore, our aim was to modify the weight update of the classical CD algorithm to improve the mixing rate of the Gibbs sampler. The weight update should satisfy the following two criteria: (1) The weights must increase rapidly in the initial stage. According to Corollary 2, a rapid increase in the weights at the beginning is beneficial for increasing the randomness of the transfer operator, leading to a rapid increase in the mixing rate of the Gibbs chain as the weights are updated. (2) The weights should show a decreasing trend in the mid and late stages of training. As the weight increases, the corresponding X interval of the transfer operator increases accordingly. According to Corollary 1, once the transfer probability of the transfer operator reaches 1, the adjustable domain of the transfer operator continually decreases if the absolute value of the weight keeps increasing (Figure 3), which implies that the mixing rate of the Gibbs chain decreases. Therefore, if the absolute value of the weight is reduced after the transition probability of the transfer operator reaches 1, the adjustable domain of the transfer operator will increase again. This further improves the mixing rate of the Gibbs chain, thereby improving the performance of the algorithm.
Based on the above analysis, this paper proposes a fast Gibbs sampling algorithm that introduces an accelerated weight fastW and an adjustment coefficient ξ. The accelerated weight is updated in the same way as the traditional weights: its update also depends on the difference between the expectation over the data and the expectation over the model. However, the weight of the entire network is the sum of the accelerated and traditional weights, which effectively ensures that the weight update increases rapidly in the early stage of training. The adjustment coefficient added in this study effectively changes the accelerated weight update rate; it can be varied in the range 0 to 1, which effectively decreases the trend of accelerated weight updating in the mid and later stages of training. By introducing the accelerated weight and adjustment coefficient, the proposed FGS algorithm improves the transition probability of the transfer operator and the mixing rate of the Gibbs chain, thus improving the classification accuracy. The equations used for updating the weights are given below.
Similar to the traditional RBM training algorithm, the input data $v$ are sampled to obtain the hidden layer data. Here, $v^{+} = v^{-} = v$, both representing the input data, and the hidden layer is sampled using the summed weights:

$$P\big(h_j^{+} = 1 \mid v^{+}\big) = \sigma\Big(b_j + \sum_i v_i^{+}\,\big(W_{ij} + \mathrm{fastW}_{ij}\big)\Big).$$

After calculating the hidden layer data, the positive and negative gradients are updated, and the new input layer data are calculated from the hidden layer data; the weights used for sampling are the sum of the traditional and accelerated weights, $W + \mathrm{fastW}$. The gradient for the weight updates is composed of two parts, the traditional weight update and the accelerated weight update:

$$\Delta W_{ij} = \big\langle v_i^{+} h_j^{+} \big\rangle - \big\langle v_i^{-} h_j^{-} \big\rangle, \qquad W_{ij} \leftarrow W_{ij} + \lambda\,\Delta W_{ij}, \qquad \mathrm{fastW}_{ij} \leftarrow \xi\,\mathrm{fastW}_{ij} + \lambda\,\Delta W_{ij}.$$

The accelerated weight update is controlled by the adjustment coefficient $\xi$, through which its trend can be effectively adjusted.
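A minimal NumPy sketch of this update is given below. The decay form fastW ← ξ·fastW + λ·ΔW is our reading of the stated properties (rapid early growth and a decaying accelerated contribution for 0 < ξ < 1), not the authors' verbatim equation, and all names are illustrative.

```python
import numpy as np

def fgs_weight_update(v_pos, h_pos, v_neg, h_neg, W, fastW, lr=0.1, xi=0.9):
    """One FGS weight update (a sketch, not the authors' exact code).
    The effective weight driving Gibbs sampling is W + fastW; the gradient
    is the usual difference between data and model expectations; fastW is
    decayed by the adjustment coefficient xi so that its contribution
    shrinks in the mid and late stages of training."""
    grad = np.outer(v_pos, h_pos) - np.outer(v_neg, h_neg)
    W = W + lr * grad                # traditional weight update
    fastW = xi * fastW + lr * grad   # accelerated weight update (assumed form)
    return W, fastW, W + fastW       # W + fastW drives the next Gibbs step
```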
Based on the above equations and algorithm description, the pseudocode of the FGS algorithm proposed in this paper is provided in Algorithm 1.

Simulations and Analysis
The simulations and analysis were performed in two steps. The first step involved the design of the experiment, which includes the design details of each group of experiments. The second step involved the display and analysis of the simulation results.

Experimental Design.
The experimental design mainly involves the following three steps: (1) details of each dataset used in the experiments, (2) the setup of the RBM network structure and parameters in the simulation experiments, and (3) the design of the comparative experiment.

Network Structure and Parameters.
The RBM network structure designed in this study and the initialization of each network parameter for the different datasets are shown in Tables 2-6. Here, Str represents the RBM structure for each dataset, a is the bias of the visible layer, b is the bias of the hidden layer, λ is the learning rate, W represents the traditional weight, and fastW represents the proposed fast weight.

Design of the Comparative Experiment.
To comprehensively compare the FGS algorithm with other traditional RBM training algorithms, the CD and PCD algorithms were selected owing to their better sampling performance. We also compared FGS with the PT5 and PT10 algorithms with respect to the multimode distribution problem. The experimental parameters of the six algorithms are shown in Table 7. Here, M represents the number of iterations of the PT algorithm, K is the number of Gibbs sampling steps, Batch represents the training sample size of a single batch, and Iter represents the number of network iterations.
(1) Experiment 1: Mixing Rate of the Gibbs Chain. As the focus of this paper is improving the RBM training algorithm based on Gibbs chain sampling, the convergence property of the Gibbs sampling chain is a crucial indicator of the improved performance of the algorithm. The convergence property of the Gibbs sampling chain depends on its mixing rate, which can be measured by the randomness of the transfer operator according to Theorem 1. Here, MNIST was used as the test dataset, and the randomness of the transfer operator of the FGS algorithm was evaluated. The experimental results and analysis are presented in Section 5.2.1.
(2) Experiment 2: Effect of Sampling. Considering the MNIST dataset as an example, each algorithm based on Gibbs sampling was used to reconstruct the data after sampling [22]. The reconstructed graphs of each training algorithm were saved and compared with the original images to intuitively observe the performance of each algorithm. The experimental results and analysis are presented in Section 5.2.2.

(3) Experiment 3: Classification Accuracy.
A key assessment of the effectiveness of an RBM training algorithm involves the comparison of the classification accuracy of the network. Considering the MNIST dataset as an example, a softmax classifier (Figure 15) was added to the output layer of the trained RBM network, and the data of the hidden layer of the RBM network were used as the input to train the classifier. The output of the top layer ranged from 0 to 9, covering ten categories. The following process was adopted: following the input of a new set of binary data, the corresponding binary output was obtained via the trained network parameters; these binary outputs, obtained through the RBM, served as the inputs of the classifier, and the classifier was trained in accordance with the labels of each set of data. Subsequently, the test data were input to the network and the classification accuracy was determined. Softmax is a generalization of the logistic regression model (a minimal code sketch of this classifier setup is given after the experiment list). The experimental results and analysis are presented in Section 5.2.3.

(Algorithm 1: FGS training pseudocode. Only fragments survive extraction: Input: RBM(v, h) and training batch S; Output: gradient approximations ΔW, Δa_i, Δb_j for i = 1, ..., m, j = 1, ..., n. ΔW, Δa_i, and Δb_j are initialized to 0, W is initialized to small regular values, fastW is initialized to all zeros, and the hidden and visible layers are then sampled in turn for each training batch while the gradients are accumulated.)

(Dataset figure captions: one grayscale dataset contains 19,440 training samples and 24,300 test samples [27], wherein each sample is a 32 × 32 pixel image. Figure 12: the CIFAR10 dataset, which contains color maps with 32 × 32 × 3 pixels; we used 10,000 training samples and 10,000 test samples [28]. Figure 13: the CIFAR100 dataset, which contains color maps with 32 × 32 × 3 pixels; the CIFAR100 data used in this paper consist of 10,000 training samples and 10,000 test samples [28].)

(5) Experiment 5: Adjustment Coefficient. In this paper, the adjustment coefficient ξ was added to the designed FGS algorithm to adjust the fast weight. The adjustment coefficient was varied to analyze its effect on the whole network, with MNIST as the example dataset. The adjustment coefficient values are shown in Table 8, and the corresponding experimental results and analysis are detailed in Section 5.2.5.
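Returning to the classification setup of Experiment 3, the following minimal NumPy sketch trains a ten-class softmax classifier on RBM hidden-layer activations. It illustrates the described setup rather than the authors' code, and all names (train_softmax, H, Wc) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(H, labels, n_classes=10, lr=0.1, epochs=100):
    """Train a softmax classifier on RBM hidden activations H
    (shape: n_samples x n_hidden) with integer labels 0..9."""
    n, d = H.shape
    Wc = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[labels]          # one-hot targets
    for _ in range(epochs):
        P = softmax(H @ Wc)
        Wc += lr * (H.T @ (Y - P)) / n     # gradient ascent on log-likelihood
    return Wc

def accuracy(H, labels, Wc):
    """Classification accuracy: correctly classified / total samples."""
    return float(np.mean((H @ Wc).argmax(axis=1) == labels))
```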

Experimental Results and Analysis.
Following the comparative experiments designed in Section 5.1.3, this section presents the corresponding experimental results and analysis. Figure 16 shows the network weights of the proposed FGS algorithm. The results show that the fast weight and adjustment coefficient added in this study behave as analyzed in Section 4: in the early stage of training, the absolute value of the weight of the FGS algorithm increases rapidly, while it decreases gradually in the mid to later stages of training. Weight parameters corresponding to equally spaced iterations were selected, with the corresponding sigmoid function shown in Figure 17. To clearly show the transition probability of the sigmoid function, a part of the image in Figure 17 was enlarged, as shown in Figure 18. Figure 17 shows that, in the initial stage of network training, the absolute value of the weight increased rapidly, which caused the transition probability of the transfer operator to increase rapidly from 0.8 at 200 iterations to 1 at 400 iterations. According to Corollary 1 of the convergence theorem for Gibbs sampling, the randomness of the transfer operator would tend to decrease if the absolute value of the weight continued to increase in this interval. However, by comprehensively considering the relationship between the weight and the randomness of the transfer operator, the proposed FGS algorithm solves this problem by adding the adjustment coefficient of the fast weight. Figure 16 shows that the absolute weight of the FGS algorithm decreased in the mid to later stages of training, consistent with Corollary 1 of the Gibbs sampling convergence theorem and with Figure 17. In the mid and later stages of training, although the transfer probability of the transfer operator reached its highest value of 1, the curve of the transfer operator became relatively flat as the absolute value of the weight decreased, and its adjustable domain increased. Therefore, the randomness of the transfer operator continued to increase between 400 and 1000 iterations. Thus, the FGS algorithm has a significantly higher Gibbs mixing rate than the CD algorithm.

Effect of Sampling.
A more intuitive criterion for evaluating the quality of RBM training is the image reconstructed by the sampling algorithm. Figure 19 shows the original images and the images reconstructed by each of the algorithms.
The figures show that the network images reconstructed using the proposed FGS algorithm exhibited less noise and higher reconstruction precision, indicating a better training effect for FGS compared with the other algorithms.

Classification Accuracy.
One of the most important criteria for assessing the effectiveness of RBM training is the classification accuracy with respect to the input dataset. Table 9 shows a comparison between the classification accuracies of the six algorithms for the five datasets.
(1) Analysis of the Classification Accuracy. The proposed FGS algorithm performed very well on the five commonly used datasets, namely, MNIST, MNORB, CIFAR10, CIFAR100, and USPS. Using the same RBM network structure and experimental parameters, FGS achieved the highest classification accuracy of 90.02% on the MNIST dataset, demonstrating that the improved algorithm can accurately identify the handwritten dataset. Second, the traditional RBM training algorithms did not effectively recognize the CIFAR10 and CIFAR100 datasets, which contain color images, as feature extraction is relatively difficult owing to the large number of pixels in these images; the improved FGS algorithm achieved good recognition accuracy for these two color datasets. Subsequently, the classification accuracy was compared for the grayscale images in the MNORB dataset, where the FGS algorithm achieved the best classification results under the same experimental conditions. Finally, all algorithms were tested on the USPS dataset, on which the proposed FGS again achieved the highest classification accuracy. In summary, under the same experimental conditions and RBM network structure, the designed FGS algorithm achieved an improved classification accuracy.

(2) Analysis of the Training Time Comparisons. Two graphs were constructed to compare the training times for each dataset. The plots on the left compare the training times of all six algorithms for each dataset; it can be clearly seen that the training time of the PT5 and PT10 algorithms was longer than that of the other four algorithms, because the PT algorithms have multiple temperature chains whose complex and frequent interactions increase the time. Therefore, a comparison between the training times of the remaining four algorithms is shown in the figures on the right. Based on these figures and Table 10, the following conclusions can be drawn: under the same experimental conditions, the CD algorithm can be trained in the shortest time, as it is the simplest implementation of Gibbs sampling. As the PCD algorithm only changes the initial value of each iteration of Gibbs sampling, it mostly requires a similar training time to CD. Although the FGS algorithm adds the accelerated weight and adjustment coefficient to the CD algorithm, it is significantly less complex than the multichain PT algorithm; accordingly, the FGS algorithm has a shorter training time than the PT5 and PT10 algorithms. The DGS algorithm is a phased dynamic Gibbs sampling algorithm: as training progresses, the number of Gibbs sampling steps increases to ten, and the multiple iterations of a single sampling chain improve the recognition accuracy at the cost of increased time. FGS required less time than DGS in most cases, despite being approximately 0.3 s slower on the USPS dataset. Based on the results presented in Section 5.2.3, the accuracy of the FGS algorithm was considerably higher than that of the CD algorithm. This indicates that the improved FGS algorithm can achieve a significant improvement in accuracy relative to the CD algorithm at the cost of a small amount of additional time, demonstrating a significant improvement in the mixing rate of Gibbs sampling.

Adjustment Coefficient. The designed FGS algorithm includes the adjustment coefficient ξ applied to the fast weights. We conducted a comparative experiment on the MNIST handwritten dataset to investigate the influence of the adjustment coefficient on training the RBM network. Based on the experimental parameters presented in Table 6, different values were selected over different ranges of the adjustment coefficient, with the reconstruction error chosen as the evaluation index. Figure 36 shows a comparison between the reconstruction errors of each group of experiments, with the corresponding recognition accuracies shown in Table 11. The reconstruction error is the difference between the original data and the training samples after one Gibbs transfer through the RBM distribution [22]. Assuming that the minibatch size is batchsize, $v_1$ is the initial state, and $v_2$ is the reconstructed state after one sampling step, the reconstruction error is calculated as

$$RE = \frac{1}{\text{batchsize}} \sum_{s=1}^{\text{batchsize}} \big\| v_1^{(s)} - v_2^{(s)} \big\|^2$$

(a short code sketch of this computation is given at the end of this subsection). Figure 36 shows that, for all of the selected values, the reconstruction error gradually converges with the number of iterations, indicating that the choice of the adjustment coefficient does not affect the convergence of the entire network. Table 11 shows that the classification accuracy of the entire algorithm is relatively high when the adjustment coefficient is approximately 0.9, indicating that the mixing rate of Gibbs sampling is then highest for the whole network. Based on the data in Tables 8 and 11, it can be seen that the selection of the adjustment coefficient affects the accuracy of the algorithm; the recognition accuracy of the whole network is relatively high when the adjustment coefficient is in the range 0.9-0.95.
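As referenced above, a sketch of the reconstruction-error computation, under the assumed squared-difference form (the exact norm is not recoverable from the text):

```python
import numpy as np

def reconstruction_error(v1, v2):
    """Reconstruction error between the initial minibatch v1 and its
    reconstruction v2 after one Gibbs transition (both batchsize x n_visible).
    Assumed form: summed squared difference averaged over the minibatch."""
    return float(np.sum((v1 - v2) ** 2) / v1.shape[0])
```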

Conclusion
The aim of this study was to improve the classification accuracy of the RBM training algorithm through an analysis based on Gibbs sampling theory. The important relationship between the update of the network weights and the mixing rate of the Gibbs sampling chain was established, resulting in two important corollaries of the convergence theorem of Gibbs sampling. Experimental analysis revealed that the mixing rate of the Gibbs chain of classical algorithms can be improved. In this study, the accelerated weights and adjustment coefficient of the fast Gibbs sampler were introduced to improve the update of the training weights. The proposed FGS algorithm improved the mixing rate of the Gibbs sampling chain while controlling the training time, and achieved a higher classification accuracy than the classic RBM training algorithms. Five control trials were simulated using five experimental datasets (MNIST, CIFAR10, CIFAR100, MNORB, and USPS) and assessed based on the mixing rate of the Gibbs chain, sampling comparison, classification accuracy, training duration, and hyperparameters. The experimental results showed that the proposed FGS algorithm has good sampling convergence properties: FGS can reduce the training time, achieve a higher classification accuracy than the classical algorithms (CD, PCD, PT5, PT10, and DGS), and improve the mixing rate of the Gibbs sampling chain. Furthermore, based on the experimental results, it can be concluded that the proposed FGS algorithm has a certain universality.

Conflicts of Interest
The authors have no conflicts of interest to declare.