1 Introduction

Generalization is one of the most important goals in statistical machine learning [1]. In many standard machine learning techniques, given a particular data set, we fit a probabilistic learning model to the empirical distribution (or data distribution) of the data set. When the learning model is sufficiently flexible, an appropriate learning method can fit it to the empirical distribution exactly. However, a learning model that is too close to the empirical distribution frequently gives poor results for new data points. This situation is known as over-fitting. Over-fitting impedes generalization; therefore, techniques that suppress over-fitting are needed to achieve good generalization. Regularization techniques, such as \(L_1\) and \(L_2\) regularization or their combination (the elastic net) [18], are popular for this purpose.

Here, we focus on a restricted Boltzmann machine (RBM) [5, 15]. RBMs have a wide range of applications such as collaborative filtering [14], classification [8], and deep learning [6, 12, 13]. The suppression of over-fitting is also important in RBMs. An RBM is a probabilistic neural network defined on a bipartite undirected graph comprising two different layers: the visible layer and the hidden layer. The visible layer, which consists of visible (random) variables, directly corresponds to the data points, while the hidden layer, which consists of hidden (random) variables, does not. The hidden layer creates complex correlations among the visible variables. The sample space of the visible variables is determined by the range of data elements, whereas the sample space of the hidden variables can be set freely. Typically, the hidden variables are given binary values (\(\{0,1\}\) or \(\{-1,+1\}\)).

In this study, we propose an RBM with multivalued hidden variables. The proposed RBM is a very simple extension of the conventional RBM with binary hidden variables (referred to as the binary-RBM in this paper). However, we demonstrate that the proposed RBM is better than the binary-RBM in terms of suppressing over-fitting. The remainder of this paper is organized as follows. We define the proposed RBM in Sect. 2 and explain its maximum likelihood estimation in Sect. 2.1. In Sect. 2.2, we demonstrate the validity of the proposed RBM using numerical experiments for contrastive divergence (CD) learning [5] with artificial data. We provide insight into the effect of our extension (i.e., the effect of multivalued hidden variables) using a toy example in Sect. 2.3. In Sect. 3, we apply the proposed RBM to a classification problem and show that it is also effective for this type of problem. Finally, the conclusion is given in Sect. 4.

2 Restricted Boltzmann Machine with Multivalued Hidden Variables

Fig. 1
figure 1

Bipartite graph consisting of two layers: the visible layer and the hidden layer. V and H are the sets of indices of the nodes in the visible layer and hidden layer, respectively

Let us consider a bipartite graph consisting of two different layers, the visible layer and the hidden layer, as shown in Fig. 1. Binary (or bipolar) visible variables, \(\varvec{v} {:}{=} \{v_i \in \{-1,+1\} \mid i \in V\}\), are assigned to the nodes in the visible layer. Similarly, hidden variables, \(\varvec{h} {:}{=} \{h_j \in \mathcal {X} (s) \mid j \in H\}\), are assigned to the nodes in the hidden layer, where \(\mathcal {X} (s)\) is the sample space defined by

$$\begin{aligned} \mathcal {X} (s){:}{=} \big \{ (2k - s)/s \mid k= 0,1,\ldots ,s \big \} \quad s \in \mathbb {N}, \end{aligned}$$
(1)

where \(\mathbb {N}\) is the set of all natural numbers. For example, \(\mathcal {X} (1) = \{-1,+1\}\), \(\mathcal {X} (2) = \{-1,0,+1\}\), and \(\mathcal {X} (3) = \{-1,-1/3, +1/3,+1\}\). Namely, \(\mathcal {X} (s)\) is the set of \(s+1\) points that evenly partition the interval \([-1,+1]\) into s parts. We define that in the limit of \(s \rightarrow \infty\), \(\mathcal {X} (s)\) becomes the continuous space \([-1, +1]\), i.e., \(\mathcal {X} (\infty ) = [-1,+1]\). On the bipartite graph, we define the energy function for \(s \in \mathbb {N}\) as
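As a concrete illustration, the sample space in Eq. (1) can be enumerated with a few lines of Python (the function name `sample_space` is ours):

```python
def sample_space(s):
    """X(s) of Eq. (1): the s + 1 points (2k - s)/s for k = 0, 1, ..., s."""
    return [(2 * k - s) / s for k in range(s + 1)]

print(sample_space(1))  # [-1.0, 1.0]
print(sample_space(2))  # [-1.0, 0.0, 1.0]
```

For large s, these points densely fill \([-1,+1]\), consistent with \(\mathcal {X} (\infty ) = [-1,+1]\).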

$$\begin{aligned} E_s(\varvec{v},\varvec{h} ; \theta ){:}{=} -\sum _{i \in V}b_i v_i-\sum _{j \in H}c_j h_j -\sum _{i \in V}\sum _{j \in H}w_{i,j}v_i h_j, \end{aligned}$$
(2)

where \(\{b_i\}\), \(\{c_j\}\), and \(\{w_{i,j}\}\) are the learning parameters of the energy function, and they are collectively denoted by \(\theta\). Specifically, \(\{b_i\}\) and \(\{c_j\}\) are the biases for the visible and hidden variables, respectively, and \(\{w_{i,j}\}\) are the couplings between the visible and hidden variables. Our RBM is defined in the form of a Boltzmann distribution in terms of the energy function given in Eq. (2):

$$\begin{aligned} P_s(\varvec{v}, \varvec{h} \mid \theta ){:}{=}\frac{\omega (s)}{Z_s(\theta )} \exp \big (-E_s(\varvec{v},\varvec{h} ; \theta )\big ), \end{aligned}$$
(3)

where

$$\begin{aligned} Z_s(\theta ){:}{=}\sum _{\varvec{v} \in \{-1,+1\}^{|V|}}\sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}} \omega (s)\exp \big (-E_s(\varvec{v},\varvec{h} ; \theta )\big ) \end{aligned}$$
(4)

is the partition function. The multiple summations in Eq. (4) mean

$$\begin{aligned} \sum _{\varvec{v} \in \{-1,+1\}^{|V|}}&= \sum _{v_1 \in \{-1,+1\}}\sum _{v_2 \in \{-1,+1\}}\cdots \sum _{v_{|V|} \in \{-1,+1\}}, \\ \sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}}&= \sum _{h_1 \in \mathcal {X} (s)}\sum _{h_2 \in \mathcal {X} (s)}\cdots \sum _{h_{|H|} \in \mathcal {X} (s)}. \end{aligned}$$

The factor \(\omega (s){:}{=} \{2/(s + 1)\}^{|H|}\) appearing in Eqs. (3) and (4) is a constant unrelated to \(\varvec{v}\) and \(\varvec{h}\). Although it cancels when the fraction in Eq. (3) is reduced, we retain it for the sake of the subsequent analysis. This factor turns the summation over \(h_j\) into a Riemann sum and prevents the divergence of the partition function as \(s \rightarrow \infty\). It is noteworthy that when \(s = 1\), Eq. (3) is equivalent to the binary-RBM.

The marginal distribution of the RBM is expressed as

$$\begin{aligned} P_s(\varvec{v}\mid \theta )&=\sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}}P_s(\varvec{v}, \varvec{h} \mid \theta )\nonumber \\&=\frac{1}{Z_s(\theta )}\exp \Big (\sum _{i \in V}b_i v_i + \sum _{j \in H}\ln \phi _s\big (\lambda _j(\varvec{v},\theta )\big )\Big ), \end{aligned}$$
(5)

where \(\lambda _j(\varvec{v},\theta ){:}{=} c_j + \sum _{i \in V}w_{i,j}v_i\) and \(\phi _s(x){:}{=} \sum _{h \in \mathcal {X} (s)}2(s+1)^{-1} e^{x h}\). It is noteworthy that the factor \(2(s + 1)^{-1}\) in the definition of \(\phi _s(x)\) comes from \(\omega (s)\). Using the formula for a geometric series, we obtain

$$\begin{aligned} \phi _s(x) =\frac{2\sinh \{(s+1)x/s\}}{(s + 1)\sinh (x/s)} \end{aligned}$$
(6)

for \(1 \le s < \infty\). When \(s \rightarrow \infty\), we obtain

$$\begin{aligned} \phi _{\infty }(x) =\int _{-1}^{+1} e^{x h} {\text {d}}h = \frac{2 \sinh x}{x}. \end{aligned}$$
(7)
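The closed forms in Eqs. (6) and (7) can be cross-checked numerically against the defining sum of \(\phi _s(x)\). The sketch below (function names are ours) confirms the geometric-series identity and the \(s \rightarrow \infty\) limit:

```python
import math

def phi_direct(s, x):
    """phi_s(x) from its definition: (2/(s+1)) * sum over h in X(s) of e^{x h}."""
    return 2 / (s + 1) * sum(math.exp(x * (2 * k - s) / s) for k in range(s + 1))

def phi_closed(s, x):
    """Eq. (6): closed form via the geometric series (valid for x != 0)."""
    return 2 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s))

def phi_inf(x):
    """Eq. (7): the s -> infinity limit, the integral of e^{x h} over [-1, +1]."""
    return 2 * math.sinh(x) / x

print(abs(phi_direct(4, 1.3) - phi_closed(4, 1.3)))   # ~ 0 (identical up to rounding)
print(abs(phi_closed(10**6, 1.3) - phi_inf(1.3)))     # small, vanishing as s grows
```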

The factor \(2(s + 1)^{-1}\) ensures that \(\lim _{s\rightarrow \infty }\phi _s(x) = \phi _{\infty }(x)\). The conditional distributions are

$$\begin{aligned} P_s(\varvec{v} \mid \varvec{h}, \theta )&=\prod _{i \in V}P_s(v_i \mid \varvec{h}, \theta ),\quad P_s(v_i \mid \varvec{h}, \theta ) \propto \exp \big (\xi _i(\varvec{h},\theta ) v_i\big ), \end{aligned}$$
(8)
$$\begin{aligned} P_s(\varvec{h} \mid \varvec{v}, \theta )&=\prod _{j \in H}P_s(h_j \mid \varvec{v}, \theta ),\quad P_s(h_j \mid \varvec{v}, \theta ) \propto \exp \big (\lambda _j(\varvec{v},\theta ) h_j\big ), \end{aligned}$$
(9)

where \(\xi _i(\varvec{h},\theta ){:}{=} b_i + \sum _{j \in H}w_{i,j}h_j\). We can easily sample \(\varvec{v}\) from a given \(\varvec{h}\) using Eq. (8) and sample \(\varvec{h}\) from a given \(\varvec{v}\) using Eq. (9). Alternately repeating these two kinds of conditional samplings yields a (blocked) Gibbs sampling on the RBM. It is noteworthy that when \(s \rightarrow \infty\), the conditional sampling of \(\varvec{h}\) using Eq. (9) can be implemented using the inverse transform sampling. The cumulative distribution function of \(P_{\infty }(h_j \mid \varvec{v}, \theta )\) is

$$\begin{aligned} F(x){:}{=}\int _{-1}^x \frac{\exp \big (\lambda _j(\varvec{v},\theta ) h_j\big )}{\phi _{\infty }\big (\lambda _j(\varvec{v},\theta )\big )} {\text {d}}h_j=\frac{\exp \big (\lambda _j(\varvec{v},\theta ) x\big )- \exp \big (-\lambda _j(\varvec{v},\theta ) \big )}{2 \sinh \lambda _j(\varvec{v},\theta )}, \end{aligned}$$

and therefore, its inverse function is

$$\begin{aligned} F^{-1}(u){:}{=}\frac{1}{\lambda _j(\varvec{v},\theta )} \ln \big \{\exp \big (-\lambda _j(\varvec{v},\theta ) \big ) + 2 u \sinh \lambda _j(\varvec{v},\theta )\big \}. \end{aligned}$$

\(F^{-1}(u)\) is the sampled value of \(h_j\) from \(P_{\infty }(h_j \mid \varvec{v}, \theta )\), where u is a sample point from the uniform distribution over [0, 1].
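The inverse transform sampling described above can be sketched as follows. This is a minimal illustration with our own function name; the fallback to the uniform density when \(\lambda _j \rightarrow 0\) is our addition to avoid division by zero:

```python
import math
import random

def sample_h(lam, rng=random):
    """Inverse-transform sample of h_j from P_inf(h_j | v, theta) on [-1, +1].

    lam is lambda_j(v, theta). For lam -> 0 the density tends to the uniform
    density on [-1, +1], which we handle explicitly."""
    u = rng.random()  # u ~ uniform over [0, 1)
    if abs(lam) < 1e-12:
        return 2.0 * u - 1.0
    return math.log(math.exp(-lam) + 2.0 * u * math.sinh(lam)) / lam

random.seed(0)
samples = [sample_h(2.0) for _ in range(10000)]
# the empirical mean should approach coth(2) - 1/2, i.e. psi_inf(2) of Eq. (14)
print(sum(samples) / len(samples))
```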

2.1 Log-Likelihood Function and Its Gradients

Given N training data points for the visible layer, \(D_V{:}{=}\{ \mathbf {v} ^{(\mu )} \in \{-1,+1\}^{|V|} \mid \mu = 1,2,\ldots ,N\}\), the learning of RBM is done by maximizing the log-likelihood function (or the negative cross-entropy loss function), defined by

$$\begin{aligned} l_s(\theta ){:}{=}\frac{1}{N}\sum _{\mu = 1}^N \ln P_s( \mathbf {v} ^{(\mu )} \mid \theta ), \end{aligned}$$
(10)

with respect to \(\theta\) (namely, the maximum likelihood estimation). The distribution in the logarithmic function in Eq. (10) is the marginal distribution obtained in Eq. (5). The log-likelihood function is regarded as the negative training error. Usually, the log-likelihood function is maximized using a gradient ascent method. The gradients of the log-likelihood function with respect to the learning parameters are as follows.

$$\begin{aligned} \frac{\partial l_s(\theta )}{\partial b_i}&=\frac{1}{N}\sum _{\mu = 1}^N \text {v} _i^{(\mu )} -\langle v_i \rangle _s, \end{aligned}$$
(11)
$$\begin{aligned} \frac{\partial l_s(\theta )}{\partial c_j}&=\frac{1}{N}\sum _{\mu = 1}^N \psi _s\big (\lambda _j( \mathbf {v} ^{(\mu )},\theta )\big ) -\langle h_j \rangle _s, \end{aligned}$$
(12)
$$\begin{aligned} \frac{\partial l_s(\theta )}{\partial w_{i,j}}&=\frac{1}{N}\sum _{\mu = 1}^N \text {v} _i^{(\mu )} \psi _s\big (\lambda _j( \mathbf {v} ^{(\mu )},\theta )\big ) -\langle v_ih_j \rangle _s, \end{aligned}$$
(13)

where \(\langle \cdots \rangle _s\) is the expectation of RBM, i.e.,

$$\begin{aligned} \langle A(\varvec{v},\varvec{h}) \rangle _s{:}{=}\sum _{\varvec{v} \in \{-1,+1\}^{|V|}} \sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}} A(\varvec{v},\varvec{h}) P_s(\varvec{v}, \varvec{h} \mid \theta ), \end{aligned}$$

and

$$\begin{aligned} \psi _s(x){:}{=}\frac{\partial }{\partial x}\ln \phi _s(x)= \left\{ \begin{array}{ll} \frac{s + 1}{s \tanh \{(s + 1)x/s\}} - \frac{1}{s \tanh(x/s)} &{} 1 \le s < \infty \\ \frac{1}{\tanh x} - \frac{1}{x} &{} s \rightarrow \infty \end{array}\right. . \end{aligned}$$
(14)

The log-likelihood function can be maximized by a gradient ascent method with the gradients expressed in Eqs. (11)–(13). However, the evaluation of the expectations, \(\langle \cdots \rangle _s\), included in the above gradients is computationally hard: the computation time grows exponentially with the number of variables. Therefore, in practice, an approximate approach is used, for example, CD [5], pseudo-likelihood [10], composite likelihood [16], the Kullback–Leibler importance estimation procedure (KLIEP) [17], or the Thouless–Anderson–Palmer (TAP) approximation [3]. Among these, the CD method is the most popular. In the CD method, the intractable expectations in Eqs. (11)–(13) are approximated by sample averages over points generated by one-step Gibbs sampling using Eqs. (8) and (9), starting from each data point \(\mathbf {v} ^{(\mu )}\).
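A CD-1 step for the proposed RBM can be sketched as follows. This is our own illustrative implementation, not the authors' code: hidden units are sampled from Eq. (9) over \(\mathcal {X}(s)\) (finite s only), visible units from Eq. (8), and both the data-side and model-side hidden statistics use the analytic mean \(\psi _s\), a common CD variant:

```python
import numpy as np

def psi(lam, s):
    """psi_s(x) of Eq. (14); s=None denotes the s -> infinity limit."""
    lam = np.where(np.abs(lam) < 1e-8, 1e-8, lam)  # psi_s(0) = 0; avoid 0/0
    if s is None:
        return 1.0 / np.tanh(lam) - 1.0 / lam
    return (s + 1) / (s * np.tanh((s + 1) * lam / s)) - 1.0 / (s * np.tanh(lam / s))

def sample_hidden(lam, s, rng):
    """Sample each h_j from P_s(h_j | v, theta) over X(s), Eq. (9); finite s only."""
    X = np.array([(2 * k - s) / s for k in range(s + 1)])
    logits = lam[..., None] * X
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    idx = (rng.random(lam.shape + (1,)) > p.cumsum(axis=-1)).sum(axis=-1)
    return X[idx]

def cd1_gradients(V, b, c, W, s, rng):
    """One CD-1 estimate of the gradients in Eqs. (11)-(13).

    V: (N, |V|) array of +/-1 data points; W: (|V|, |H|) couplings."""
    lam_data = c + V @ W                              # lambda_j at the data points
    h = sample_hidden(lam_data, s, rng)               # h ~ P_s(h | v), Eq. (9)
    xi = b + h @ W.T                                  # xi_i(h, theta)
    V1 = np.where(rng.random(xi.shape) < (1.0 + np.tanh(xi)) / 2.0, 1.0, -1.0)  # Eq. (8)
    lam_model = c + V1 @ W
    db = V.mean(axis=0) - V1.mean(axis=0)
    dc = psi(lam_data, s).mean(axis=0) - psi(lam_model, s).mean(axis=0)
    dW = (V.T @ psi(lam_data, s) - V1.T @ psi(lam_model, s)) / len(V)
    return db, dc, dW
```

For \(s = 1\), \(\psi _1(x) = \tanh x\), so the sketch reduces to the familiar binary-RBM CD update.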

2.2 Numerical Experiment Using Artificial Data

In the numerical experiments in this section, we used two RBMs: the generative RBM (gRBM), \(P_1^{ \text {gen} }\), and the training RBM (tRBM), \(P_s^{ \text {train} }\). We obtained \(N = 200\) artificial training data points, \(D_V\), from the gRBM using Gibbs sampling, and subsequently trained the tRBM on these data points. The sizes of the visible layers of both RBMs were the same, namely, \(|V| = 8\). The sizes of the hidden layers of the gRBM and tRBM were set to \(|H| = 4\) and \(|H| = 4 + R\), respectively. The sample space of the hidden variables in the gRBM was \(\mathcal {X} (1) = \{-1,+1\}\), implying that the gRBM is a binary-RBM. The parameters of the gRBM were randomly drawn: \(b_i,c_j \sim G(0,0.1^2)\) and

$$\begin{aligned} w_{i,j}\sim U [-\sqrt{6/(|V|+|H|)},\sqrt{6/(|V|+|H|)}] \end{aligned}$$
(15)

(Xavier’s initialization [4]), where \(G(\mu ,\sigma ^2)\) is the Gaussian distribution and \(U[ \text {min} , \text {max} ]\) is the uniform distribution.

We trained the tRBM using the CD method. In the training, the parameters of the tRBM were initialized by \(b_i = c_j = 0\) and Eq. (15). In the gradient ascent method, we used full-batch learning with the Adam optimizer [7]. The quality of learning was measured using the Kullback–Leibler divergence (KLD) between the gRBM and tRBM:

$$\begin{aligned} \text {KLD} {:}{=}\frac{1}{|V|}\sum _{\varvec{v} \in \{-1,+1\}^{|V|}}P_1^{ \text {gen} }(\varvec{v}) \ln \frac{P_1^{ \text {gen} }(\varvec{v})}{P_s^{ \text {train} }(\varvec{v})}. \end{aligned}$$
(16)

The KLD is regarded as the (pseudo) distance between the gRBM and tRBM. Thus, it is a type of generalization error. We can evaluate the KLD (the generalization error) and log-likelihood function in Eq. (10) (the negative training error) because the sizes of the RBMs are not large.
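Because \(|V| = 8\) gives only \(2^8 = 256\) visible states, the KLD of Eq. (16) can indeed be computed by exact enumeration. A sketch with our own function names, taking unnormalized log-probabilities such as the exponent of Eq. (5):

```python
import numpy as np
from itertools import product

def kld_per_site(logp_gen, logp_train, n_visible):
    """Eq. (16): KLD divided by |V|, by exact enumeration of all 2^{|V|} states.

    logp_gen / logp_train map a (2^n, n) array of +/-1 states to
    *unnormalized* log-probabilities; normalization is done here."""
    states = np.array(list(product([-1.0, 1.0], repeat=n_visible)))

    def normalize(logp):
        logp = logp - logp.max()          # stabilize before exponentiating
        p = np.exp(logp)
        return p / p.sum()

    p = normalize(logp_gen(states))
    q = normalize(logp_train(states))
    return float(np.sum(p * (np.log(p) - np.log(q)))) / n_visible
```

For example, two independent-spin models \(\ln P(\varvec{v}) \propto \varvec{b}\cdot \varvec{v}\) with different biases give a strictly positive KLD, and identical models give zero.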

Fig. 2
figure 2

KLDs against the number of parameter updates (epochs) when a \(R = 0\) and b \(R=5\). We used the tRBM with \(s = 1,2,4,\infty\). These plots show the average over 300 experiments

Fig. 3
figure 3

Log-likelihoods (divided by |V|) against the number of parameter updates (epochs) when a \(R = 0\) and b \(R=5\). We used the tRBM with \(s = 1,2,4,\infty\). These plots show the average over 300 experiments

Figure 2a, b show the KLDs against the number of parameter updates (i.e., the number of gradient ascent updates). We observe that all KLDs increase as learning proceeds, owing to the effect of over-fitting. In Fig. 2a, because the gRBM and tRBM have the same structure (in other words, there is no model error), the effect of over-fitting is not severe. In contrast, in Fig. 2b, because the tRBM is more flexible than the gRBM, the effect of over-fitting tends to become severe; indeed, the KLDs increase more rapidly as learning proceeds. The KLD evidently increases more slowly for higher s. Figure 3a, b show the log-likelihood functions divided by |V| against the number of parameter updates. We observe that the log-likelihood function with lower s grows more rapidly; in other words, the training error of the tRBM with lower s decreases more rapidly. These results indicate that the multivalued hidden variables suppress over-fitting. In these experiments, the tRBM with \(s = \infty\) is the best in terms of generalization.

2.3 Effect of Multivalued Hidden Variables

In the numerical experiments described in the previous section, we demonstrated that multivalued hidden variables suppress over-fitting. In this section, we provide insight into the reason using a toy example. Although the model considered below is a simple RBM that differs significantly from practical RBMs, it is expected to illuminate the effect of multivalued hidden variables.

First, let us consider a simple RBM with two visible variables:

$$\begin{aligned} P_s(v_1,v_2, \varvec{h} \mid w) = \frac{\omega (s)}{Z_s(w)}\exp \Big (w \sum _{i=1}^2\sum _{j \in H} v_ih_j\Big ). \end{aligned}$$
(17)

The marginal distribution of Eq. (17) is

$$\begin{aligned} P_s(v_1,v_2 \mid w)=\frac{\exp \big \{ |H| \ln \phi _s \big (w(v_1 + v_2)\big )\big \}}{ 2\exp \big \{ |H| \ln \phi _s(2w)\big \} + 2 \exp \big ( |H| \ln 2\big )}, \end{aligned}$$
(18)

where we used \(\lim _{x \rightarrow 0}\phi _s(x) = 2\) and \(\phi _s(x) = \phi _s(-x)\). Because \(v_1, v_2 \in \{-1,+1\}\), Eq. (18) can be expanded as

$$\begin{aligned} P_s(v_1,v_2 \mid w) =\frac{1+ m_{s,1}(w) v_1 + m_{s,2}(w) v_2 + \alpha _s(w)v_1v_2}{4}, \end{aligned}$$
(19)

where

$$\begin{aligned} m_{s,i}(w)&{:}{=} \sum _{v_1,v_2 \in \{-1,+1\}} v_i P_s(v_1,v_2 \mid w) = 0, \\ \alpha _s(w)&{:}{=} \sum _{v_1,v_2 \in \{-1,+1\}} v_1v_2 P_s(v_1,v_2 \mid w)\\&=\frac{\exp \big \{ |H| \ln \phi _s(2w)\big \} - \exp \big ( |H| \ln 2\big )}{\exp \big \{ |H| \ln \phi _s(2w)\big \} + \exp \big ( |H| \ln 2\big )}\ge 0. \end{aligned}$$

Next, we consider the empirical distribution of \(D_V\):

$$\begin{aligned} Q_{D_V}(v_1,v_2){:}{=} \frac{1}{N}\sum _{\mu =1}^N \prod _{i = 1}^2\delta \big (v_i, \text {v} _i^{(\mu )}\big ), \end{aligned}$$

where \(\delta (x,y)\) is the Kronecker delta function. Similar to Eq. (19), the empirical distribution is expanded as

$$\begin{aligned} Q_{D_V}(v_1,v_2) =\frac{1+ d_1 v_1 + d_2 v_2 + \beta v_1v_2}{4}, \end{aligned}$$
(20)

where \(d_i {:}{=} \sum _{\mu =1}^N \text {v} _i^{(\mu )} / N\) and \(\beta {:}{=} \sum _{\mu =1}^N \text {v} _1^{(\mu )} \text {v} _2^{(\mu )} / N\). For simplicity, in the following discussion, we assume that \(d_1 = d_2 = 0\) and \(\beta \ge 0\). Under this assumption, using the expanded forms in Eqs. (19) and (20), the log-likelihood function of the simple RBM is expressed by

$$\begin{aligned} l_s(w)&= \sum _{v_1,v_2 \in \{-1,+1\}}Q_{D_V}(v_1,v_2)\ln P_s(v_1,v_2 \mid w) \nonumber \\&=\sum _{v_1,v_2 \in \{-1,+1\}}\frac{1 + \beta v_1 v_2}{4}\ln \frac{1 + \alpha _s(w)v_1v_2}{4}. \end{aligned}$$
(21)

Ultimately, the aim of the maximum likelihood estimation is to find the value of w that realizes \(P_s(v_1,v_2 \mid w) = Q_{D_V}(v_1,v_2)\) or, in other words, to find a value \(w_s^*\) that satisfies \(\alpha _s(w_s^*) = \beta\). The log-likelihood function in Eq. (21) is globally maximized at \(w = w_s^*\), and the RBM with \(w_s^*\) over-fits the data distribution. It can be shown that the function \(\alpha _s(w)\) has the following three properties: (i) it is symmetric with respect to w, (ii) it monotonically increases with an increase in \(w \ge 0\), and (iii) it monotonically decreases with an increase in s when \(|w| \not =0\). The function \(\alpha _s(w)\) with \(|H| = 2\) is shown in Fig. 4a as an example. These three properties lead to the inequality \(|w_s^*| < |w_{s+1}^*|\) for a given \(\beta > 0\), which implies that the global maximum point of the log-likelihood function in Eq. (21) moves away from the origin, \(w = 0\), as s increases (see Fig. 4b).
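The fixed points \(w_s^*\) can be computed numerically. The sketch below (our own code) solves \(\alpha _s(w_s^*) = \beta\) by bisection, using the monotonicity property (ii), and reproduces the values quoted in the caption of Fig. 4:

```python
import math

def alpha(s, w, H=2):
    """alpha_s(w) for |H| = H hidden units; s=None denotes s -> infinity."""
    x = 2.0 * w
    if x == 0.0:
        phi = 2.0                                           # phi_s(0) = 2
    elif s is None:
        phi = 2.0 * math.sinh(x) / x                        # Eq. (7)
    else:
        phi = 2.0 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s))  # Eq. (6)
    num = math.exp(H * math.log(phi)) - math.exp(H * math.log(2.0))
    den = math.exp(H * math.log(phi)) + math.exp(H * math.log(2.0))
    return num / den

def solve_w(s, beta=0.6, H=2):
    """Bisection for w* >= 0 with alpha_s(w*) = beta (alpha increases on w >= 0)."""
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if alpha(s, mid, H) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for s in (1, 2, 4, None):
    print(s, round(solve_w(s), 4))  # ≈ 0.6585, 0.7834, 0.8941, 1.0887
```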

Fig. 4
figure 4

a Plot of \(\alpha _s(w)\) versus w for various s when \(|H| = 2\). For \(\beta = 0.6\), the values of \(|w_s^*|\) for \(s = 1,2,4\), and \(\infty\) are approximately 0.6585, 0.7834, 0.8941, and 1.0887, respectively. b Plot of the log-likelihood function versus w for various s when \(|H| = 2\) and \(\beta = 0.6\). The shape of the peak around the global maximum point becomes sharper as s decreases

Usually, the initial value of w is set to a value around the origin. As shown in Fig. 4b, the global maximum point moves closer to the origin and its peak becomes sharper (in other words, the global maximum point becomes a stronger attractor) as s decreases. This implies that, with a gradient ascent type of algorithm, the RBM with a lower s can reach the global maximum point more rapidly and thus over-fits during an early stage of the learning, whereas the RBM with a higher s converges to the global maximum point more slowly, which prevents over-fitting during an early stage of the learning (Footnote 1). In fact, in the numerical results obtained in the previous section, both the generalization error (the KLD) and the log-likelihood function (the negative training error) increase faster as s decreases (cf. Figs. 2, 3).

From the above analysis, we found that the global maximum point moves away from the origin and becomes a weaker attractor as s increases. This suggests several expectations, for example: (i) in a more practical RBM, the log-likelihood function usually has several local maximum points, and thus the RBM with a higher s is more easily trapped by one of the local maximum points before converging to the global maximum point (namely, the over-fitting point); and (ii) some regularization methods, such as early stopping or \(L_2\) regularization, are more effective in the RBM with a higher s.

3 Application to Classification Problem

Let us consider a classification (or pattern recognition) problem in which an n-dimensional input vector \(\varvec{x} = (x_1, x_2,\ldots , x_n)^{ \text {T} } \in \mathbb {R}^n\) is classified into K different classes, \(C_1, C_2, \ldots , C_K\). It is convenient to use a 1-of-K representation (or a 1-of-K coding) to identify each class [1]. In the 1-of-K representation, each class corresponds to the K-dimensional vector \(\varvec{t} = (t_1, t_2,\ldots , t_K)^{ \text {T} }\), where \(t_k \in \{0,1\}\) and \(\sum _{k = 1}^K t_k = 1\), i.e., \(\varvec{t}\) is a vector in which the value of only one element is one and the remaining elements are zero. When \(t_k = 1\), \(\varvec{t}\) indicates class \(C_k\). For simplicity of the notation, we denote the 1-of-K vector, whose kth element is one, by \(\varvec{1}_k\). In the following section, we consider the application of the proposed RBM to the classification problem.
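A minimal encoding of the 1-of-K representation (the helper name is ours):

```python
import numpy as np

def one_of_k(k, K):
    """The 1-of-K vector 1_k: element k (1-indexed, as in the text) is one."""
    t = np.zeros(K)
    t[k - 1] = 1.0
    return t

print(one_of_k(3, 5))  # [0. 0. 1. 0. 0.]
```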

3.1 Discriminative Restricted Boltzmann Machine

A discriminative restricted Boltzmann machine (DRBM) was proposed to solve the classification problem [8, 9]; it is a conditional distribution of the output 1-of-K vector \(\varvec{t}\) conditioned on a continuous input vector \(\varvec{x}\). The conventional DRBM is obtained by a simple extension of the binary-RBM, through the following process. The visible variables in the RBM are divided into two layers, the input and output layers. The K visible variables assigned to the output layer, \(\varvec{t}\), are redefined as the 1-of-K vector with \(\varvec{1}_k\) as its realization (i.e., \(\varvec{t} \in \{\varvec{1}_k \mid k = 1,2,\ldots , K\}\)), and the n visible variables assigned to the input layer, \(\varvec{x}\), are redefined as the continuous input vector (see Fig. 5). Subsequently, we form a conditional distribution conditioned on the variables in the input layer: \(P(\varvec{t}, \varvec{h} \mid \varvec{x})\). Finally, by marginalizing the hidden variables out, we obtain the DRBM: \(P(\varvec{t}\mid \varvec{x}) = \sum _{\varvec{h}}P(\varvec{t}, \varvec{h} \mid \varvec{x})\).

Fig. 5
figure 5

The discriminative restricted Boltzmann machine is obtained as an extension of the RBM. Because the output layer corresponds to the 1-of-K vector, it takes only K different states. For distinction, the couplings between the input and hidden layers are represented by \(\varvec{w}^{(1)}\) and those between the hidden and output layers are represented by \(\varvec{w}^{(2)}\)

By using the proposed RBM instead of the binary-RBM, we obtain an extension of the conventional DRBM, i.e., a DRBM with multivalued hidden variables. The proposed DRBM for \(s \in \mathbb {N}\) is defined as

$$\begin{aligned} P_s(\varvec{t} \mid \varvec{x},\theta )&{:}{=}\frac{1}{ \mathcal {Z} _s(\varvec{x},\theta )} \exp \Big (\sum _{k=1}^K b_kt_k + \sum _{j \in H} \ln \phi _s\big (\zeta _j(\varvec{t},\varvec{x}, \theta ) \big )\Big ), \end{aligned}$$
(22)

where \(\zeta _j(\varvec{t},\varvec{x}, \theta ){:}{=} c_j + \sum _{k=1}^K w_{j,k}^{(2)}t_k + \sum _{i=1}^n w_{i,j}^{(1)}x_i\) and

$$\begin{aligned} \mathcal {Z} _s(\varvec{x},\theta ){:}{=}\sum _{k=1}^K \exp \Big (b_k + \sum _{j \in H} \ln \phi _s\big (\zeta _j(\varvec{1}_k,\varvec{x}, \theta ) \big )\Big ). \end{aligned}$$
(23)

The function \(\phi _s(x)\) appearing in Eqs. (22) and (23) is already defined in Eq. (6). It is noteworthy that when \(s = 1\), Eq. (22) is equivalent to the conventional DRBM proposed in Ref. [8]. Eq. (22) is regarded as the class probability; that is, \(P_s(\varvec{t} = \varvec{1}_k \mid \varvec{x},\theta )\) is the probability that the input \(\varvec{x}\) belongs to class \(C_k\). The input \(\varvec{x}\) should be assigned to the class that gives the maximum class probability.
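The class probability of Eq. (22) is cheap to evaluate because the normalization in Eq. (23) is a sum over only K terms. A sketch (variable names and array shapes are our assumptions):

```python
import numpy as np

def log_phi(lam, s):
    """ln phi_s(x), Eqs. (6)-(7); phi_s is even, so |lam| is used; s=None is s -> inf."""
    lam = np.maximum(np.abs(lam), 1e-8)  # safe near lam = 0 (phi_s(0) = 2)
    if s is None:
        return np.log(2.0 * np.sinh(lam) / lam)
    return np.log(2.0 * np.sinh((s + 1) * lam / s)) - np.log((s + 1) * np.sinh(lam / s))

def class_probs(x, b, c, W1, W2, s):
    """P_s(t = 1_k | x, theta) for k = 1..K, Eq. (22).

    b: (K,) output biases; c: (|H|,) hidden biases;
    W1: (n, |H|) input-hidden couplings; W2: (|H|, K) hidden-output couplings."""
    zeta = c[None, :] + W2.T + (x @ W1)[None, :]   # zeta_j(1_k, x, theta), (K, |H|)
    log_unnorm = b + log_phi(zeta, s).sum(axis=1)  # exponent of Eq. (22) per class
    log_unnorm -= log_unnorm.max()                 # stabilize the normalization
    p = np.exp(log_unnorm)
    return p / p.sum()
```

The predicted class is then simply `np.argmax(class_probs(x, b, c, W1, W2, s)) + 1`.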

Given N supervised training data points, \(D{:}{=}\{( \mathbf {t} ^{(\mu )} , \mathbf {x} ^{(\mu )}) \mid \mu = 1,2,\ldots , N\}\), the log-likelihood function of the proposed DRBM in Eq. (22) is defined as

$$\begin{aligned} l_s^{\dagger }(\theta ){:}{=}\frac{1}{N}\sum _{\mu =1}^N \ln P_s( \mathbf {t} ^{(\mu )} \mid \mathbf {x} ^{(\mu )},\theta ). \end{aligned}$$
(24)

The gradients of the log-likelihood function with respect to the parameters are obtained as follows.

$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial b_k}&=\frac{1}{N}\sum _{\mu = 1}^N \big [ \text {t} _k^{(\mu )} - P_s(\varvec{1}_k \mid \mathbf {x} ^{(\mu )},\theta )\big ], \end{aligned}$$
(25)
$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial c_j}&=\frac{1}{N}\sum _{\mu = 1}^N \big [\psi _s\big (\zeta _j( \mathbf {t} ^{(\mu )}, \mathbf {x} ^{(\mu )}, \theta )\big ) -\langle \psi _s\big (\zeta _j(\varvec{t}, \mathbf {x} ^{(\mu )}, \theta )\big ) \rangle _{\varvec{t}}^{(\mu , s)} \big ], \end{aligned}$$
(26)
$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial w_{i,j}^{(1)}}&=\frac{1}{N}\sum _{\mu = 1}^N \text {x} _i^{(\mu )} \big [\psi _s\big (\zeta _j( \mathbf {t} ^{(\mu )}, \mathbf {x} ^{(\mu )}, \theta )\big ) -\langle \psi _s\big (\zeta _j(\varvec{t}, \mathbf {x} ^{(\mu )}, \theta )\big ) \rangle _{\varvec{t}}^{(\mu , s)}\big ], \end{aligned}$$
(27)
$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial w_{j,k}^{(2)}}&=\frac{1}{N}\sum _{\mu = 1}^N \psi _s\big (\zeta _j(\varvec{1}_k, \mathbf {x} ^{(\mu )}, \theta )\big ) \big [ \text {t} _k^{(\mu )} - P_s(\varvec{1}_k \mid \mathbf {x} ^{(\mu )},\theta )\big ], \end{aligned}$$
(28)

where \(\langle \cdots \rangle _{\varvec{t}}^{(\mu , s)}\) denotes the expectation defined by

$$\begin{aligned} \langle A(\varvec{t}) \rangle _{\varvec{t}}^{(\mu , s)}{:}{=}\sum _{k=1}^K A(\varvec{1}_k) P_s(\varvec{1}_k \mid \mathbf {x} ^{(\mu )},\theta ). \end{aligned}$$

The function \(\psi _s(x)\) appearing in the above gradients is already defined in Eq. (14). It is noteworthy that the gradients expressed in Eqs. (25)–(28) can be computed without approximation, unlike those of the RBM, owing to the special structure of the DRBM. In the training, we maximize \(l_s^{\dagger }(\theta )\) with respect to \(\theta\) using a gradient ascent method with Eqs. (25)–(28).

3.2 Numerical Experiment Using MNIST Data Set

Fig. 6
figure 6

Misclassification errors against the number of parameter updates (epochs): a training error and b test error. Here, one epoch consists of one full update cycle over the training data set, implying that one epoch involves \(N /B = 10\) updates by the SGA in this case. We used the DRBM with \(s = 1,\infty\). These plots show the average over 120 experiments

In this section, we show the results of a numerical experiment using MNIST. MNIST is a data set of 10 different handwritten digits, \(0, 1, \ldots ,\) and 9, composed of 60,000 training data points and 10,000 test data points. Each data point comprises the input data, a \(28 \times 28\) (8-bit) digit image, and the corresponding target digit label. Therefore, for this data set, we set \(n = 784\) and \(K = 10\). All input images were normalized by dividing by 255 during preprocessing.

We trained the proposed DRBM with \(|H|=200\) using \(N = 1000\) training data points in MNIST and tested it using 10,000 test data points. In the training, we used stochastic gradient ascent (SGA) with a mini-batch size of \(B = 100\) and the AdaMax optimizer [7]. All coupling parameters were initialized by the Xavier method [4], and all bias parameters were initialized to zero. Figure 6 shows the plots of the misclassification rates for (a) the training data set and (b) the test data set versus the number of parameter updates. All input images in the test data set were corrupted by Gaussian noise with \(\sigma = 120\) before the normalization (Footnote 2). We observe that the DRBM with \(s = \infty\) is better in terms of generalization because it shows a higher training error and a lower test error. This indicates that the multivalued hidden variables are also effective in the DRBM.
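The test-time corruption can be sketched as follows; note that clipping the noisy pixels back to \([0, 255]\) is our assumption, as the text does not specify how out-of-range values were handled:

```python
import numpy as np

def corrupt_and_normalize(images_uint8, sigma=120.0, rng=None):
    """Add Gaussian noise (sigma in 8-bit pixel units), then divide by 255.

    Clipping back to [0, 255] before normalization is our assumption."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = images_uint8.astype(float) + rng.normal(0.0, sigma, size=images_uint8.shape)
    return np.clip(noisy, 0.0, 255.0) / 255.0
```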

4 Conclusion

In this paper, we proposed an RBM with multivalued hidden variables, which is a simple extension of the conventional binary-RBM, and showed that the proposed RBM is better than the binary-RBM in terms of the generalization property via numerical experiments on CD learning with artificial data (Sect. 2.2) and a classification problem with MNIST (Sect. 3.2).

It is important to understand why multivalued hidden variables are effective against over-fitting. We provided basic insight into this question by analyzing a simple example in Sect. 2.3. However, practical RBMs are much more complex than that example, so further analysis is needed to clarify the reason; we expect that a mean-field analysis [11] can be used for this purpose. Moreover, a criterion for over-fitting was provided in Ref. [2]; the relationship between this criterion and our multivalued hidden variables is also of interest. These issues will be addressed in our future studies.