1 Introduction

Generalization is one of the most important goals in statistical machine learning [1]. In many standard machine learning techniques, given a particular data set, we fit a probabilistic learning model to the empirical distribution (or data distribution) of the data set. When the learning model is sufficiently flexible, an appropriate learning method can fit it to the empirical distribution exactly. However, a learning model that is too close to the empirical distribution frequently gives poor results for new data points. This situation is known as over-fitting. Over-fitting impedes generalization; therefore, techniques that suppress over-fitting are needed to achieve good generalization. Regularization techniques, such as \(L_1\) and \(L_2\) regularization or their combination (the elastic net) [18], are popular for this purpose.

Here, we focus on a restricted Boltzmann machine (RBM) [5, 15]. RBMs have a wide range of applications such as collaborative filtering [14], classification [8], and deep learning [6, 12, 13]. The suppression of over-fitting is also important in RBMs. An RBM is a probabilistic neural network defined on a bipartite undirected graph comprising two different layers: the visible layer and the hidden layer. The visible layer, which consists of visible (random) variables, directly corresponds to the data points, while the hidden layer, which consists of hidden (random) variables, does not. The hidden layer creates complex correlations among the visible variables. The sample space of the visible variables is determined by the range of data elements, whereas the sample space of the hidden variables can be set freely. Typically, the hidden variables are given binary values (\(\{0,1\}\) or \(\{-1,+1\}\)).

In this study, we propose an RBM with multivalued hidden variables. The proposed RBM is a very simple extension of the conventional RBM with binary hidden variables (referred to as the binary-RBM in this paper). However, we demonstrate that the proposed RBM is better than the binary-RBM in terms of suppressing over-fitting. The remainder of this paper is organized as follows. We define the proposed RBM in Sect. 2 and explain its maximum likelihood estimation in Sect. 2.1. In Sect. 2.2, we demonstrate the validity of the proposed RBM using numerical experiments for contrastive divergence (CD) learning [5] with artificial data. We provide insight into the effect of our extension (i.e., the effect of multivalued hidden variables) using a toy example in Sect. 2.3. In Sect. 3, we apply the proposed RBM to a classification problem and show that it is also effective for this type of problem. Finally, the conclusion is given in Sect. 4.

2 Restricted Boltzmann Machine with Multivalued Hidden Variables

Fig. 1
figure 1

Bipartite graph consisting of two layers: the visible layer and the hidden layer. V and H are the sets of indices of the nodes in the visible layer and hidden layer, respectively

Let us consider a bipartite graph consisting of two different layers, the visible layer and the hidden layer, as shown in Fig. 1. Binary (or bipolar) visible variables, \(\varvec{v} {:}{=} \{v_i \in \{-1,+1\} \mid i \in V\}\), are assigned to the nodes in the visible layer. Similarly, hidden variables, \(\varvec{h} {:}{=} \{h_j \in \mathcal {X} (s) \mid j \in H\}\), are assigned to the nodes in the hidden layer, where \(\mathcal {X} (s)\) is the sample space defined by

$$\begin{aligned} \mathcal {X} (s){:}{=} \big \{ (2k - s)/s \mid k= 0,1,\ldots ,s \big \} \quad s \in \mathbb {N}, \end{aligned}$$
(1)

where \(\mathbb {N}\) is the set of all natural numbers. For example, \(\mathcal {X} (1) = \{-1,+1\}\), \(\mathcal {X} (2) = \{-1,0,+1\}\), and \(\mathcal {X} (3) = \{-1,-1/3, +1/3,+1\}\). Namely, \(\mathcal {X} (s)\) is the set of \(s+1\) points that evenly partition the interval \([-1,+1]\) into s parts. We define that in the limit of \(s \rightarrow \infty\), \(\mathcal {X} (s)\) becomes the continuous space \([-1, +1]\), i.e., \(\mathcal {X} (\infty ) = [-1,+1]\). On the bipartite graph, we define the energy function for \(s \in \mathbb {N}\) as
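As a concrete illustration, the sample space in Eq. (1) can be enumerated with a few lines of Python (the function name `sample_space` is ours):

```python
def sample_space(s):
    """X(s) of Eq. (1): the s + 1 points (2k - s)/s for k = 0, 1, ..., s."""
    return [(2 * k - s) / s for k in range(s + 1)]

print(sample_space(1))  # [-1.0, 1.0]
print(sample_space(2))  # [-1.0, 0.0, 1.0]
```

For large s, these points densely fill \([-1,+1]\), consistent with \(\mathcal {X} (\infty ) = [-1,+1]\).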

$$\begin{aligned} E_s(\varvec{v},\varvec{h} ; \theta ){:}{=} -\sum _{i \in V}b_i v_i-\sum _{j \in H}c_j h_j -\sum _{i \in V}\sum _{j \in H}w_{i,j}v_i h_j, \end{aligned}$$
(2)

where \(\{b_i\}\), \(\{c_j\}\), and \(\{w_{i,j}\}\) are the learning parameters of the energy function, and they are collectively denoted by \(\theta\). Specifically, \(\{b_i\}\) and \(\{c_j\}\) are the biases for the visible and hidden variables, respectively, and \(\{w_{i,j}\}\) are the couplings between the visible and hidden variables. Our RBM is defined in the form of a Boltzmann distribution in terms of the energy function given in Eq. (2):

$$\begin{aligned} P_s(\varvec{v}, \varvec{h} \mid \theta ){:}{=}\frac{\omega (s)}{Z_s(\theta )} \exp \big (-E_s(\varvec{v},\varvec{h} ; \theta )\big ), \end{aligned}$$
(3)

where

$$\begin{aligned} Z_s(\theta ){:}{=}\sum _{\varvec{v} \in \{-1,+1\}^{|V|}}\sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}} \omega (s)\exp \big (-E_s(\varvec{v},\varvec{h} ; \theta )\big ) \end{aligned}$$
(4)

is the partition function. The multiple summations in Eq. (4) mean

$$\begin{aligned} \sum _{\varvec{v} \in \{-1,+1\}^{|V|}}&= \sum _{v_1 \in \{-1,+1\}}\sum _{v_2 \in \{-1,+1\}}\cdots \sum _{v_{|V|} \in \{-1,+1\}}, \\ \sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}}&= \sum _{h_1 \in \mathcal {X} (s)}\sum _{h_2 \in \mathcal {X} (s)}\cdots \sum _{h_{|H|} \in \mathcal {X} (s)}. \end{aligned}$$

The factor \(\omega (s){:}{=} \{2/(s + 1)\}^{|H|}\) appearing in Eqs. (3) and (4) is a constant unrelated to \(\varvec{v}\) and \(\varvec{h}\). Although it cancels when the fraction in Eq. (3) is reduced, we retain it for the sake of the subsequent analysis. This factor turns the summation over \(h_j\) into a Riemann sum and prevents the divergence of the partition function as \(s \rightarrow \infty\). It is noteworthy that when \(s = 1\), Eq. (3) is equivalent to the binary-RBM.

The marginal distribution of the RBM is expressed as

$$\begin{aligned} P_s(\varvec{v}\mid \theta )&=\sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}}P_s(\varvec{v}, \varvec{h} \mid \theta )\nonumber \\&=\frac{1}{Z_s(\theta )}\exp \Big (\sum _{i \in V}b_i v_i + \sum _{j \in H}\ln \phi _s\big (\lambda _j(\varvec{v},\theta )\big )\Big ), \end{aligned}$$
(5)

where \(\lambda _j(\varvec{v},\theta ){:}{=} c_j + \sum _{i \in V}w_{i,j}v_i\) and \(\phi _s(x){:}{=} \sum _{h \in \mathcal {X} (s)}2(s+1)^{-1} e^{x h}\). It is noteworthy that the factor \(2(s + 1)^{-1}\) in the definition of \(\phi _s(x)\) comes from \(\omega (s)\). Using the formula for a geometric series, we obtain

$$\begin{aligned} \phi _s(x) =\frac{2\sinh \{(s+1)x/s\}}{(s + 1)\sinh (x/s)} \end{aligned}$$
(6)

for \(1 \le s < \infty\). When \(s \rightarrow \infty\), we obtain

$$\begin{aligned} \phi _{\infty }(x) =\int _{-1}^{+1} e^{x h} {\text {d}}h = \frac{2 \sinh x}{x}. \end{aligned}$$
(7)
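The closed forms in Eqs. (6) and (7) can be cross-checked numerically against the defining sum of \(\phi _s(x)\). The sketch below (function names are ours) confirms the geometric-series identity and the \(s \rightarrow \infty\) limit:

```python
import math

def phi_direct(s, x):
    """phi_s(x) from its definition: (2/(s+1)) * sum over h in X(s) of e^{x h}."""
    return 2 / (s + 1) * sum(math.exp(x * (2 * k - s) / s) for k in range(s + 1))

def phi_closed(s, x):
    """Eq. (6): closed form via the geometric series (valid for x != 0)."""
    return 2 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s))

def phi_inf(x):
    """Eq. (7): the s -> infinity limit, the integral of e^{x h} over [-1, +1]."""
    return 2 * math.sinh(x) / x

print(abs(phi_direct(4, 1.3) - phi_closed(4, 1.3)))   # ~ 0 (identical up to rounding)
print(abs(phi_closed(10**6, 1.3) - phi_inf(1.3)))     # small, vanishing as s grows
```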

The factor \(2(s + 1)^{-1}\) ensures that \(\lim _{s\rightarrow \infty }\phi _s(x) = \phi _{\infty }(x)\). The conditional distributions are

$$\begin{aligned} P_s(\varvec{v} \mid \varvec{h}, \theta )&=\prod _{i \in V}P_s(v_i \mid \varvec{h}, \theta ),\quad P_s(v_i \mid \varvec{h}, \theta ) \propto \exp \big (\xi _i(\varvec{h},\theta ) v_i\big ), \end{aligned}$$
(8)
$$\begin{aligned} P_s(\varvec{h} \mid \varvec{v}, \theta )&=\prod _{j \in H}P_s(h_j \mid \varvec{v}, \theta ),\quad P_s(h_j \mid \varvec{v}, \theta ) \propto \exp \big (\lambda _j(\varvec{v},\theta ) h_j\big ), \end{aligned}$$
(9)

where \(\xi _i(\varvec{h},\theta ){:}{=} b_i + \sum _{j \in H}w_{i,j}h_j\). We can easily sample \(\varvec{v}\) from a given \(\varvec{h}\) using Eq. (8) and sample \(\varvec{h}\) from a given \(\varvec{v}\) using Eq. (9). Alternately repeating these two kinds of conditional samplings yields a (blocked) Gibbs sampling on the RBM. It is noteworthy that when \(s \rightarrow \infty\), the conditional sampling of \(\varvec{h}\) using Eq. (9) can be implemented using the inverse transform sampling. The cumulative distribution function of \(P_{\infty }(h_j \mid \varvec{v}, \theta )\) is

$$\begin{aligned} F(x){:}{=}\int _{-1}^x \frac{\exp \big (\lambda _j(\varvec{v},\theta ) h_j\big )}{\phi _{\infty }\big (\lambda _j(\varvec{v},\theta )\big )} {\text {d}}h_j=\frac{\exp \big (\lambda _j(\varvec{v},\theta ) x\big )- \exp \big (-\lambda _j(\varvec{v},\theta ) \big )}{2 \sinh \lambda _j(\varvec{v},\theta )}, \end{aligned}$$

and therefore, its inverse function is

$$\begin{aligned} F^{-1}(u){:}{=}\frac{1}{\lambda _j(\varvec{v},\theta )} \ln \big \{\exp \big (-\lambda _j(\varvec{v},\theta ) \big ) + 2 u \sinh \lambda _j(\varvec{v},\theta )\big \}. \end{aligned}$$

\(F^{-1}(u)\) is the sampled value of \(h_j\) from \(P_{\infty }(h_j \mid \varvec{v}, \theta )\), where u is a sample point from the uniform distribution over [0, 1].
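The inverse transform sampling described above can be sketched as follows. This is a minimal illustration with our own function name; the fallback to the uniform density when \(\lambda _j \rightarrow 0\) is our addition to avoid division by zero:

```python
import math
import random

def sample_h(lam, rng=random):
    """Inverse-transform sample of h_j from P_inf(h_j | v, theta) on [-1, +1].

    lam is lambda_j(v, theta). For lam -> 0 the density tends to the uniform
    density on [-1, +1], which we handle explicitly."""
    u = rng.random()  # u ~ uniform over [0, 1)
    if abs(lam) < 1e-12:
        return 2.0 * u - 1.0
    return math.log(math.exp(-lam) + 2.0 * u * math.sinh(lam)) / lam

random.seed(0)
samples = [sample_h(2.0) for _ in range(10000)]
# the empirical mean should approach coth(2) - 1/2, i.e. psi_inf(2) of Eq. (14)
print(sum(samples) / len(samples))
```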

2.1 Log-Likelihood Function and Its Gradients

Given N training data points for the visible layer, \(D_V{:}{=}\{ \mathbf {v} ^{(\mu )} \in \{-1,+1\}^{|V|} \mid \mu = 1,2,\ldots ,N\}\), the learning of RBM is done by maximizing the log-likelihood function (or the negative cross-entropy loss function), defined by

$$\begin{aligned} l_s(\theta ){:}{=}\frac{1}{N}\sum _{\mu = 1}^N \ln P_s( \mathbf {v} ^{(\mu )} \mid \theta ), \end{aligned}$$
(10)

with respect to \(\theta\) (namely, the maximum likelihood estimation). The distribution in the logarithmic function in Eq. (10) is the marginal distribution obtained in Eq. (5). The log-likelihood function is regarded as the negative training error. Usually, the log-likelihood function is maximized using a gradient ascent method. The gradients of the log-likelihood function with respect to the learning parameters are as follows.

$$\begin{aligned} \frac{\partial l_s(\theta )}{\partial b_i}&=\frac{1}{N}\sum _{\mu = 1}^N \text {v} _i^{(\mu )} -\langle v_i \rangle _s, \end{aligned}$$
(11)
$$\begin{aligned} \frac{\partial l_s(\theta )}{\partial c_j}&=\frac{1}{N}\sum _{\mu = 1}^N \psi _s\big (\lambda _j( \mathbf {v} ^{(\mu )},\theta )\big ) -\langle h_j \rangle _s, \end{aligned}$$
(12)
$$\begin{aligned} \frac{\partial l_s(\theta )}{\partial w_{i,j}}&=\frac{1}{N}\sum _{\mu = 1}^N \text {v} _i^{(\mu )} \psi _s\big (\lambda _j( \mathbf {v} ^{(\mu )},\theta )\big ) -\langle v_ih_j \rangle _s, \end{aligned}$$
(13)

where \(\langle \cdots \rangle _s\) is the expectation of RBM, i.e.,

$$\begin{aligned} \langle A(\varvec{v},\varvec{h}) \rangle _s{:}{=}\sum _{\varvec{v} \in \{-1,+1\}^{|V|}} \sum _{\varvec{h} \in \mathcal {X} (s)^{|H|}} A(\varvec{v},\varvec{h}) P_s(\varvec{v}, \varvec{h} \mid \theta ), \end{aligned}$$

and

$$\begin{aligned} \psi _s(x){:}{=}\frac{\partial }{\partial x}\ln \phi _s(x)= \left\{ \begin{array}{ll} \frac{s + 1}{s \tanh \{(s + 1)x/s\}} - \frac{1}{s \tanh(x/s)} &{} 1 \le s < \infty \\ \frac{1}{\tanh x} - \frac{1}{x} &{} s \rightarrow \infty \end{array}\right. . \end{aligned}$$
(14)

The log-likelihood function can be maximized by a gradient ascent method with the gradients expressed in Eqs. (11)–(13). However, the evaluation of the expectations, \(\langle \cdots \rangle _s\), included in the above gradients is computationally hard: the computation time grows exponentially with the number of variables. Therefore, in practice, an approximate approach is used, for example, CD [5], pseudo-likelihood [10], composite likelihood [16], the Kullback–Leibler importance estimation procedure (KLIEP) [17], or the Thouless–Anderson–Palmer (TAP) approximation [3]. Among these, the CD method is the most popular. In the CD method, the intractable expectations in Eqs. (11)–(13) are approximated by sample averages over points generated by one-step Gibbs sampling using Eqs. (8) and (9), starting from each data point \(\mathbf {v} ^{(\mu )}\).
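A CD-1 step for the proposed RBM can be sketched as follows. This is our own illustrative implementation, not the authors' code: hidden units are sampled from Eq. (9) over \(\mathcal {X}(s)\) (finite s only), visible units from Eq. (8), and both the data-side and model-side hidden statistics use the analytic mean \(\psi _s\), a common CD variant:

```python
import numpy as np

def psi(lam, s):
    """psi_s(x) of Eq. (14); s=None denotes the s -> infinity limit."""
    lam = np.where(np.abs(lam) < 1e-8, 1e-8, lam)  # psi_s(0) = 0; avoid 0/0
    if s is None:
        return 1.0 / np.tanh(lam) - 1.0 / lam
    return (s + 1) / (s * np.tanh((s + 1) * lam / s)) - 1.0 / (s * np.tanh(lam / s))

def sample_hidden(lam, s, rng):
    """Sample each h_j from P_s(h_j | v, theta) over X(s), Eq. (9); finite s only."""
    X = np.array([(2 * k - s) / s for k in range(s + 1)])
    logits = lam[..., None] * X
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    idx = (rng.random(lam.shape + (1,)) > p.cumsum(axis=-1)).sum(axis=-1)
    return X[idx]

def cd1_gradients(V, b, c, W, s, rng):
    """One CD-1 estimate of the gradients in Eqs. (11)-(13).

    V: (N, |V|) array of +/-1 data points; W: (|V|, |H|) couplings."""
    lam_data = c + V @ W                              # lambda_j at the data points
    h = sample_hidden(lam_data, s, rng)               # h ~ P_s(h | v), Eq. (9)
    xi = b + h @ W.T                                  # xi_i(h, theta)
    V1 = np.where(rng.random(xi.shape) < (1.0 + np.tanh(xi)) / 2.0, 1.0, -1.0)  # Eq. (8)
    lam_model = c + V1 @ W
    db = V.mean(axis=0) - V1.mean(axis=0)
    dc = psi(lam_data, s).mean(axis=0) - psi(lam_model, s).mean(axis=0)
    dW = (V.T @ psi(lam_data, s) - V1.T @ psi(lam_model, s)) / len(V)
    return db, dc, dW
```

For \(s = 1\), \(\psi _1(x) = \tanh x\), so the sketch reduces to the familiar binary-RBM CD update.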

2.2 Numerical Experiment Using Artificial Data

In the numerical experiments in this section, we used two RBMs: the generative RBM (gRBM), \(P_1^{ \text {gen} }\), and the training RBM (tRBM), \(P_s^{ \text {train} }\). We obtained \(N = 200\) artificial training data points, \(D_V\), from the gRBM using Gibbs sampling, and subsequently trained the tRBM on these data points. The sizes of the visible layers of both RBMs were the same, namely, \(|V| = 8\). The sizes of the hidden layers of the gRBM and tRBM were set to \(|H| = 4\) and \(|H| = 4 + R\), respectively. The sample space of the hidden variables in the gRBM was \(\mathcal {X} (1) = \{-1,+1\}\), implying that the gRBM is a binary-RBM. The parameters of the gRBM were randomly drawn: \(b_i,c_j \sim G(0,0.1^2)\) and

$$\begin{aligned} w_{i,j}\sim U [-\sqrt{6/(|V|+|H|)},\sqrt{6/(|V|+|H|)}] \end{aligned}$$
(15)

(Xavier’s initialization [4]), where \(G(\mu ,\sigma ^2)\) is the Gaussian distribution and \(U[ \text {min} , \text {max} ]\) is the uniform distribution.

We trained the tRBM using the CD method. In the training, the parameters of the tRBM were initialized by \(b_i = c_j = 0\) and Eq. (15). In the gradient ascent method, we used full-batch learning with the Adam optimizer [7]. The quality of learning was measured using the Kullback–Leibler divergence (KLD) between the gRBM and tRBM:

$$\begin{aligned} \text {KLD} {:}{=}\frac{1}{|V|}\sum _{\varvec{v} \in \{-1,+1\}^{|V|}}P_1^{ \text {gen} }(\varvec{v}) \ln \frac{P_1^{ \text {gen} }(\varvec{v})}{P_s^{ \text {train} }(\varvec{v})}. \end{aligned}$$
(16)

The KLD is regarded as the (pseudo) distance between the gRBM and tRBM. Thus, it is a type of generalization error. We can evaluate the KLD (the generalization error) and log-likelihood function in Eq. (10) (the negative training error) because the sizes of the RBMs are not large.
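Because \(|V| = 8\) gives only \(2^8 = 256\) visible states, the KLD of Eq. (16) can indeed be computed by exact enumeration. A sketch with our own function names, taking unnormalized log-probabilities such as the exponent of Eq. (5):

```python
import numpy as np
from itertools import product

def kld_per_site(logp_gen, logp_train, n_visible):
    """Eq. (16): KLD divided by |V|, by exact enumeration of all 2^{|V|} states.

    logp_gen / logp_train map a (2^n, n) array of +/-1 states to
    *unnormalized* log-probabilities; normalization is done here."""
    states = np.array(list(product([-1.0, 1.0], repeat=n_visible)))

    def normalize(logp):
        logp = logp - logp.max()          # stabilize before exponentiating
        p = np.exp(logp)
        return p / p.sum()

    p = normalize(logp_gen(states))
    q = normalize(logp_train(states))
    return float(np.sum(p * (np.log(p) - np.log(q)))) / n_visible
```

For example, two independent-spin models \(\ln P(\varvec{v}) \propto \varvec{b}\cdot \varvec{v}\) with different biases give a strictly positive KLD, and identical models give zero.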

Fig. 2
figure 2

KLDs against the number of parameter updates (epochs) when a \(R = 0\) and b \(R=5\). We used the tRBM with \(s = 1,2,4,\infty\). These plots show the average over 300 experiments

Fig. 3
figure 3

Log-likelihoods (divided by |V|) against the number of parameter updates (epochs) when a \(R = 0\) and b \(R=5\). We used the tRBM with \(s = 1,2,4,\infty\). These plots show the average over 300 experiments

Figure 2a, b show the KLDs against the number of parameter updates (i.e., the number of gradient ascent updates). We observe that all KLDs increase as learning proceeds, owing to the effect of over-fitting. In Fig. 2a, because the gRBM and tRBM have the same structure (in other words, there is no model error), the effect of over-fitting is not severe. In contrast, in Fig. 2b, because the tRBM is more flexible than the gRBM, the effect of over-fitting tends to become severe; indeed, the KLDs increase more rapidly as learning proceeds. The KLD evidently increases more slowly for higher s. Figure 3a, b show the log-likelihood functions divided by |V| against the number of parameter updates. We observe that the log-likelihood function with lower s grows more rapidly; in other words, the training error of the tRBM with lower s decreases more rapidly. These results indicate that the multivalued hidden variables suppress over-fitting. In these experiments, the tRBM with \(s = \infty\) is the best in terms of generalization.

2.3 Effect of Multivalued Hidden Variables

In the numerical experiments described in the previous section, we demonstrated that multivalued hidden variables suppress over-fitting. In this section, we provide insight into the reason using a toy example. Although the model considered below is a simple RBM that differs significantly from practical RBMs, it is expected to illuminate the effect of multivalued hidden variables.

First, let us consider a simple RBM with two visible variables:

$$\begin{aligned} P_s(v_1,v_2, \varvec{h} \mid w) = \frac{\omega (s)}{Z_s(w)}\exp \Big (w \sum _{i=1}^2\sum _{j \in H} v_ih_j\Big ). \end{aligned}$$
(17)

The marginal distribution of Eq. (17) is

$$\begin{aligned} P_s(v_1,v_2 \mid w)=\frac{\exp \big \{ |H| \ln \phi _s \big (w(v_1 + v_2)\big )\big \}}{ 2\exp \big \{ |H| \ln \phi _s(2w)\big \} + 2 \exp \big ( |H| \ln 2\big )}, \end{aligned}$$
(18)

where we used \(\lim _{x \rightarrow 0}\phi _s(x) = 2\) and \(\phi _s(x) = \phi _s(-x)\). Because \(v_1, v_2 \in \{-1,+1\}\), Eq. (18) can be expanded as

$$\begin{aligned} P_s(v_1,v_2 \mid w) =\frac{1+ m_{s,1}(w) v_1 + m_{s,2}(w) v_2 + \alpha _s(w)v_1v_2}{4}, \end{aligned}$$
(19)

where

$$\begin{aligned} m_{s,i}(w)&{:}{=} \sum _{v_1,v_2 \in \{-1,+1\}} v_i P_s(v_1,v_2 \mid w) = 0, \\ \alpha _s(w)&{:}{=} \sum _{v_1,v_2 \in \{-1,+1\}} v_1v_2 P_s(v_1,v_2 \mid w)\\&=\frac{\exp \big \{ |H| \ln \phi _s(2w)\big \} - \exp \big ( |H| \ln 2\big )}{\exp \big \{ |H| \ln \phi _s(2w)\big \} + \exp \big ( |H| \ln 2\big )}\ge 0. \end{aligned}$$

Next, we consider the empirical distribution of \(D_V\):

$$\begin{aligned} Q_{D_V}(v_1,v_2){:}{=} \frac{1}{N}\sum _{\mu =1}^N \prod _{i = 1}^2\delta \big (v_i, \text {v} _i^{(\mu )}\big ), \end{aligned}$$

where \(\delta (x,y)\) is the Kronecker delta function. Similar to Eq. (19), the empirical distribution is expanded as

$$\begin{aligned} Q_{D_V}(v_1,v_2) =\frac{1+ d_1 v_1 + d_2 v_2 + \beta v_1v_2}{4}, \end{aligned}$$
(20)

where \(d_i {:}{=} \sum _{\mu =1}^N \text {v} _i^{(\mu )} / N\) and \(\beta {:}{=} \sum _{\mu =1}^N \text {v} _1^{(\mu )} \text {v} _2^{(\mu )} / N\). For simplicity, in the following discussion, we assume that \(d_1 = d_2 = 0\) and \(\beta \ge 0\). Under this assumption, using the expanded forms in Eqs. (19) and (20), the log-likelihood function of the simple RBM is expressed by

$$\begin{aligned} l_s(w)&= \sum _{v_1,v_2 \in \{-1,+1\}}Q_{D_V}(v_1,v_2)\ln P_s(v_1,v_2 \mid w) \nonumber \\&=\sum _{v_1,v_2 \in \{-1,+1\}}\frac{1 + \beta v_1 v_2}{4}\ln \frac{1 + \alpha _s(w)v_1v_2}{4}. \end{aligned}$$
(21)

Ultimately, the aim of the maximum likelihood estimation is to find the value of w that realizes \(P_s(v_1,v_2 \mid w) = Q_{D_V}(v_1,v_2)\) or, in other words, to find a value \(w_s^*\) that satisfies \(\alpha _s(w_s^*) = \beta\). The log-likelihood function in Eq. (21) is globally maximized at \(w = w_s^*\), and the RBM with \(w_s^*\) over-fits the data distribution. It can be shown that the function \(\alpha _s(w)\) has the following three properties: (i) it is symmetric with respect to w, (ii) it monotonically increases with an increase in \(w \ge 0\), and (iii) it monotonically decreases with an increase in s when \(|w| \not =0\). The function \(\alpha _s(w)\) with \(|H| = 2\) is shown in Fig. 4a as an example. These three properties lead to the inequality \(|w_s^*| < |w_{s+1}^*|\) for a given \(\beta > 0\), which implies that the global maximum point of the log-likelihood function in Eq. (21) moves away from the origin, \(w = 0\), as s increases (see Fig. 4b).
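The fixed points \(w_s^*\) can be computed numerically. The sketch below (our own code) solves \(\alpha _s(w_s^*) = \beta\) by bisection, using the monotonicity property (ii), and reproduces the values quoted in the caption of Fig. 4:

```python
import math

def alpha(s, w, H=2):
    """alpha_s(w) for |H| = H hidden units; s=None denotes s -> infinity."""
    x = 2.0 * w
    if x == 0.0:
        phi = 2.0                                           # phi_s(0) = 2
    elif s is None:
        phi = 2.0 * math.sinh(x) / x                        # Eq. (7)
    else:
        phi = 2.0 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s))  # Eq. (6)
    num = math.exp(H * math.log(phi)) - math.exp(H * math.log(2.0))
    den = math.exp(H * math.log(phi)) + math.exp(H * math.log(2.0))
    return num / den

def solve_w(s, beta=0.6, H=2):
    """Bisection for w* >= 0 with alpha_s(w*) = beta (alpha increases on w >= 0)."""
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if alpha(s, mid, H) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for s in (1, 2, 4, None):
    print(s, round(solve_w(s), 4))  # ≈ 0.6585, 0.7834, 0.8941, 1.0887
```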

Fig. 4
figure 4

a Plot of \(\alpha _s(w)\) versus w for various s when \(|H| = 2\). For \(\beta = 0.6\), the values of \(|w_s^*|\) for \(s = 1,2,4\), and \(\infty\) are approximately 0.6585, 0.7834, 0.8941, and 1.0887, respectively. b Plot of the log-likelihood function versus w for various s when \(|H| = 2\) and \(\beta = 0.6\). The shape of the peak around the global maximum point becomes sharper as s decreases

Usually, the initial value of w is set to a value around the origin. As shown in Fig. 4b, the global maximum point moves closer to the origin and its peak becomes sharper (in other words, the global maximum point becomes a stronger attractor) as s decreases. This implies that, with a gradient ascent type of algorithm, the RBM with a lower s can reach the global maximum point more rapidly and thus over-fits during an early stage of the learning, whereas the RBM with a higher s converges to the global maximum point more slowly, which prevents over-fitting during an early stage of the learning (Footnote 1). In fact, in the numerical results obtained in the previous section, both the generalization error (the KLD) and the log-likelihood function (the negative training error) increase faster as s decreases (cf. Figs. 2, 3).

From the above analysis, we found that the global maximum point moves away from the origin and becomes a weaker attractor as s increases. This suggests several expectations, for example: (i) in a more practical RBM, the log-likelihood function usually has several local maximum points, and thus the RBM with a higher s is more easily trapped by one of the local maximum points before converging to the global maximum point (namely, the over-fitting point); and (ii) some regularization methods, such as early stopping or \(L_2\) regularization, are more effective in the RBM with a higher s.

3 Application to Classification Problem

Let us consider a classification (or pattern recognition) problem in which an n-dimensional input vector \(\varvec{x} = (x_1, x_2,\ldots , x_n)^{ \text {T} } \in \mathbb {R}^n\) is classified into K different classes, \(C_1, C_2, \ldots , C_K\). It is convenient to use a 1-of-K representation (or a 1-of-K coding) to identify each class [1]. In the 1-of-K representation, each class corresponds to the K-dimensional vector \(\varvec{t} = (t_1, t_2,\ldots , t_K)^{ \text {T} }\), where \(t_k \in \{0,1\}\) and \(\sum _{k = 1}^K t_k = 1\), i.e., \(\varvec{t}\) is a vector in which the value of only one element is one and the remaining elements are zero. When \(t_k = 1\), \(\varvec{t}\) indicates class \(C_k\). For simplicity of the notation, we denote the 1-of-K vector, whose kth element is one, by \(\varvec{1}_k\). In the following section, we consider the application of the proposed RBM to the classification problem.
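A minimal encoding of the 1-of-K representation (the helper name is ours):

```python
import numpy as np

def one_of_k(k, K):
    """The 1-of-K vector 1_k: element k (1-indexed, as in the text) is one."""
    t = np.zeros(K)
    t[k - 1] = 1.0
    return t

print(one_of_k(3, 5))  # [0. 0. 1. 0. 0.]
```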

3.1 Discriminative Restricted Boltzmann Machine

A discriminative restricted Boltzmann machine (DRBM) was proposed to solve the classification problem [8, 9]; it is a conditional distribution of the output 1-of-K vector \(\varvec{t}\) conditioned on a continuous input vector \(\varvec{x}\). The conventional DRBM is obtained by a simple extension of the binary-RBM, through the following process. The visible variables in the RBM are divided into two layers, the input and output layers. The K visible variables assigned to the output layer, \(\varvec{t}\), are redefined as the 1-of-K vector with \(\varvec{1}_k\) as its realization (i.e., \(\varvec{t} \in \{\varvec{1}_k \mid k = 1,2,\ldots , K\}\)), and the n visible variables assigned to the input layer, \(\varvec{x}\), are redefined as the continuous input vector (see Fig. 5). Subsequently, we form a conditional distribution conditioned on the variables in the input layer: \(P(\varvec{t}, \varvec{h} \mid \varvec{x})\). Finally, by marginalizing the hidden variables out, we obtain the DRBM: \(P(\varvec{t}\mid \varvec{x}) = \sum _{\varvec{h}}P(\varvec{t}, \varvec{h} \mid \varvec{x})\).

Fig. 5
figure 5

The discriminative restricted Boltzmann machine is obtained as an extension of the RBM. Because the output layer corresponds to the 1-of-K vector, it takes only K different states. For distinction, the couplings between the input and hidden layers are represented by \(\varvec{w}^{(1)}\) and those between the hidden and output layers are represented by \(\varvec{w}^{(2)}\)

By using the proposed RBM instead of the binary-RBM, we obtain an extension of the conventional DRBM, i.e., a DRBM with multivalued hidden variables. The proposed DRBM for \(s \in \mathbb {N}\) is defined as

$$\begin{aligned} P_s(\varvec{t} \mid \varvec{x},\theta )&{:}{=}\frac{1}{ \mathcal {Z} _s(\varvec{x},\theta )} \exp \Big (\sum _{k=1}^K b_kt_k + \sum _{j \in H} \ln \phi _s\big (\zeta _j(\varvec{t},\varvec{x}, \theta ) \big )\Big ), \end{aligned}$$
(22)

where \(\zeta _j(\varvec{t},\varvec{x}, \theta ){:}{=} c_j + \sum _{k=1}^K w_{j,k}^{(2)}t_k + \sum _{i=1}^n w_{i,j}^{(1)}x_i\) and

$$\begin{aligned} \mathcal {Z} _s(\varvec{x},\theta ){:}{=}\sum _{k=1}^K \exp \Big (b_k + \sum _{j \in H} \ln \phi _s\big (\zeta _j(\varvec{1}_k,\varvec{x}, \theta ) \big )\Big ). \end{aligned}$$
(23)

The function \(\phi _s(x)\) appearing in Eqs. (22) and (23) is already defined in Eq. (6). It is noteworthy that when \(s = 1\), Eq. (22) is equivalent to the conventional DRBM proposed in Ref. [8]. Eq. (22) is regarded as the class probability; that is, \(P_s(\varvec{t} = \varvec{1}_k \mid \varvec{x},\theta )\) is the probability that the input \(\varvec{x}\) belongs to class \(C_k\). The input \(\varvec{x}\) should be assigned to the class that gives the maximum class probability.
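The class probability of Eq. (22) is cheap to evaluate because the normalization in Eq. (23) is a sum over only K terms. A sketch (variable names and array shapes are our assumptions):

```python
import numpy as np

def log_phi(lam, s):
    """ln phi_s(x), Eqs. (6)-(7); phi_s is even, so |lam| is used; s=None is s -> inf."""
    lam = np.maximum(np.abs(lam), 1e-8)  # safe near lam = 0 (phi_s(0) = 2)
    if s is None:
        return np.log(2.0 * np.sinh(lam) / lam)
    return np.log(2.0 * np.sinh((s + 1) * lam / s)) - np.log((s + 1) * np.sinh(lam / s))

def class_probs(x, b, c, W1, W2, s):
    """P_s(t = 1_k | x, theta) for k = 1..K, Eq. (22).

    b: (K,) output biases; c: (|H|,) hidden biases;
    W1: (n, |H|) input-hidden couplings; W2: (|H|, K) hidden-output couplings."""
    zeta = c[None, :] + W2.T + (x @ W1)[None, :]   # zeta_j(1_k, x, theta), (K, |H|)
    log_unnorm = b + log_phi(zeta, s).sum(axis=1)  # exponent of Eq. (22) per class
    log_unnorm -= log_unnorm.max()                 # stabilize the normalization
    p = np.exp(log_unnorm)
    return p / p.sum()
```

The predicted class is then simply `np.argmax(class_probs(x, b, c, W1, W2, s)) + 1`.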

Given N supervised training data points, \(D{:}{=}\{( \mathbf {t} ^{(\mu )} , \mathbf {x} ^{(\mu )}) \mid \mu = 1,2,\ldots , N\}\), the log-likelihood function of the proposed DRBM in Eq. (22) is defined as

$$\begin{aligned} l_s^{\dagger }(\theta ){:}{=}\frac{1}{N}\sum _{\mu =1}^N \ln P_s( \mathbf {t} ^{(\mu )} \mid \mathbf {x} ^{(\mu )},\theta ). \end{aligned}$$
(24)

The gradients of the log-likelihood function with respect to the parameters are obtained as follows.

$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial b_k}&=\frac{1}{N}\sum _{\mu = 1}^N \big [ \text {t} _k^{(\mu )} - P_s(\varvec{1}_k \mid \mathbf {x} ^{(\mu )},\theta )\big ], \end{aligned}$$
(25)
$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial c_j}&=\frac{1}{N}\sum _{\mu = 1}^N \big [\psi _s\big (\zeta _j( \mathbf {t} ^{(\mu )}, \mathbf {x} ^{(\mu )}, \theta )\big ) -\langle \psi _s\big (\zeta _j(\varvec{t}, \mathbf {x} ^{(\mu )}, \theta )\big ) \rangle _{\varvec{t}}^{(\mu , s)} \big ], \end{aligned}$$
(26)
$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial w_{i,j}^{(1)}}&=\frac{1}{N}\sum _{\mu = 1}^N \text {x} _i^{(\mu )} \big [\psi _s\big (\zeta _j( \mathbf {t} ^{(\mu )}, \mathbf {x} ^{(\mu )}, \theta )\big ) -\langle \psi _s\big (\zeta _j(\varvec{t}, \mathbf {x} ^{(\mu )}, \theta )\big ) \rangle _{\varvec{t}}^{(\mu , s)}\big ], \end{aligned}$$
(27)
$$\begin{aligned} \frac{\partial l_s^{\dagger }(\theta )}{\partial w_{j,k}^{(2)}}&=\frac{1}{N}\sum _{\mu = 1}^N \psi _s\big (\zeta _j(\varvec{1}_k, \mathbf {x} ^{(\mu )}, \theta )\big ) \big [ \text {t} _k^{(\mu )} - P_s(\varvec{1}_k \mid \mathbf {x} ^{(\mu )},\theta )\big ], \end{aligned}$$
(28)

where \(\langle \cdots \rangle _{\varvec{t}}^{(\mu , s)}\) denotes the expectation defined by

$$\begin{aligned} \langle A(\varvec{t}) \rangle _{\varvec{t}}^{(\mu , s)}{:}{=}\sum _{k=1}^K A(\varvec{1}_k) P_s(\varvec{1}_k \mid \mathbf {x} ^{(\mu )},\theta ). \end{aligned}$$

The function \(\psi _s(x)\) appearing in the above gradients is already defined in Eq. (14). It is noteworthy that the gradients expressed in Eqs. (25)–(28) can be computed without approximation, unlike those of the RBM, owing to the special structure of the DRBM. In the training, we maximize \(l_s^{\dagger }(\theta )\) with respect to \(\theta\) using a gradient ascent method with Eqs. (25)–(28).

3.2 Numerical Experiment Using MNIST Data Set

Fig. 6
figure 6

Misclassification errors against the number of parameter updates (epochs): a training error and b test error. Here, one epoch consists of one full update cycle over the training data set, implying that one epoch involves \(N /B = 10\) updates by the SGA in this case. We used the DRBM with \(s = 1,\infty\). These plots show the average over 120 experiments

In this section, we show the results of a numerical experiment using MNIST. MNIST is a data set of 10 different handwritten digits, \(0, 1, \ldots ,\) and 9, composed of 60,000 training data points and 10,000 test data points. Each data point comprises the input data, a \(28 \times 28\) (8-bit) digit image, and the corresponding target digit label. Therefore, for this data set, we set \(n = 784\) and \(K = 10\). All input images were normalized by dividing by 255 during preprocessing.

We trained the proposed DRBM with \(|H|=200\) using \(N = 1000\) training data points in MNIST and tested it using 10,000 test data points. In the training, we used stochastic gradient ascent (SGA) with a mini-batch size of \(B = 100\) and the AdaMax optimizer [7]. All coupling parameters were initialized by the Xavier method [4], and all bias parameters were initialized to zero. Figure 6 shows the plots of the misclassification rates for (a) the training data set and (b) the test data set versus the number of parameter updates. All input images in the test data set were corrupted by Gaussian noise with \(\sigma = 120\) before the normalization (Footnote 2). We observe that the DRBM with \(s = \infty\) is better in terms of generalization because it shows a higher training error and a lower test error. This indicates that the multivalued hidden variables are also effective in the DRBM.
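The test-time corruption can be sketched as follows; note that clipping the noisy pixels back to \([0, 255]\) is our assumption, as the text does not specify how out-of-range values were handled:

```python
import numpy as np

def corrupt_and_normalize(images_uint8, sigma=120.0, rng=None):
    """Add Gaussian noise (sigma in 8-bit pixel units), then divide by 255.

    Clipping back to [0, 255] before normalization is our assumption."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = images_uint8.astype(float) + rng.normal(0.0, sigma, size=images_uint8.shape)
    return np.clip(noisy, 0.0, 255.0) / 255.0
```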

4 Conclusion

In this paper, we proposed an RBM with multivalued hidden variables, which is a simple extension of the conventional binary-RBM, and showed that the proposed RBM is better than the binary-RBM in terms of the generalization property via numerical experiments on CD learning with artificial data (Sect. 2.2) and a classification problem with MNIST (Sect. 3.2).

It is important to understand why multivalued hidden variables are effective against over-fitting. We provided basic insight into this question by analyzing a simple example in Sect. 2.3. However, practical RBMs are much more complex than that example, so further analysis is needed to clarify the reason; we expect that a mean-field analysis [11] can be used for this purpose. Moreover, a criterion for over-fitting was provided in Ref. [2]; the relationship between this criterion and our multivalued hidden variables is also of interest. These issues will be addressed in our future studies.