FPGA Implementation of Hardware-Oriented Chaotic Boltzmann Machines

Boltzmann machines (BMs) are useful in various applications but are limited by their requirement to generate random numbers. In contrast, chaotic Boltzmann machines (CBMs) are neural networks that imitate the stochastic behavior of BMs with the chaotic dynamics and deterministic behavior, without random numbers. CBMs can potentially require fewer hardware resources than the original algorithms due to the unnecessity of random number generators. In this study, hardware-oriented algorithms and a differential multiply-accumulation operation are proposed to overcome the difficulties of implementing CBMs on field-programmable gate arrays (FPGAs). A hardware-oriented algorithm for CBMs, which includes fixed-point operations and shift operations, is proposed to reduce hardware resource utilization in the implemented circuits. In particular, the differential multiply-accumulate operation allows us to implement the multiply-accumulate operation with block random access memory and digital signal processors to reduce the consumption of lookup tables and flip-flops in FPGAs without losing the calculation speed. Our proposed approach was evaluated in numerical simulations, logical synthesis, and FPGA implementation. The calculation speed of FPGA-implemented CBMs was compared with software-implemented CBMs, which resulted in 1 / 6,500 of calculation time reduction in a 300-neuron CBM. Moreover, 2,048 neurons of CBM were realized by the logical synthesis. Therefore, the proposed hardware implementation of CBMs was shown to be feasible. The proposed CBMs can solve combinatorial optimization problems at a larger scale with fewer resources.


I. INTRODUCTION
Boltzmann machines (BMs) [1]- [3] are fully connected neural networks used for numerous purposes, such as time-series data prediction and image classification [4]. BMs have applications in various fields, such as image recognition, sound recognition, and natural language processing [5]. In many studies, BMs exceeded conventional algorithms in image processing tasks (such as edge detection, caption generation, and object tracking). A remarkable feature of BMs is that the model can learn the relationships in data, which enables us to build applications that learn generative models of training data, such as automatic music generation [6], [7]. However, BMs have a high computational cost, and graphic processing units (GPUs) are frequently used to accelerate the calcu-The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Asif . lation speed, albeit GPUs require much more power than CPUs. BMs can be run at high speed with less energy by implementing a dedicated circuit into hardware, such as a field-programmable gate array (FPGA) and a complementary metal-oxide-semiconductor (CMOS) chip. Despite the benefits of a hardware implementation of BMs, they consume a significant amount of hardware resources because random number generators must be included in the implemented hardware.
The Ising models [8] were known as an equivalent model of BMs. Ising models are used for solving combinatorial optimization problems, which are often recognized as NPhard and require a tremendous amount of time to solve with Von Neumann architecture computers. Combinatorial optimization problem solvers are recognized as a type of quantum computations [9], [10], and Ising models are generally used for this purpose [11], [12]. Solving combinatorial optimization problems is a critical requirement to solve problems in our society and artificial intelligence (AI). Many hardware implementations of Ising models for accelerating the calculations were reported [13], [14]. However, the necessity of random numbers in Ising models often deteriorates the efficiency of hardware implementations and BMs. To avoid this problem, implementation methods that do not require random numbers [15] and share random number generators [14] were proposed.
In this study, chaotic Boltzmann machines (CBMs) [16] are introduced to solve the problem of the requirement of random numbers in BMs and Ising models. CBMs are neural networks that imitate BMs' stochastic behavior by non-linear and chaotic dynamics. The deterministic behavior of CBMs can be implemented into hardware without random numbers, unlike in Ising models and BMs; therefore, the replacement of Ising models and BMs with CBMs enables us to implement applications with fewer hardware resources than the original models. The number of hardware circuits is reduced, owing to the removal of random number generators. This enables us to enlarge the scale of the network implemented in one FPGA or chip. Hence, larger-scale combinatorial optimization problems can be solved, as the resource utilization has been reduced. CBMs have been implemented in an analog chip, and their efficiency was reported [17], [18].
However, there are difficulties implementing the original algorithm of CBMs, such as the use of floating-point numbers and multiply-accumulate operation (the multiplication is conducted for a binary value and a decimal value in CBMs). In this study, hardware-oriented algorithms and a differential multiply-accumulate operation are proposed to solve those difficulties. These proposals reduce the hardware resource utilization of the implementation circuits of CBMs without a significant impact on accuracy and the calculation time. The proposals were verified with numerical simulations and evaluated by comparing with the conventional implementations. Our experimental results show that the hardware-oriented algorithms and the differential multiply-accumulation can reduce a remarkable amount of hardware resource consumption; as a result, a 2,048 nodefully-connected network was synthesized into a single FPGA. Additionally, the calculation speed of our hardware implementation was compared with software implementation. It was revealed that the hardware implementation accelerated the calculation speed at most 6,500 times as fast as software implementation. Finally, the comparison with related works revealed the ascendancy of this work on the hardware implementation scale; our hardware-implemented CBM has the most edges in the related works.
The remainder of this article is organized as follows. The algorithm of CBMs is explained in Section II. Our proposals (i.e., the hardware-oriented algorithms and the differential multiply-accumulation), designed CBM circuits, and our implemented system are described in Section III. The results of numerical simulations, logical synthesis, and the comparison of software and FPGA-implemented CBM are  presented in Section IV. Additionally, an evaluation of our implementation based on hardware resource consumption and the calculation speed, and the comparison with related studies are described in Section V.

II. CHAOTIC BOLTZMANN MACHINES
BMs are neural networks with stochastic behavior [1]. BMs are composed of neurons whose output changes with the following probability: where P[s i = 1] refers to the probability that the output of ith neuron becomes 1. Moreover, s i ∈ {0, 1}, z i , and T refer to the output value, input value, and temperature of ith neuron, respectively. As shown in the equation, random number generators are required to produce the outputs of neurons stochastically when the BMs are implemented into hardware. Chaotic Boltzmann machines (CBMs) are neural networks that behave deterministically by imitating the stochastic behavior of BMs by the chaotic dynamics [16]. Therefore, random number generators are required for each neuron when BMs are implemented into hardware, owing to their stochastic behavior. In contrast, CBMs do not require random number generators because of their deterministic calculations, which reduces the use of hardware resources compared to BMs. CBMs operate in continuous time and have an additional parameter for each neuron, called an internal state. The neuron's output is determined using this internal state. The behavior of the internal state is described by the following differential equation: where s i refers to the state of the neuron. Fig. 1 illustrates the time change of the internal state and the output. As shown in the figure, the internal state continues oscillating between 0 and 1, and the change of the output occurs accordingly. The time change of the internal state and the output in the case the input value z i is fixed, as shown in Fig. 2. In the figure, α and β are defined as time widths when the output is 1 and 0, respectively. By the incline of x i described in (2), the time widths are described as follows: The time width when the output is 1 in the whole calculation time of the CBM is calculated as which illustrates the ratio matches to the firing probability of BMs' neurons described in (1). For this reason, it is apparent that a CBM is a neural network model that expands the stochastic behavior of BMs into the time axis.

A. HARDWARE-ORIENTED ALGORITHMS
Hardware-oriented algorithms of CBMs are proposed in this study. The algorithms are composed of a fixed-point operation and shift operation, which enables the FPGAimplemented CBMs to reduce hardware resource consumption. Fixed-point numbers are used to replace floating-point numbers. The original algorithm of CBMs assumes the use of floating-point numbers to represent decimal values, which are expressed with a sign bit, significant bits, and exponent bits. There is a disadvantage that implemented circuits of floatingpoint numbers tend to be more complicated than the ones of fixed-point numbers. However, floating-point numbers can represent a wide range of numbers due to the exponential expression; mainly, the phenomena are critical in adder circuits. The operation is that the circuit needs to adjust the digits to two inputs before and after the addition. On the other hand, fixed-point numbers are expressed with integer bits and fraction bits, which do not use exponential expressions. Therefore, the replacement of floating-point numbers with fixed-point numbers enables implementation circuits to be simplified because of the identity for numerical calculation circuits for integer numbers. The simplification of digital calculation circuits of CBMs reduces a significant number of hardware resources for the implementation.
The exponential operation in (2) is replaced with the shift operation, as shown in Fig. 3a. To implement transcendental functions (e.g., sin, cos, and exp) into digital circuits, dedicated circuits for the approximation of those functions are frequently used, such as the table approximation and the coordinate rotation digital computer (CORDIC) [19]. Table approximation, which calculates functions by storing the output values in memory devices, often combined with linear interpolation or quadratic interpolation, consumes an enormous amount of memory resources. Additionally, CORDIC, which expresses the transcendental functions as a combination of addition, subtraction, and table reference, has a disadvantage in iterative operations. In contrast, the shift operation, shown in Fig. 3b, is represented as and it can be implemented with a single barrel shifter circuit. Therefore, by replacing dedicated circuits for the calculation of exponentials, the consumption of hardware resources can be reduced. Moreover, the barrel shifter's calculation speed, which does not have iterative calculations, is faster than that of CORDIC. The proposed hardware-oriented algorithm affects the computing ability of CBMs. The impact was confirmed by numerical simulations in this study.

B. DIFFERENTIAL MULTIPLY-ACCUMULATION
In this study, hardware resource consumption is reduced by time-division. Additionally, the differential multiplyaccumulation is proposed to restrain the increasing calculation speed resulting from the time-division of the multiplyaccumulation. The number of lookup tables (LUTs) and flipflops (FFs) consumption, which increases in proportion to the number of neurons in a network, is reduced to a single digital signal processor (DSP) and a single random access memory (RAM) by the time-division of the calculation. In 204362 VOLUME 8, 2020 our proposal, the increased calculation time due to the timedivision is reduced by the omitted calculation steps, with a focus on the difference between the former output and the present output of a network.
Multiply-accumulation is a common obstruction of most neural network implementations. In operation, the summation of multiplied values of outputs of neurons connected to a neuron, and synaptic weights between the neurons and other neurons, needs to be calculated. In CBMs' calculation, a neuron's output is represented by a binary value (i.e., 0 or 1); therefore, the multiplication can be implemented with a multiplexer instead of a multiplier circuit. However, the number of adder circuits (proportional to the number of neurons) is essential to calculate the multiply-accumulation owing to the fully-connected network topology. Hence, as the size of the network increases, the number of adder circuits expands in proportion to the square number of neurons. To overcome the problem, a method to reduce the number of adder circuits is proposed as the differential multiply-accumulation.
In the conventional multiply-accumulation, the input of ith neuron z i in the network is described as where N , b i , s j , and w ij , respectively, represent the number of neurons in the neural network, the bias of ith neuron, the output of jth neuron, and the synaptic weight value w ij between ith neuron and jth neuron. In the adder circuits of the multiply-accumulation circuit, adder circuits with two inputs are arranged into a tree structure, as shown in Fig. 4a. The circuit has a disadvantage that the number of adder circuits increases in proportion to the number of neurons. Fig. 4c shows a part of a network circuit implemented with the conventional multiply-accumulation. In this network circuit, neuron circuits are connected via synapse circuits, whose registers hold the synaptic weight. The synapse circuit outputs the weight value when the input value of the circuit is 1; otherwise, 0 is output. The number of synapse circuits and adders required to implement the network increases in proportion to the square number of neurons. The time-division is introduced to replace the conventional multiply-accumulation. The synapse circuits and the adder circuits in Fig. 4c are replaced with a single block RAM and a single embedded DSP in FPGAs. In the time-division method, the input value of ith neuron is defined as where z i , τ ∈ [1, N ], and A τ i represent the input value, iterator of the time-division, and accumulated value at time τ , respectively. The multiply-accumulation circuit with timedivided addition is constructed as shown in Fig. 4b. The addition circuit holds the accumulated value with a register and adds the input value into the accumulated value sequentially. Fig. 4d shows a part of the network circuit implemented with the time-divided multiply-accumulation. In the circuit, the iteration of the time-division is implemented as two multiplexers whose select signal is τ . Next, the output of neuron s τ and the corresponding synaptic weight w iτ are input to the addition circuit one by one.
The behavior of the network circuit implemented for our proposal is shown in Fig. 5. The network's weight matrix is expressed by an array of RAMs in the left side of the figure.
Notably, the diagonal elements of the weight matrix are used to store biases of neurons, to reduce extra resources to store the biases, and this architecture is reflected in (9). As shown in the figure, multiplied values of outputs in the networks and synaptic weights are accumulated as τ iterates from 1 to N . However, the time-division of multiply-accumulation increases the calculation time in proportion to the number of neurons.
The differential multiply-accumulation is proposed to overcome the problem of increased calculation time. In the neuron circuit with the conventional multiply-accumulation, as shown in Fig. 6a, the output value of neuron circuits, s i , and synaptic weight, w ij , circuits are input to the neuron circuits one by one. Therefore, these iterations in calculations increase processing time. On the other hand, the proposed differential multiply-accumulation improves the calculation speed, as shown in Fig. 6b. In the circuit, the iteration covers the neurons whose output changed. The former input of a neuron z t−1 i is updated as the current input z t i with the difference between the former output of the network s t−1 and the present one s t . The calculation is described by the following equations: where t, z t i , and d t j represent time, the input of ith neuron at time t, and the difference between the former output of jth neuron s t−1 j and the present output for s t j , respectively. The equation indicates that the first multiply-accumulation is the same as in the original calculation (8). However, iterator j is replaced with j , which only covers the neurons whose output has changed. The difference value d t j replaces s t j to update the present input value z t i . The replacement of iterator j with j reduces the number of iterations significantly because every neuron changes its output twice in a period, as shown    in Fig. 2. The low-frequent output change of neurons of CBMs increases the efficiency of the differential multiplyaccumulation.

C. DESIGNED CIRCUITS
FPGA circuits are designed to compare hardware resource consumptions of the conventional method and our proposal. First, neuron circuits that use both the conventional and our hardware-oriented algorithms are designed to evaluate our proposed approach. Second, a conventional network circuit and a network circuit that uses our differential multiplyaccumulation are designed to compare hardware resource usage. Note that both the conventional and the proposed network circuit have our hardware-oriented algorithm in their neuron circuits.
A neuron circuit designed using the conventional method (i.e., floating-point numbers and exponential operations) is shown in Fig. 7. IP cores produced by Xilinx are used as floating-point operation circuits, which include an exponential calculator, adders, multipliers, and comparators, and are represented by the broken lines in the figure. Double lines in the figure show handshake-protocol bus lines (i.e., AXI Stream), which are used to connect Xilinx IP cores. The fork module is used for distributing data and is required to arbitrate control signals. The former internal state and the former output of ith neuron are represented as x i and s i , respectively. Moreover, x i and s i refer to the current internal state and the current output, respectively. Operation sign changes the sign of input of a neuron depending on the output of the neuron as The calculation steps of the neuron circuit are as follows: 1) Calculate z i /T by the multiply-accumulation and multiplying the inverted temperature 1/T . 2) Calculate dx i by multiplying the right-hand side of (2) and dt. 3) Calculate x i + dx i by adding dx i to x i and calculating x i and s i as  Clip x i to a value between 0 and 1. 4) Update x i and s i with x i and s i to use them in the next calculation of the neuron. 5) Output s i .
A neuron circuit is designed using our hardware-oriented algorithm (i.e., fixed-point operation and shift operation), and its structure is shown in Fig. 8. In the circuit, four types of fixed-point operation circuits are included: a shift operator, adders, multipliers, and comparators. Dashed lines' input to two multiplexers represents the overflow signal of fixed-point operation and the transition signal of the neuron's output. The former output and the former internal state of the ith neuron are denoted as x i and s i , respectively. Moreover, x i and s i refer to the current internal state and the current output, respectively. In the designed circuit, the CBM algorithm is modified as shown in Fig. 9a. Next, s i can be updated by the carry bit of x i when the bits of s i and x i are connected and dx i are added, as shown in Fig. 9b. The calculation steps of the neuron circuit are as follows: 1) Calculate z i /T by the multiply-accumulation and multiplying the inverted temperature 1/T . 2) Calculate dx i by multiplying the right-hand side of (2) and dt, monitoring overflows.
, if the overflow has occurred in this step (the output needs to be changed immediately, not to delay the timing of the output change); • {s i , 0} + dx i , if the transition has occurred (this condition means that output s i was changed in the update, and x i needs to be reset to 0). Here, {a, b} means the bit connection of bit vectors a and b. In addition, a refers the bit complementation of a bit a.
The network circuit designed using the conventional multiply-accumulation has the structure shown in Fig. 10a. This circuit is composed of N neuron circuits and N (N −1)/2 synapse circuits, and all neuron circuits are interconnected through synapse circuits. A synapse circuit that connects two neurons is composed of registers that store a synaptic weight and two multiplexers, which express multiplications s i w i and s j w i , as shown in Fig. 10b. A neuron circuit is shown in Fig. 10c; the neuron circuit included adders to conduct multiply-accumulation and a multiplier beside a calculation circuit of dx i /dt. The synapse circuit and the multiply-accumulation of the neuron circuit are implemented with LUTs and registers.
The network circuit designed using the proposed method is configured as shown in Fig. 11a. This circuit consists of N neuron circuits, N synapse circuits, and a network controller circuit. A synapse circuit that stores synaptic weights for a neuron can be represented by a single block RAM of FPGAs, shown in Fig. 11b. Its input is τ in Fig. 4d, and its output is a bias or synaptic weight. The multiply-accumulation in the neuron circuit of the proposed network circuit is expressed as a DSP block of FPGAs, as shown in Fig. 11c. A DSP block of Xilinx's products, such as DSP48E, has a multiplexer and an arithmetic and logic unit (ALU) that can be used as an accumulator. Inputs of the neuron circuit are the bias or synaptic weight, the inverted temperature, and control signals. The control signals are generated in the network controller circuit using the value of τ and the number of iterations, calculated as follows: initialize Put 0 to the accumulated value by (9). enable Put 0 as the input value when the bias is input after the second iteration because the bias value changes at the first iteration by (10). negate Control whether the synaptic weight is added or subtracted by (11).

D. IMPLEMENTED SYSTEM
An FPGA-implemented CBM is controlled by software in the host PC in this system, and an FPGA and the host PC are connected by a PCIe interface. A Xillybus [20] IP core is used for the communication between the PC and the FPGA. Synaptic weights, biases, initial outputs of neurons, parameters (such as the frequency of annealing, initial temperature, and number of iterations), and control signals are input from the host PC.
The FPGA outputs the output of the network and its energy and the status of the circuit. As shown in Fig. 12, the FPGAimplemented CBM is composed of a network circuit and the following circuits: scheduler Decide the timing of annealing.   controller Control the calculation in the network. temperature generator Calculate temperature values for the network. In the network circuit, all neuron circuits and synapse circuits are connected to the host PC via multiplexers and demultiplexers and are accessed directly by the PC.

A. NUMERICAL SIMULATION
Numerical simulations were conducted to evaluate the hardware-oriented algorithm. For the simulations, conventional CBMs (which have algorithms with exponential operations and floating-point numbers) and the proposed CBMs (which are implemented with shift operations and fixed-point numbers) are implemented as software in the environment shown in Table 1. The Euler method was used to update the internal states and outputs of the neurons of CBMs. Subsequently, the parameters shown in Table 2 were used, and the annealing was conducted every N /128 of time in the Euler method because CBM is a neural network that runs in continuous time, as illustrated in (2).
The maximum-cut problem can be solved by CBMs whose synaptic weights and biases are set as follows: VOLUME 8, 2020 where w ij , b i , and d ij refer to the weight value between ith and jth neurons, the bias value of ith neuron of the CBMs, and the edge weight between ith node and jth node of the problems, respectively [16]. Temperature, T , is decreased gradually from the initial value to a sufficiently low value when running CBMs. Solutions of maximum-cut problems are obtained as the combination of outputs of neurons in a CBM network when T becomes sufficiently low. The computing ability of CBMs implemented with the proposed method has been measured with the maximum cut problem. Synaptic weights and biases were set according to (13). CBMs were run while the temperature decreases from the initial value to a sufficiently low value. Error rates were calculated as follows: where E min and E opt are defined as the minimum energy value acquired from CBMs and the optimized energy value of problems, respectively. In the numerical simulation, CBMs were applied ten times for each of the six different maxcut problems in Biq Mac Library [21], shown in Table 3, which has 100, 150, 200, 250, and 300 nodes with different initial values. The average and minimum error rates for each problem size were calculated. First, two kinds of CBMs that use 64 bits of floating-point numbers and 24 bits of fixed-point numbers were applied to the max-cut problems for verifying the computing ability of CBMs implemented with fixed-point numbers. Fig. 13a  and 13b show the results. A significant difference was not identified between the use of floating-point numbers and fixed-point numbers for any number of nodes of problems, although some variations depended on the number of nodes.
Second, the computing ability of CBMs implemented with the shift operation was verified by applying CBMs implemented with the exponential operation and the shift operation to the max-cut problems. Figs. 14a and 14b show the result. These figures indicate that the computational ability of CBMs implemented with the shift operation is slightly inferior to CBMs implemented with the exponential operation; however, the difference in computational ability is only a few points. Consequently, the replacement of the exponential operation with the shift operation does not significantly impact the computational ability of CBMs.
Finally, CBMs whose bit widths of fixed-point numbers are 8, 12, 16, 20, and 24 were applied to the max-cut problems to evaluate the correlation between the computational ability   of CBMs and the bit width of fixed-point numbers. In the experiment, 2 bit width of fixed-point numbers/2 is used as the step size of the time of the Euler method because the stem size is limited by the bit width of the fixed-point numbers. Figs. 15a and 15b illustrate the results, which demonstrate that the error rate increases as the bit width decreases. The results indicate that the error rate could be maintained at approxi-mately 10% if the bit width of fixed-point numbers is more than 16 bits.
The results of the numerical simulation revealed that the hardware-oriented algorithm impacted the computing ability of CBMs. However, a few points of difference in the error rate between the original algorithm and the hardware-oriented algorithm remained. The sufficiently     small difference indicates the feasibility of the proposed hardware-oriented algorithm.

B. LOGICAL SYNTHESIS
The resource utilization of the circuits designed with conventional algorithms and hardware-oriented algorithms were compared to verify the resource utilization of a neuron circuit. The neuron circuits of the network, whose number of neurons are 4, 8, 16, 32, and 64 were designed and synthesized using logical synthesis. The number of LUTs, FFs, DSPs, and RAMs used in the designed circuit was measured from the results of the synthesis. The synthesis environment is shown in Table 4. The bit width of the synaptic weight, bit width of the fixed-point number of the neuron circuit, and step size of the time were set to 16, 16, and 2 −8 , respectively. Fig. 16 shows the results. In the neuron circuit that used the hardware-oriented algorithm, the number of resources used for LUT and FF is reduced for any number of neurons. According to the results, the resources can be reduced to   nearly 1/30, and the amount of DSP consumed in the neuron circuit designed using the conventional algorithm can be reduced. Note that block RAMs are not used in the designs of the conventional and the proposed methods.
The resources of the network circuit designed with conventional multiply-accumulate operations and differential multiply-accumulate operations were compared to validate the differential multiply-accumulate operation. The network circuits with 4, 8, 16, 32, and 64 neurons were designed and synthesized by logical synthesis. The number of LUTs, FFs, DSPs, and RAMs used in the designed circuit was measured from the synthesis results. The logical synthesis environment is shown in Table 5. The neuron circuit was designed using the hardware-oriented algorithm. Fig. 17 shows the results for this step. In the network circuits using differential multiply-accumulate operations, the increase in the number of resources used for LUTs and FFs, for the number of neurons, was smaller than in the network circuits designed using conventional multiplyaccumulate operation. Moreover, in the network circuit with 64 neurons, the number of resources used for LUT and FF was reduced to approximately 1/8 and approximately 1/70, respectively. Note that the numbers of DSPs on the conven- tional network and the proposed network were the same. Block RAMs were not used in the conventional network.
The proposed network was synthesized for an accelerator card. The scale of the system, which can be implemented into an accelerator card, was measured. Fig. 6 shows the synthesis environment for the card. The experiment's condition was the same as in the previous experiment.
The results are shown in Table 7 and Fig. 18. The results indicate that resource utilization increases polynomially as the number of neurons increases, and the usage of block RAMs dominates other hardware resources, such as FF, LUT, and DSP in the device. As a result of the synthesis, 2, 048 neurons in the network could be implemented in the accelerator card device.
The results of the logical synthesis show the efficiency of the hardware-oriented algorithm and the differential multiply-accumulation. Therefore, these methods reduce the hardware resource utilization drastically.

C. FPGA IMPLEMENTATION
A CBM circuit was implemented in the environment shown in Table 8. The implemented circuit is shown in Fig. 12. The VOLUME 8, 2020   FPGA-implemented CBM was compared with the softwareimplemented CBM in terms of calculation ability and speed.
The computing ability of CBMs implemented into software and hardware was compared for the verification of  FPGA-implemented CBM. The CBMs were applied to 100, 150, 200, 250, and 300 nodes of the maximum cut problems. The average error rate was measured as in numerical simulations (i.e., applying it ten times for each six different max-cut problems in Biq Mac Library). In this experiment, the inverted value of temperature was used instead of temperature because the FPGA-implemented CBM needs to input the inverted value of temperature to the neuron circuit. In the experiment, annealing was performed every N /128 time period in the Euler method, using the values shown in Table 9. To align the conditions, the same synaptic weights and initial outputs of neurons as in the FPGAimplemented CBM were input to the software-implemented CBM. Note that the software-implemented CBM does not use the hardware-oriented algorithm to compare the calculation abilities of the original CBM and the FPGA-implemented CBM. Fig. 19b shows the experimental results. The error rates of FPGA-implemented CBMs are found to be slightly higher than those of software-implemented CBMs. However, it was revealed that the change in error rate due to hardware implementation was less than two points. Moreover, the time change of the energy function of the software and FPGAimplemented CBM when applied to the problem ''ising3.0-100_5555'' is as shown in Fig. 19a and Fig. 19c, respectively. The time on the horizontal axis represents the Euler method's elapsed time, and a broken line indicates the optimum value of the applied problem. These figures indicate that the energy functions of all CBMs change similarly. As time passes, the value of the energy function approaches the optimum value as the temperature decreases.
The calculation speed of the software-implemented CBM and two kinds of FPGA-implemented CBM (i.e., a CBM with (8) as the multiply-accumulation and a CBM with (10) as the multiply-accumulation) were compared to evaluate the calculation speed of CBMs. The calculation time of FPGA-implemented CBM and the software-implemented CBM were measured. The average values of 60 iterations for each number of neurons were calculated. Fig. 19d shows the experimental results of this step. The results show that the calculation speed was significantly improved by implementing a CBM to an FPGA, and the differential multiply-accumulation contributed to improving the calculation speed. The calculation speed was reduced to at least 1/1, 300, and the calculation speed was reduced to 1/6, 500 in the case of 300 neurons.
Therefore, the results of the FPGA implementation indicate the effectiveness of the FPGA-implemented CBM. The FPGA-implemented CBM can perform much faster calculations than the software-implemented CBM.

A. RESOURCES CONSUMPTION
The results of this study demonstrated that our proposed method reduces resource consumption. By using the hardware-oriented algorithm, the consumption of LUTs and FFs was reduced to nearly 1/30, as indicated in Fig. 16. The results imply that the calculation circuits on the neuron circuits are simplified using fixed-point numbers and the shift operations instead of floating-point numbers and the exponential operation; notably, the utilization of DSPs due to floating-point numbers' high bit precision and calculations of exponential bits is decreased. The multiply-accumulate operation also contributed to reducing resource consumption; the consumption of LUT and FF was reduced to approximately 1/8 and 1/70, respectively, in the network circuit with 64 neurons, as shown in Fig. 17. The results indicate that the increase in LUTs and registers in proportion to the square number of neurons in the network is replaced with the number of DSPs and RAMs in proportion to the number of neurons in our proposed approach. The maximum size of the network synthesized in this study was 2, 048, as shown in Fig. 18. According to this result, the number of block RAMs in the FPGA limits the size of the network due to the dominating utilization of RAMs. It may be possible to further increase the size of the network by reducing the bit precision and assigning multiple weight values in the same column of a RAM. Moreover, reduced LUT consumption indicates that more complicated neural networks can be implemented with differential multiply-accumulation. Fig. 20a demonstrates the calculation efficiency of the FPGAimplemented CBM (differential multiply-accumulation) compared with the software-implemented CBM. The results show that the calculation time of software implementation was reduced almost linearly with the number of neurons. The FPGA implementation with the differential multiplyaccumulation could improve the calculation speed by at least 1, 300 times. In the case of 300 neurons, the calculation speed was 6, 500 times faster. Fig. 20b demonstrates the calculation efficiency of FPGA-implemented CBMs (non-differential multiplyaccumulation) and FPGA-implemented CBMs (differential multiply-accumulation). The figure indicates that the calculation time improves as the number of neurons in the network increases.

B. CALCULATION SPEED
The calculation of the FPGA-implemented CBM comprises a multiply-accumulation part and neuron updating part. The differential multiply-accumulation reduces the calculation time of the multiply-accumulation part. In the non-differential multiply-accumulation, the multiplyaccumulation part requires an increased calculation time that is proportional to the number of neurons. On the contrary, the differential multiply-accumulation calculation time depends on the number of neurons that change the output in the time step. Fig. 2 shows that one neuron changes its output twice (i.e., 0 to 1 and 1 to 0) in one period of the output of neurons if the input of the neuron does not change. The period can be roughly considered as 1 = (α+β) on average regardless of the duration of the period depends on the input of the neuron. Hence, it could be considered that 2N /n t of output changes occur in a time step on average when the period is divided into n t time steps, where N refers to the number of neurons in the network (n t = 2 12 = 4, 096 in Fig. 20b). It indicates that the differential-multiply accumulation requires 2N /n t iterations for a single update of the network; meanwhile, the non-  differential multiply-accumulation requires N iterations for every update of the network.

C. RELATED WORK
CBMs are equivalent to Ising models; therefore, CBMs are also adaptable to combinatorial optimization problems. Many studies were related to Ising models [25], [26] and their hardware implementations [27]- [29] in recent years because many problems in our society can be expressed as combinatorial optimization problems. Table 10 compares the results of our implementation and related studies. The table shows that the number of edges in our study and the bit precision of edges is improved compared with other studies. The increased number of edges and bit precision implies fewer limitations on the mapping of combinatorial optimization problems; for instance, the traveling salesman problem (TSP) requires the square number of nodes to be equal to the number of cities in the network, and a sufficient amount of precision is required to express the constraints and the distance between cities in the problem. Note that the number of edges inside parentheses is calculated in this study, as shown in 11 (an edge between two nodes in graphs on the table is assumed to be bidirectional, and the grid of the lattice graph is assumed to be connected like a torus).
Implementations use three kinds of topologies: the complete graph, chimera graph [14], and lattice graph, as shown in Fig. 21. In a complete graph, every pair of nodes is connected by an edge; therefore, the complete graph includes   other types of graphs, such as the chimera graph and lattice graph. The lattice graph is a graph whose nodes are arranged on intersections of a grid, and the nodes are connected to the adjacent four nodes. The chimera graph has four nodes of complete graphs as its subgraph, the subgraphs are arranged as the king's graph [30], and the nodes are connected to the adjacent eleven nodes. Table 11 indicates that the number of edges depending on the number of nodes in the complete graph is more than in the other graph topologies. However, the lattice graph and the chimera graph are often implemented into hardware because their connectivities are limited to the adjacent nodes; therefore, these topologies are easily accommodated into hardware that have a two-dimensional structure. In particular, the chimera graph can increase the number of connectivities for each node by limiting the connectivities to the adjacent nodes. The problem of increasing the number of edges needs to be solved to implement the complete graph topology into hardware. In our work, block RAMs in FPGAs were used to address this problem as well as a study by Yamamoto [23] that used static random access memories SRAMs. Nevertheless, it is still possible to represent a complete graph with chimera graphs by a minor embedding [31]; however, the number of edges needs to be more than the target complete graph, to represent the complete target graph by combining nodes of the original graph.
CBMs do not require random numbers in their calculation, whereas the Ising model needs to use random numbers so that it is not tricked into local optima. Random number generators require significant hardware resources; in the case of Yamamoto [23], random number generators consumed 11 percent of hardware resources in the implemented circuit.
Yoshimura [14] proposed random pulse sharing to reduce the hardware resource utilization of random number generators; then, a random number generator was shared among multiple calculation units not updated at the same calculation step, nearby. However, hardware resources are still consumed by random number generators in these implementations [22], [24]. In contrast, Gyoten proposed shift-register-based spin flipper (SRSF) [15], [32]: an implementation method that enables implementation circuits of the Ising model to eliminate random number generators. In our work, random numbers generators are eradicated by the deterministic algorithm of CBMs.
To increase the number of combinatorial optimization problems that can be mapped to the network, a massive number of connectivities need to be stored in hardware such as an FPGA. Moreover, the bit precision requires to be sufficiently big to express the problems. Block RAMs in FPGAs may be used to achieve this owing to their overpowering capacity, compared to LUTs. However, values of connectivities stored in RAMs need to be accessed one by one, when referenced by the address signal. Our implementation addressed this limitation by using the time-division of the multiply-accumulation and suppressed the increasing calculation time by the differential multiply-accumulation. Additionally, the time-division enabled us to use DSPs on the multiply-accumulation. There is a similarity between this study and related research. Yamamoto [24], [33] proposed a time-division multiplexing architecture (TDM), which divided the network logically (called ''phase'') and updated spins phase by phase, then values of connectivities could be stored in RAMs and accessed one by one. However, the calculation time increased as the number of phases increased because of the time-division of the calculation. Yamamoto [23] could address the difficulty by the delta-driven simultaneous spin update (DDSS) that calculates updates inputs of spins by the difference from the prior inputs, as in our implementation.
The comparison shows CBMs' superiority because random numbers are not required in the hardware implementation. Moreover, our implementation's advantage is also revealed: the differential multiply-accumulation enables the utilization of block RAMs and DSPs in FPGAs. This advantage produced an FPGA implementation of the largest completegraph network and the biggest bit precision compared with other studies.

VI. CONCLUSION
In this study, a CBM was implemented in an FPGA using the proposed hardware implementation methods VOLUME 8, 2020 (i.e., the hardware-oriented algorithms and the differential multiply-accumulation). The results of the numerical simulations indicated that the hardware-oriented algorithms were feasible. The results of the logical synthesis revealed that our proposal drastically reduced the consumption of hardware resources. Moreover, the results of the FPGA implementation show that the FPGA-implemented CBM can be applied to combinatorial optimization problems (such as the maxcut problems) and complete computations quicker than the software implementation.
The comparison of our study and previous research indicates the superiority of our implementation. The non-linear and chaotic dynamics of CBMs reduced hardware resource utilization. The network in our implementation has more edges than any other related implementation, which indicates that our implementation has fewer limitations in the mapping of combinatorial optimization problems.
The FPGA-implemented CBM proposed in this study could be applied to more practical combinatorial optimization problems in our future work. Additionally, the scale of the implementation could be expanded by improving the FPGA implementation of the differential-multiply accumulation.