Regular Paper

Reconfigurable VLSI implementation for learning vector quantization with on-chip learning circuit


Published 3 March 2016 © 2016 The Japan Society of Applied Physics
Citation: Xiangyu Zhang et al 2016 Jpn. J. Appl. Phys. 55 04EF02. DOI: 10.7567/JJAP.55.04EF02


Abstract

As an alternative to conventional single-instruction-multiple-data (SIMD) mode solutions with massive parallelism for self-organizing-map (SOM) neural network models, this paper reports a memory-based proposal for learning vector quantization (LVQ), a variant of SOM. A dual-mode LVQ system, enabling both on-chip learning and classification, is implemented by using a reconfigurable pipeline with parallel p-word input (R-PPPI) architecture. As a consequence of reusing the R-PPPI for the most severe computational demands in both modes, power dissipation and Si-area consumption can be dramatically reduced in comparison to previous LVQ implementations. In addition, the designed LVQ ASIC has high flexibility with respect to feature-vector dimensionality and reference-vector number, allowing the execution of many different machine-learning applications. The fabricated test chip in 180 nm CMOS with parallel 8-word inputs and 102 kbit of on-chip memory achieves a low power consumption of 66.38 mW (at 75 MHz and 1.8 V) and a high learning speed of $(R + 1) \times \lceil d/8 \rceil + 10$ clock cycles per d-dimensional sample vector, where R is the reference-vector number.


1. Introduction

Visual perception, one of the most advanced human capabilities, is very difficult to achieve in artificial object recognition systems. In contrast, humans can detect and recognize thousands of objects in a scene with little or no conscious effort, despite changes in occlusion, illumination, and object pose. Artificial neural networks (ANNs) are widely applied and very effective for pattern recognition,1,2) function approximation,3) scientific classification,4,5) control,6) and the analysis of time-series data.7) Usually, ANNs consist of units with massive intrinsic vector-parallelism and a large number of interconnections among each other. Hardware ANNs based on conventional single-instruction-multiple-data (SIMD-mode) solutions, whose parallel processing ability helps to achieve the often necessary real-time response, have attracted increasing attention and have already been applied to color image compression,8) computation engines,9) robot locomotion control,10) multilayer perceptrons,11) wind-speed sensorless control,12) olfactory systems,13) real-time object detection,14) and so on.

Self-organizing-map (SOM) neural network models, which were introduced by Willshaw et al.15) and Kohonen,16) have been used in a wide variety of fields such as unsupervised learning tasks,17) data exploration,18) and water resource exploration.19) Learning vector quantization (LVQ), a variant of SOM, has been used extensively in supervised tasks, especially supervised learning and classification. LVQ was introduced by Kohonen20) as a family of intuitive, universal, and efficient multiclass classification algorithms. There have been many applications of LVQ, such as handwriting recognition,21) odor recognition,22) medical biology,23) economic optimization,24) and alertness detection.25)

The learning process of LVQ is intuitively clear, and classification decisions are based on the nearest neighbor search (NNS) among the reference vectors, which are also called neurons. In general, learning in the LVQ algorithm is realized by modifying the reference-vector values according to a distance function and the input-vector matching results, thus representing a process of approximating the theoretical Bayes decision borders.26) The winner reference vector, which is the one most similar to the input vector, is adjusted towards the input vector if their classes are the same. Otherwise, the winner reference vector is moved away from the incorrectly classified input vector. At the beginning of the learning process, the reference vectors are randomly initialized. Then, the input vectors for the learning process are sequentially processed and the values of the reference vectors are continuously updated to increase the LVQ accuracy.

VLSI implementations of LVQ have already been realized as system-on-a-chip platforms,27) specialized digital circuits,28) and analog circuits.29) Although these previous implementations provide massive intrinsic parallelism, the adaptability of VLSI implementations to different applications is still a largely unsolved issue. In this paper, we propose a memory-based LVQ implementation as a dual-mode system, which allows dynamic configuration for on-chip learning and classification. This solution is based on a reconfigurable pipeline with parallel p-word input (R-PPPI) architecture, which has large flexibility to cover many applications, very low power consumption, and high learning speed. Rather than adding a separate circuit for the second mode, which would require more hardware resources, on-chip learning and classification are implemented through reconfiguration of the R-PPPI architecture.

The contents of this paper are organized as follows: Section 2 introduces the basic LVQ (LVQ1) algorithm. Section 3 describes the proposed architecture in detail, with a focus on the reconfiguration mechanism for on-chip learning and classification. Experimental results are analyzed in Sect. 4. Based on these experimental results, Sect. 5 further discusses and compares the execution time of the learning mode. Finally, conclusions are given in Sect. 6.

2. LVQ

LVQ is based on heuristics and has evolved into a popular class of learning algorithms for nearest-reference-based classification, especially multiclass classification. It provides a good balance between the approximation of the classification boundaries and the data representation. The decision boundaries between classes are approximated locally. In this way, LVQ can largely reduce the number of vectors that must be stored and compared against, because it aims at using only a small set of optimized reference vectors instead of all the vectors in the learning database. Therefore, once LVQ is trained for a particular problem, it can produce correct classification results in a very short time.

In LVQ algorithms, unlike many other classification algorithms such as support vector machines (SVMs),30) the typical characteristics of the classes within a data set are represented by reference vectors. This makes LVQ implementations attractive to researchers from fields other than machine learning. Moreover, the learning rate may be constant or decrease monotonically when adapting to different applications, including multiclass classification problems. In most cases, to improve the generalization ability and the representation of variations within a class, each class has multiple reference vectors instead of a single one.

The reference vector closest to the input sample, namely the winner reference vector (ws), is determined according to a distance metric, e.g., the Euclidean distance. In the learning mode, only the weights of the winner reference vector are updated, moving it towards or away from the input sample, so as to efficiently learn the optimized winner-reference positions used in classification. If the classifier agrees with the actual class of an input sample, that is, the winner class-label matches the class-label of the input sample, the so-called winner is moved towards the input sample. Otherwise, the winner is moved away in an attempt to increase the LVQ accuracy. In this paper, the basic LVQ algorithm, namely LVQ1, is analyzed and implemented on an ASIC with on-chip learning and classification capability.

Suppose that a set of R reference vectors $\{ (d_{i},v_{i}),i = 1,2,3, \ldots ,R\} $ is given, where di is a d-dimensional vector in the feature space, vi is its class-label, and v is the number of different categories. To store an LVQ1 classifier, the required space is Θ(Rd). To classify an unlabeled input sample, the required time is O(Rd). In other words, the choice of the reference number R represents a tradeoff between classification accuracy and computational complexity. On the other hand, the class number v should be smaller than the number of references R so that there are one or more references for each class. To simplify the algorithm, the same number of references per class is typically assumed. Suppose that x(t) and ws(t) represent the input and winner reference vector in the discrete-time domain, respectively. Correspondingly, vx and vs are the class labels of the input x(t) and the winner reference vector ws(t). The winner reference vector ws(t) for an input x(t) is determined according to Eq. (1):

$w_{s}(t) = \mathop{\arg \min}_{d_{i},\,i = 1,2,\ldots ,R} D(x(t),d_{i})$ (1)

Here $D(x,d_{i})$ is the distance between the input x and reference vector di. A popular choice for D is the Euclidean distance (DE). The distance calculations in a high-dimensional space or for a large reference-vector number are the main causes of computational complexity. In particular, the squared Euclidean distance ($D_{\text{E}}^{2}$) is preferred over DE, since the root operation has no influence on the distance comparison result but only contributes to an increased computational complexity. In the case of d-dimensional input and reference vectors, $D_{\text{E}}^{2}$ is described by Eq. (2).

$D_{\text{E}}^{2}(x,d_{i}) = \sum_{k = 1}^{d} (x_{k} - d_{i,k})^{2}$ (2)

Furthermore, a(t) is defined as the learning rate. In the learning mode, vx is known. Then, ws(t) is updated to better comply with x(t) according to the modification in Step 3 of the learning process listed below.

Step 1: Randomly initialize the reference vectors for the v classes and set the learning rate a.

Step 2: For one labeled input vector x(t), find the winner reference vector (ws) for this labeled input vector.

Step 3: Update the winner reference vector based on the NNS result to better comply with the labeled input vector x(t). If x(t) and ws(t) belong to the same class, i.e., vx is equal to vs, ws(t) is moved closer to x(t) in order to increase the future classification accuracy, and the new value of the winner vector becomes:

$w_{s}(t + 1) = w_{s}(t) + a(t)[x(t) - w_{s}(t)]$ (3)

In the other case, if x(t) and ws(t) belong to different classes, ws(t) is moved away from x(t) to decrease the probability of incorrect future classifications. The new value of the winner vector becomes:

$w_{s}(t + 1) = w_{s}(t) - a(t)[x(t) - w_{s}(t)]$ (4)

Step 4: Repeat Steps 2 and 3 until a threshold is reached. An often used threshold is a fixed number of learning iterations, which results from the number of labeled input vectors available for the learning process.

In the classification mode, the input vector x(t) is unlabeled and is assigned to a class according to the class label of its winner reference vector, that is, the label vs is assigned to x(t):

$v_{x} = v_{s}, \quad \text{where}\ D(x(t),w_{s}(t)) = \min_{i = 1,\ldots ,R} D(x(t),d_{i})$ (5)
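To make the procedure concrete, the following minimal Python/NumPy sketch implements Steps 2–4 together with the classification rule of Eq. (5), assuming the reference vectors have already been initialized as in Step 1; the function names and the constant learning rate a are illustrative choices and are not part of the hardware design.

```python
import numpy as np

def lvq1_train(samples, labels, refs, ref_labels, a=0.05, epochs=1):
    """LVQ1 learning (Steps 2-4): only the winner reference is updated per sample."""
    for _ in range(epochs):                          # Step 4: fixed number of iterations
        for x, vx in zip(samples, labels):
            dist2 = np.sum((refs - x) ** 2, axis=1)  # squared Euclidean distances, Eq. (2)
            s = np.argmin(dist2)                     # Step 2: winner search, Eq. (1)
            if ref_labels[s] == vx:                  # Step 3: move the winner towards the sample, Eq. (3)
                refs[s] += a * (x - refs[s])
            else:                                    # or away from it, Eq. (4)
                refs[s] -= a * (x - refs[s])
    return refs

def lvq1_classify(x, refs, ref_labels):
    """Classification mode: assign the class label of the winner reference, Eq. (5)."""
    s = np.argmin(np.sum((refs - x) ** 2, axis=1))
    return ref_labels[s]
```

Here refs is an R × d floating-point array of reference vectors and ref_labels holds their class labels; in practice several references per class are initialized from randomly chosen training samples, as described in Step 1.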

3. Realization for on-chip learning and classification

Parallelization of the learning procedure can drastically reduce the training time. The dual-mode system is implemented by the R-PPPI architecture, which switches between the on-chip learning and classification modes of the LVQ.33) In the architecture for the p-parallel module of the R-PPPI shown in Fig. 1, the data path for the two modes is configured according to the signal "L/C".

Fig. 1. R-PPPI architecture for a memory-based LVQ neural network. The same hardware parts are configured to have different functionality in the different operating modes of learning and recognition.

The VLSI realization of LVQ mainly consists of four blocks: the input layer, the competition layer, the winner-takes-all part, and the output layer. In the input layer, a concept of partial vector-component storage ensures flexibility for different vector dimensionalities.27) The d-dimensional input (IN) and reference (REF) vectors are stored in p memory blocks in the form of m ($m = \lceil d/p \rceil $, the smallest integer not less than d/p) partial vector-components. When the partial storage of one vector is finished, the signal "Next" in Fig. 1 is asserted to separate two reference vectors. Signal "Next" is controlled by the addresses of the input- and reference-vector memories.
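As a software illustration of this partial-storage concept (the helper name and the zero-padding policy below are assumptions for the sketch, not taken from the chip specification), a d-dimensional vector can be split into m = ⌈d/p⌉ groups of p words:

```python
import numpy as np

def split_into_partial_components(vec, p=8):
    """Split a d-dimensional vector into m = ceil(d/p) partial vector-components
    of p words each, zero-padding the last group if d is not a multiple of p."""
    d = len(vec)
    m = -(-d // p)                        # ceiling division, m = ceil(d/p)
    padded = np.zeros(m * p, dtype=float)
    padded[:d] = vec
    return padded.reshape(m, p)           # one row per partial vector-component
```

For the 3780-dimensional HOG vectors discussed in Sect. 4 and p = 8, this yields m = 473 groups, with the last group padded by 4 zeros.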

The competition layer is composed of one weight unit and one summation unit and solves the part of LVQ with the highest computational demand. In the learning mode, the sign of a is assigned according to the class-label comparison result between x(t) and ws(t) through signal "C/I" after the NNS has finished. Then, the weight unit computes a[x(t) − ws(t)] and delivers the results to the summation unit to obtain the new component values of the reference vector, as shown in Fig. 2. Finally, the old p reference-vector components are overwritten with the new values in parallel through the data bus. In the classification mode, the weight unit computes $[x - w_{i}]^{2}$ and delivers the results to the distance accumulation adder tree for the $D_{\text{E}}^{2}$ calculation, as illustrated in Fig. 3.

Fig. 2. Example of p-word parallelism for the on-chip learning circuit.
Fig. 3. Example of 16-word parallelism for the distance accumulation adder tree.

Signal "L/C" mainly reconfigures the dataflow through the multiplexers (M2) and thus controls the mode switching between learning and classification. M2 selects one of multiplier inputs for either square calculation of the distance computing or weighted difference calculation of the reference-vector updating. The constant learning-rate factor a is initialized with positive and negative signs at the inputs of M1, which selects +a if the class-labels of x(t) and ws(t) are the same and otherwise −a.

The winner-takes-all part consists of comparators and multiplexers for pipelined distance accumulation and comparison. The intermediate minimum distance is stored in register S5 and compared with the accumulated $D_{\text{E}}^{2}$ in register S4. Signal "Load" is asserted to update the value in S5 with the distance in S4 when S4 < S5. At the end of the classification mode, the class label of the winner reference ws(t) is output and assigned as the class of the input sample x(t).
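A behavioral model of this running-minimum search (again a simplified sketch for illustration) can be written as follows, with S5 holding the intermediate minimum and "Load" asserted whenever the newly accumulated distance is smaller:

```python
def winner_takes_all(distances_and_labels):
    """Sequentially compare accumulated squared distances (register S4) against the
    stored minimum (register S5) and keep the class label of the current winner."""
    s5 = float("inf")          # intermediate minimum distance
    winner_label = None
    for d2, label in distances_and_labels:   # one accumulated D_E^2 per reference
        if d2 < s5:                          # comparator result asserts "Load"
            s5, winner_label = d2, label
    return winner_label
```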

The on-chip learning circuit is implemented with p-word parallelism as shown in Fig. 2. It works as a feedback network that updates the winner reference vector according to the NNS result, corresponding to the back-propagation of the LVQ learning mode. Whether the p-word operations between the winner vector ws(t) and the weighted difference a[x(t) − ws(t)] are additions or subtractions is determined by the sign of the learning rate a.

The block diagram of the distance accumulation adder tree in Fig. 1, which sums the partial $D_{\text{E}}^{2}$ values, is illustrated by an example with a word parallelism of p = 16 in Fig. 3, where ${D_{\text{E}}^{2}}_{n}\ (n = 1,2,3, \ldots ,16)$ are the squared differences of the individual vector components. In this case, 16 component distances are summed in 4 pipeline stages with low delay time. Because of the partial storage concept for input vectors, m partial $D_{\text{E}}^{2}$ values feed into the winner-takes-all part for local minimum-distance searching until signal "Next" is asserted.
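The pairwise reduction performed by the adder tree can be sketched as below (a functional model for illustration, assuming p is a power of 2); for p = 16 it takes exactly 4 stages, matching Fig. 3:

```python
def adder_tree(squared_diffs):
    """Sum p squared component differences in log2(p) pairwise stages."""
    stage = list(squared_diffs)
    while len(stage) > 1:
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]   # partial squared Euclidean distance of one p-word group
```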

Instead of off-chip learning processing, a functional reconfiguration architecture based on the main classification circuit is implemented here for on-chip learning, since NNS processing is the critical path in both the learning and the classification mode. The designed dual-mode system can be reconfigured instantly by the multiplexing switches. Moreover, it is also possible to implement unsupervised online machine learning, which requires the high learning speed provided here.

The memory blocks for the reference vectors (REFs) are implemented as 2-port memories with independent read-only and write-only ports for updating the old components of the reference vectors in the learning procedure. In general, 2-port memories often need special mechanisms to manage read-write conflicts. Due to its inherent characteristics, the R-PPPI architecture has several pipeline delays between reading and writing at the same address, so that read-write conflicts do not occur.

4. Implementation results

A prototype of the LVQ VLSI realization based on the R-PPPI architecture (p = 8) was fabricated in 180 nm CMOS technology, as shown in the photomicrograph of Fig. 4. Since 8-word parallelism and 16-bit precision are chosen in this design, the R-PPPI architecture has a throughput of 128 bits per clock cycle and a pipeline latency of 8 stages. Moreover, the designed LVQ ASIC with on-chip learning and classification, which has a core area of 7.89 mm2, can handle vectors of up to 4096 dimensions. In the classification mode, apart from the 8 clock cycles (106 ns at 75 MHz) of pipeline latency, each d-dimensional (d ≤ 4096) vector can be processed every $\lceil d/8 \rceil $ − 1 clock cycles. Consequently, a large number of different applications can be handled due to the high flexibility in vector dimensionality and reference-vector number. In principle, the designed LVQ on-chip learning and recognition hardware can accommodate any application with feature vectors of up to 4096 dimensions. For example, in the case of 3780-dimensional feature vectors [the histogram of oriented gradients (HOG) feature32) used in pedestrian detection], the partial storage parameter m ($ = \lceil d/p \rceil $) becomes 473, where the 4 unused words in the last partial group of components are simply filled with zeros. In this way, each 3780-d test feature vector can be classified in 473 × R clock cycles, where R, usually below 100, is the reference number (6.3R µs at 75 MHz). Furthermore, the on-chip learning with very high learning speed enables application to online machine learning.
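As a back-of-the-envelope check of these numbers (the variable names and the example reference count R = 100 are illustrative choices), the classification time for the HOG example can be reproduced as follows:

```python
import math

p, f_clk = 8, 75e6          # word parallelism and clock frequency of the test chip
d, R = 3780, 100            # HOG dimensionality and an example reference number

m = math.ceil(d / p)        # partial storage parameter: 473
cycles = m * R              # classification cycles per input vector: 473 * R
print(m, 1e6 * cycles / f_clk)   # -> 473, about 631 us, i.e., roughly 6.3*R us
```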

Fig. 4. Micrograph of the fabricated chip in 180 nm CMOS technology with 8-word parallelism for the PPPI architecture.

A comparison with previous state-of-the-art work is shown in Table I. Even though our chip works at a higher operating frequency and has more shared memory, it has a much smaller power consumption than the previous work in Refs. 27, 34, and 35. For reference vectors with 8 or fewer dimensions, both the recognition speed (Sr) and the learning speed (Sl) per iteration of a reference-vector update are clearly better than those in Ref. 27. In addition, our chip not only can handle applications with more references and higher dimensionality, but also has higher bit precision. When the reference number R is fixed to 256, our chip achieves as many as 2.84 and 2.81 million recognitions per second (MRPS) for 16-d and 32-d reference vectors, respectively. Such processing speed is much higher than that in Refs. 27, 34, and 35. Moreover, for conceptual verification of the developed memory-based LVQ architecture, complex external control units or host PCs are not needed for our test chip.

Table I. Performance comparison.

|                              | SIMD solution 1 34)   | SIMD solution 2 35) | SoC solution 27)     | This work             |
| CMOS technology              | 0.18 µm               | 0.8 µm              | 0.18 µm              | 0.18 µm               |
| Power consumption (mW)       | 630 (50 MHz @ 1.8 V)  | 425 (45 MHz @ 3.3 V)| 214 (25 MHz @ 1.8 V) | 66.38 (75 MHz @ 1.8 V)|
| Storage capability (kbit)    | 4                     | 8                   | 96                   | 102                   |
| Sr (µs)                      | —                     | —                   | 0.28                 | 0.106                 |
| Sl (µs)                      | —                     | —                   | 20.9                 | 1.15                  |
| Throughput (Gbps)            | 1.14                  | —                   | 2.23                 | 9.6                   |
| Max number of references     | 256                   | 16                  | 512                  | 512                   |
| Bit precision                | 12 bit @ 16-d         | 8 bit               | 16 bit               | 16 bit                |
| Max dimension flexibility    | 32                    | 128                 | 1024                 | 4096                  |
| Processor performance (MRPS) | 0.0454 (256-R @ 16-d) | 0.25 (16-R @ 16-d)  | 0.94 (256-R @ 16-d)  | 2.84 (256-R @ 16-d)   |
|                              | 0.0312 (256-R @ 32-d) | —                   | 0.93 (256-R @ 32-d)  | 2.81 (256-R @ 32-d)   |

5. Discussion of on-chip learning

As described above, the on-chip learning procedure is realized by the R-PPPI architecture which has dual-mode capability configurable by the signal "L/C". The learning time in clock cycles can be defined as in Eq. (6), where R is the reference number, $\lceil d/p \rceil $ is the partial storage parameter and d is the vector dimensionality. The parallelism p of R-PPPI is a power of 2, namely $p = 2^{y}$. The number "3" in Eq. (6) represents the pipeline delays of registers S1, S2, and S3. The register S1 separates the memory blocks of the input layer from the subtractors. S2 is located between the subtractors and the multipliers, and S3 is between the multipliers and the adders, which are illustrated in Figs. 2 and 3. The parameter PD is the pipeline depth defined in Eq. (7). In particular, the first "2" is the pipeline delay of S1 and S2, while the pipeline delays due to S4 and S5 are reflected in the second "2" of Eq. (7).

$T_{\text{learn}} = (R + 1) \times \lceil d/p \rceil + 3 + PD$ (6)

$PD = 2 + y + 2$ (7)
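Assuming the reconstruction of Eqs. (6) and (7) above, the learning time can be evaluated with the small helper below (the function name and the example parameters are illustrative choices); for p = 8 it reduces to the $(R + 1) \times \lceil d/8 \rceil + 10$ cycles of the fabricated test chip discussed next.

```python
import math

def learning_cycles(R, d, p=8):
    """Clock cycles for one on-chip learning step, following Eqs. (6) and (7):
    (R + 1) * ceil(d/p) + 3 + PD, with PD = 2 + log2(p) + 2."""
    pd = 2 + int(math.log2(p)) + 2
    return (R + 1) * math.ceil(d / p) + 3 + pd

# Example: 3780-dimensional HOG feature, 100 references, 75 MHz clock
cycles = learning_cycles(100, 3780)
print(cycles, 1e3 * cycles / 75e6)   # -> 47783 cycles, about 0.64 ms per training sample
```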

In the case of the fabricated test chip, which has 8 parallel inputs, an on-chip learning step with one input vector needs $(R + 1) \times \lceil d/8 \rceil + 10$ clock cycles. Indeed, the learning efficiency is improved by a much higher factor than for conventional solutions, even though the reference number R and the vector dimensionality d still have some limited effect. The comparison of the learning efficiency with a general-purpose processor (Intel® Core™ i7) and the SoC solution27) for pedestrian detection with the 3780-dimensional HOG feature is illustrated in Fig. 5, where 2416 positive samples and 12180 negative samples from the INRIA dataset31) are used to train the LVQ references. The learning times for different reference-vector numbers and the speedup factor with respect to the software implementation demonstrate the very high learning efficiency that makes online machine learning possible. By applying a larger-capacity memory, the designed LVQ ASIC can be extended to deal with vectors of much higher dimensionality and larger reference-vector numbers.

Fig. 5. Speedup factor in comparison to a software implementation using a 3.40 GHz Intel® Core™ i7-4770 CPU and an SoC solution27) with a low-power RISC CPU.

As shown in Fig. 5, the hardware implementation remarkably outperforms both the software implementation on a PC with an advanced 3.40 GHz Intel® Core™ i7-4770 CPU and 8 GB of RAM and the SoC solution27) with a low-power RISC CPU. In addition, the larger the number of reference vectors, the larger the speedup factor becomes. When the reference-vector number reaches 1000, the speedup factor is nearly 200. For LVQ algorithms, the accuracy increases with larger numbers of reference vectors. Apart from the much faster learning speed compared with the software implementation, this work also has very high energy efficiency owing to its much lower power dissipation. Although this work has somewhat lower flexibility than a general-purpose CPU, the demonstrated extendibility in vector dimensionality and reference-vector number allows it to handle most real-world applications.

6. Conclusions

In this paper, a memory-based VLSI realization of LVQ neural networks using the R-PPPI architecture was designed for on-chip learning and classification and fabricated in 180 nm CMOS technology. The short learning time and high flexibility improve the applicability to a large number of practical applications. The R-PPPI architecture is verified to execute the dual modes of learning and recognition with very low power dissipation and small Si-area consumption. Moreover, the nearest neighbor search, the part with the highest computational demand in both modes, is solved by the same reconfigurable R-PPPI architecture. The fabricated chip has furthermore demonstrated high learning and classification speed.

Acknowledgments

This research was supported by grant 25420332 from the Ministry of Science and Education, Japan. The VLSI chip was fabricated through the chip fabrication program of VDEC, the University of Tokyo, in collaboration with Rohm, Synopsys, and Cadence. The standard cell library used was developed by the Tamaru/Onodera Laboratory of Kyoto University and released by Professor Kobayashi of the Kyoto Institute of Technology.
