A Novel MPEG Audio Degrouping Algorithm and Its Architecture Design

Degrouping is a key component in MPEG Layer II audio decoding. It mainly consists of division and modulo arithmetic operations. So far, no dedicated degrouping algorithm and architecture has been well realized. In this paper, we propose a novel degrouping algorithm and its architecture design with low complexity as the main design consideration. Our approach relies only on addition and subtraction instead of the division and modulo arithmetic operations. With this technique, it achieves an equivalent result without any loss of accuracy. The proposed design requires no multiplier, divider, or ROM table, and thus reduces the design complexity and chip area. In addition, it does not need any programming effort on numerical analysis. The results show that the design is simple and low cost. Furthermore, it achieves high efficiency at a fixed throughput of only one clock cycle per sample. The VLSI implementation result indicates that the gate count is only 527.


Introduction
The MPEG audio coding standard is the international standard for the compression of digital audio signals [1]. It can be applied to both audiovisual and audio-only applications to significantly reduce the requirements on transmission bandwidth and data storage with low distortion. The second phase of MPEG, labeled MPEG-II, supports all the normative features listed in MPEG-I audio and provides extensions for multichannel and multilingual audio as well as for lower sampling frequencies and lower bit rates [2,3]. Besides, Advanced Audio Coding (AAC) is an international standard that was first created in MPEG-II AAC and forms the base of MPEG-IV general audio coding [4].
The MPEG audio compression standard also defines three layers of compression, named Layer I, II, and III. Each successive layer offers better compression performance, but at a higher complexity and computation cost. Layers I and II are similar and are both based on subband coding. They differ mainly in the formation of side information, and Layer II provides a finer quantization. Layer III is well known and popularly named MP3. It adopts more complex schemes such as a hybrid filterbank, Huffman coding, and nonlinear quantization. From the viewpoint of hardware complexity and achieved quality, Layer II might be a reasonable compromise for general usage. In the official ISO/MPEG subjective tests, the Layer II codec shows an excellent performance of CD quality at 128 kbps per monophonic channel [5]. It has also been adopted in the Digital Audio Broadcasting (DAB) standard.
Within Layer II decoding, degrouping is the key component that recovers the samples from a more compressed codeword. The degrouping module is quite special compared with other popular compression techniques, such as subband or Huffman decoding. Although subband decoding is computation-intensive, it can be improved efficiently at either the algorithm or the architecture level [6,7]. However, as will be described in more detail below, the arithmetic operations for degrouping mainly consist of division and modulo. Unfortunately, the degrouping operation occurs only in Layer II decoding. Even in a higher layer, Layer III (MP3), the degrouping is reorganized and recombined in Huffman decoding to eliminate the division and modulo computation. As a recent trend, universal MPEG audio decoders that support multiple standards are widely developed and applied in many multimedia and communication devices [8,9]. They improved the common and regular module, the synthesis subband filter. However, they left the other nonregular modules unresolved. In fact, degrouping is a mandatory module whether the target design is Layer II only or a multistandard decoder. In the conventional methods, a general purpose CPU, DSP, or ASP (audio signal processor) usually provides division or modulo instructions to execute the arithmetic operations of degrouping [10][11][12]. These designs either employ a divider directly, or use a multiplier by finding the inverse of the divisor and multiplying it by the dividend. In fact, such numerical analysis methods burden general purpose processors, especially the low-end ones that are initially chosen to play a simple role as a parser or controller. Even for some high-end processors, supporting an additional instruction set for division or modulo is also an overhead.
Consequently, these approaches increase the hardware complexity and the chip area. Several techniques used a ROM-based table lookup to replace the multiplier [13,14]. However, the ROM circuit grows exponentially with the dimension of the finite field. Although many fast algorithms for computing the division and modulo arithmetic operations have been presented throughout the years [15][16][17], these techniques cannot be completely adopted in the MPEG degrouping algorithm. One concern is that these previous methods mainly focused on generating the modulo result only; the quotient is useless for their needs. In degrouping, however, the quotient cannot be skipped because it represents the codeword for the next iteration. So far, no dedicated degrouping algorithm and architecture has been investigated.
Table 1 (fragment): for 9-level quantization, V = 81z + 9y + x, with codeword range 0-728 in 10 bits.
In this paper, we propose a novel MPEG degrouping algorithm and its architecture design. It is built on a design concept quite different from all the reference works. Our approach relies only on addition and subtraction instead of the traditional division and modulo arithmetic operations, without any loss of accuracy. It eliminates the iterative division computation of the original algorithm. Based on the proposed algorithm, no multiplier, divider, or ROM table is needed. The design is simple and low cost, and achieves high efficiency with a fixed throughput. It occupies only 527 gates with an 8.35 ns propagation delay. With this easy-to-use and compact design, it is suitable for integration as an Intellectual Property (IP) in the System-on-Chip (SOC) design trend.

MPEG Degrouping Process
The overall MPEG decoding flow chart is described in Figure 1. It includes some major functional blocks: decoding of side information, requantization, and synthesis subband filter bank. Figure 1 also shows a further decomposition of requantization of samples in Layer II application, where degrouping represents an essential component. We describe the grouping and degrouping process in more detail below.

Grouping.
In the MPEG audio encoder, the samples are quantized according to the number of steps given by the bit allocation. The further compression feature in Layer II allows two new quantizations, namely, 5-level and 9-level. For these new quantizations plus the former 3-level quantization, sample grouped coding is used. If grouping is required, three consecutive samples are coded as one codeword. Only one value $V_j$ is transmitted for this triplet. For 3-, 5-, and 9-level quantization, a triplet is coded using a 5-, 7-, or 10-bit codeword, respectively. The relationships between the coded value $V_j$ ($j = 3, 5, 9$) and the three consecutive subband samples $x$, $y$, $z$ are listed in Table 1.
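As an illustration of the grouping step, the following minimal Python sketch packs a triplet into one codeword. It assumes the positional form V = nlevels^2 * z + nlevels * y + x for all three quantizations, which agrees with the 9-level relation V = 81z + 9y + x; the function name `group_triplet` and the `CODEWORD_BITS` table are hypothetical names used only for this sketch.

```python
# Codeword widths stated in the text: 5, 7, and 10 bits for 3-, 5-, and 9-level quantization.
CODEWORD_BITS = {3: 5, 5: 7, 9: 10}


def group_triplet(x, y, z, nlevels):
    """Pack three quantized samples (each in 0..nlevels-1) into one codeword.

    Assumed relation: V = nlevels^2 * z + nlevels * y + x.
    """
    if nlevels not in CODEWORD_BITS:
        raise ValueError("grouping is only used for 3, 5, or 9 levels")
    for sample in (x, y, z):
        if not 0 <= sample < nlevels:
            raise ValueError("sample out of range for this quantization")
    return nlevels * nlevels * z + nlevels * y + x


# The largest codeword of each mode must fit into the stated codeword width.
for nlevels, bits in CODEWORD_BITS.items():
    largest = group_triplet(nlevels - 1, nlevels - 1, nlevels - 1, nlevels)
    assert largest < (1 << bits)
    print(f"{nlevels}-level: codewords 0..{largest} fit in {bits} bits")
```

The loop confirms that the largest codewords of the three modes (26, 124, and 728) fit into 5, 7, and 10 bits.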
To make the benefits of grouping clear, Figure 2 illustrates examples of the three modes. In mode 1, a 5-bit codeword is grouped and it actually represents three 2-bit samples. Consequently, one bit is saved without any loss of data or precision. Similarly, mode 2 saves two bits, because a 7-bit codeword can represent three 3-bit samples. In mode 3, two bits are also saved, since a 10-bit codeword represents three 4-bit samples.
Algorithm 1: Standard degrouping algorithm (nlevels: the number of quantization steps).

Degrouping.
Since grouping is used in the encoder, the decoder has to separate the combined codeword into individual samples by degrouping. According to the grouping equation in Table 1, degrouping has to perform division and modulo operations to separate the three individual samples. This process is defined by the MPEG standard algorithm and depicted in Algorithm 1. Within the degrouping algorithm, nlevels can be 3, 5, or 9. Table 3 summarizes the total arithmetic operations used in MPEG Layer II audio decoding. Over the whole decoding, a characteristic analysis of the arithmetic operations shows that multiplication and addition are the most common operations, and they are mainly applied in the synthesis subband filter [18,19]. Specifically, degrouping occupies only about 1% of the computation power in the whole MPEG-II decoding process [20]. In the SOC design trend, however, the computation amount is not the only concern. An easy-to-use solution that requires no additional design effort on the overall system is also desirable. In particular, the degrouping arithmetic operations are entirely different from those of the other decoding functions and thus cannot share resources with them. Whether the design targets Layer II decoding only or a universal MPEG audio decoder, such a small but unavoidable computation engine requires special design consideration and effort. Consequently, to reduce the circuit overhead and complexity, a low cost and high performance degrouping algorithm and architecture are necessary.
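The standard procedure described above can be sketched in a few lines of Python. This is only a reading of the division and modulo loop as the text describes it (one modulo and one division per sample, with the quotient carried to the next iteration); the function name `degroup_standard` is hypothetical.

```python
def degroup_standard(codeword, nlevels):
    """Split one grouped codeword into three samples by repeated modulo and division."""
    samples = []
    c = codeword
    for _ in range(3):
        samples.append(c % nlevels)  # remainder is the current sample
        c = c // nlevels             # quotient is the codeword for the next iteration
    return tuple(samples)


# Example: 262 = 81*3 + 9*2 + 1, so the 9-level triplet is (x, y, z) = (1, 2, 3).
print(degroup_standard(262, 9))
```

Each sample costs one division and one modulo by 3, 5, or 9, which is exactly the pair of operations the proposed method replaces.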

Proposed Algorithm
The degrouping function in the MPEG standard consists of division and modulo arithmetic operations. Unlike a straightforward implementation of these operations, our approach accomplishes them with only simple addition, subtraction, and shift operations. We make a mathematical deduction that expresses it as a generic formula. In Section 3.1, a general form is derived. Following the specification of degrouping, Section 3.2 presents the proposed degrouping algorithm.
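To start, let $A$ and $p$ be any two positive integers, $A, p > 0$. We can express the general form as $A = p \cdot q + r$, where $q$ is the quotient and $r$ is the remainder. Besides, $A$ can be represented as an $n$-digit tuple:
$$A = a_0 + a_1 \cdot 2 + a_2 \cdot 2^2 + \cdots + a_{n-1} \cdot 2^{n-1} = \{a_{n-1}, a_{n-2}, \ldots, a_1, a_0\}, \tag{1}$$
where $a_{n-1}, a_{n-2}, \ldots, a_1, a_0 \in \{0, 1\}$ and $n = \lceil \log_2 (A + 1) \rceil$. The operation $\{\}$ is the simplified expression for a digit-based tuple. From (1), it follows that if $p = 2^m$, then $A$ can be represented as given below:
$$A = \left(a_0 + a_1 \cdot 2 + \cdots + a_{m-1} \cdot 2^{m-1}\right) + 2^m \cdot \left(a_m + a_{m+1} \cdot 2 + a_{m+2} \cdot 2^2 + \cdots + a_{n-1} \cdot 2^{n-m-1}\right). \tag{2}$$
Comparing (1) and (2), $q$ and $r$ can be expressed as follows:
$$q = a_m + a_{m+1} \cdot 2 + a_{m+2} \cdot 2^2 + \cdots + a_{n-1} \cdot 2^{n-m-1} = \{a_{n-1}, a_{n-2}, \ldots, a_{m+1}, a_m\}, \qquad r = a_0 + a_1 \cdot 2 + \cdots + a_{m-1} \cdot 2^{m-1} = \{a_{m-1}, \ldots, a_1, a_0\}. \tag{3}$$
EURASIP Journal on Audio, Speech, and Music Processing
Table 2: Calculation and deviation range of $q'$ and $r'$.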
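For the power-of-two case p = 2^m, the split of the bit tuple into an upper part (quotient) and a lower part (remainder) corresponds to a shift and a mask. The short Python sketch below shows this; `divmod_pow2` is a hypothetical helper name, and the check range of 10-bit values is chosen only because the largest codeword in the text is 10 bits wide.

```python
def divmod_pow2(a, m):
    """Quotient and remainder of a by 2**m, read directly from the bit tuple of a."""
    q = a >> m               # upper digits {a_(n-1), ..., a_m}
    r = a & ((1 << m) - 1)   # lower digits {a_(m-1), ..., a_0}
    return q, r


# Compare against ordinary division for every 10-bit value and m = 1, 2, 3.
for m in (1, 2, 3):
    for a in range(1 << 10):
        assert divmod_pow2(a, m) == divmod(a, 1 << m)
print("shift and mask agree with division by 2**m")
```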

General Form as $p = 2^m + 1$.
As in (1), let $p = 2^m + 1$; then $m = 1$, 2, and 3 map to the three modes of the degrouping algorithm, respectively. From the previous discussion, $A$ is decomposed as in (4), where $q_k$ is the $k$-stage quotient, which can be expressed recursively in terms of the next-stage quotient and remainder, $q_{k+1}$ and $r_{k+1}$. Because $q_k < 2^m$, $q_{k+1} = 0$, and thus $q_k = r_{k+1}$. The iterative decomposition of (4) leads to (5). Comparing (2) and (5), the quantities $q'$ and $r'$ defined in (6) are easily calculated. They can be viewed as approximated results, which are not exactly equal to the correct quotient and remainder, $q$ and $r$. From (6), because $0 \le r_j \le 2^m - 1$ for $j = 1, 2, \ldots, k + 1$, the range of $q'$ and $r'$ can be clarified as in (7).
Substituting (7) into (5), we obtain the range of $q'$. Now let us consider the three modes, $m = 1$, 2, and 3.
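One way to write out the decomposition described above is sketched below. It is a derivation under stated assumptions, not necessarily the exact form of (4)-(7): the starting value $q_0 = A$, the summation notation, and the primes on the approximated results $q'$ and $r'$ are introduced here for illustration.

```latex
\begin{align*}
q_0 &= A, \qquad
q_j = 2^m q_{j+1} + r_{j+1} = p\, q_{j+1} + \left(r_{j+1} - q_{j+1}\right),
\qquad 0 \le r_{j+1} \le 2^m - 1, \\
A &= p \sum_{j=1}^{k} (-1)^{j-1} q_j + \sum_{j=1}^{k+1} (-1)^{j-1} r_j
\qquad \left(q_{k+1} = 0,\ r_{k+1} = q_k\right), \\
q' &= \sum_{j=1}^{k} (-1)^{j-1} q_j, \qquad
r' = \sum_{j=1}^{k+1} (-1)^{j-1} r_j, \qquad
A = p\, q' + r'.
\end{align*}
```

Bounding each $r_j$ by its bit width within the codeword ranges 0-26, 0-124, and 0-728 gives $-2 \le r' \le 3$ for $m = 1$, $-4 \le r' \le 6$ for $m = 2$, and $-8 \le r' \le 14$ for $m = 3$. In every mode $-p < r' < 2p$, so $q'$ differs from $q$ by at most one and $r'$ differs from $r$ by at most one divisor.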

Arithmetic Operations for Mode 1, 2, 3.
The proposed calculation of $q'$ and $r'$, together with their deviation ranges, is given in Table 2. It accomplishes the division and modulo by processing only the codeword $A$, which can be viewed as a 2-tuple representation $\{q_k, r_k\}$. Each intermediate operand, denoted as $A_m$ for convenience, is obtained by shifting $A$ right by $m$ bits and dropping the rightmost bits after each shift. Figure 3 gives a graphical representation of the proposed calculation of $q'$ and $r'$ in the three modes. In modes 2 and 3, four operands are generated by shifting, and these operands are combined in an interlaced manner by two subtractions and one addition. In mode 1, five operands are generated and the computation is achieved by two subtractions and two additions. The addition of the last operand, $A_4$, a one-digit number, can be treated as an additional carry for the adder. This saves one addition in mode 1, so that the processes for all three modes are then equivalent.
Beyond the fast calculation of $q'$ and $r'$, the exactly correct results $q$ and $r$ require further processing of $q'$ and $r'$. When a correction is needed, the correct $r$ is obtained by adding to or subtracting from $r'$ the value of the divisor in the associated mode, and the correct $q$ is obtained by adding one to or subtracting one from $q'$ in all three modes. This implies that just a simple and regular correction is needed to get the exactly correct values of $q$ and $r$ from $q'$ and $r'$, respectively. The detailed flow chart for the proposed algorithm is depicted in Figure 4.
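The two steps, approximation by shifted operands followed by a single correction, can be sketched in Python as follows. This is a software reading of the description above and not the authors' circuit or flow chart: the approximate results are written `q_approx` and `r_approx`, the function names are hypothetical, and a loop stands in for the fixed set of shifted operands.

```python
def approx_divmod(a, m):
    """Approximate quotient and remainder of a by p = 2**m + 1.

    Only shifts, masks, additions, and subtractions are used. The shifted
    operands are combined with alternating signs (+, -, +, ...).
    """
    mask = (1 << m) - 1
    q_approx, r_approx = 0, 0
    add = True
    x = a
    while x:
        low = x & mask   # stage remainder r_j
        x >>= m          # stage quotient q_j (the next shifted operand)
        if add:
            r_approx += low
            q_approx += x
        else:
            r_approx -= low
            q_approx -= x
        add = not add
    return q_approx, r_approx


def exact_divmod(a, m):
    """Correct the approximation with at most one +/-1 and +/-p adjustment."""
    p = (1 << m) + 1
    q, r = approx_divmod(a, m)
    if r < 0:        # remainder too small: borrow one divisor from the quotient
        q, r = q - 1, r + p
    elif r >= p:     # remainder too large: give one divisor back to the quotient
        q, r = q + 1, r - p
    return q, r


# Exhaustive check over the whole codeword range of each mode (0..p**3 - 1).
for m in (1, 2, 3):
    p = (1 << m) + 1
    for a in range(p ** 3):
        assert exact_divmod(a, m) == divmod(a, p)
print("corrected results match division and modulo in all three modes")
```

The exhaustive loop covers the codeword ranges 0-26, 0-124, and 0-728, so within those ranges a single correction step is sufficient.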

Data Reordering Scheme.
Based on the previous discussion, the proposed algorithm can be implemented by two subtractions and one addition with four operands, $A$, $A_m$, $A_{2m}$, and $A_{3m}$, in all three modes. To reduce the hardware cost, we use the concept of data reordering to change the data computation flow.
We first compute the arithmetic operation associated with the operands $A$ and $A_{2m}$, and then the one associated with the operands $A_m$ and $A_{3m}$. In fact, the result of $A_m$ plus $A_{3m}$ is obtained from the result of $A$ plus $A_{2m}$ simply by shifting it right by $m$ bits. This means that the arithmetic operation for $A_m$ plus $A_{3m}$ is trivial and can be removed. The data reordering scheme thus saves the hardware cost of one subtractor, as illustrated in Figure 5.
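The shift relation behind the data reordering can be checked numerically. In plain integer arithmetic it holds up to the carry out of the low m bits of the first sum, so the sketch below accounts for that carry explicitly. How the circuit handles this carry is not modeled here, and `reordering_holds` is a hypothetical helper name.

```python
def reordering_holds(a, m):
    """Check (A + A_2m) >> m == A_m + A_3m + carry, with carry from the low m bits."""
    mask = (1 << m) - 1
    a_m, a_2m, a_3m = a >> m, a >> (2 * m), a >> (3 * m)
    first_sum = a + a_2m                          # computed once
    carry = ((a & mask) + (a_2m & mask)) >> m     # carry out of the low m bits
    return (first_sum >> m) == a_m + a_3m + carry


# Exhaustive check over the codeword range of each mode.
for m in (1, 2, 3):
    p = (1 << m) + 1
    assert all(reordering_holds(a, m) for a in range(p ** 3))
print("the second sum follows from the first by a shift plus a low-bit carry")
```

For example, with m = 3 and A = 71, the first sum is 71 + 1 = 72, which shifts to 9, while $A_m + A_{3m}$ = 8 + 0; the difference is the carry of 7 + 1 out of the low three bits.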

Architecture Design
The architecture design adopts the proposed algorithm with the data reordering scheme. Figure 6 shows that the key components of this design are one special adder (SpADD), two subtractors (SUB), and two adders (ADD). Based on the maximum range of the codeword $A$ in mode 3, a 10-bit bus is assigned to $A$. The shifter performs a right shift of $2m$ bits to obtain another operand from $A$. The SpADD generates a 10-bit sum s and two one-bit carries, co0 and co1. co0 is the carry of the addition for the 4-bit LSB and co1 is the carry of the all-bit addition. As indicated in Figure 6, the signals s, co0, and co1 can be demultiplexed into the partial quotients q+ and q−, and the partial remainders r+ and r−. q+, q−, r+, and r− represent the operands in the 2-tuple representation $\{q_k, r_k\}$ of Figure 3. These partial results are fed into the two subtractors to generate $q'$ and $r'$. The following two adders correct $q'$ and $r'$ into the real results $q$ and $r$. Finally, $q$ is fed back and latched in the input register for use in the next degrouping cycle. This approach achieves a fixed throughput of one clock cycle per sample.
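The register feedback described above can be illustrated with a small behavioral model in Python. It is a sketch under several assumptions and is not bit-accurate: the SpADD, its carries, and the demultiplexing are not modeled, the class name `DegroupUnit` and the variable names are hypothetical, and the positive and negative partial sums are formed from a fixed set of five shifted operands, which covers mode 1 and is padded with zeros in modes 2 and 3.

```python
class DegroupUnit:
    """Behavioral model: each clock() call yields one sample and latches the quotient."""

    def __init__(self, m):
        if m not in (1, 2, 3):
            raise ValueError("m must be 1, 2, or 3 (divisors 3, 5, 9)")
        self.m = m
        self.p = (1 << m) + 1
        self.reg = 0  # input register holding the current codeword

    def load(self, codeword):
        self.reg = codeword

    def clock(self):
        a, m = self.reg, self.m
        mask = (1 << m) - 1
        # Fixed shifted operands: stage quotients q_1..q_5 and remainders r_1..r_5.
        qs = [a >> (j * m) for j in range(1, 6)]
        rs = [(a >> (j * m)) & mask for j in range(0, 5)]
        # Partial results with positive and negative sign, then two subtractions.
        q_plus, q_minus = qs[0] + qs[2] + qs[4], qs[1] + qs[3]
        r_plus, r_minus = rs[0] + rs[2] + rs[4], rs[1] + rs[3]
        q, r = q_plus - q_minus, r_plus - r_minus
        # Correction stage: at most one adjustment by 1 and by the divisor.
        if r < 0:
            q, r = q - 1, r + self.p
        elif r >= self.p:
            q, r = q + 1, r - self.p
        self.reg = q  # quotient is the codeword of the next cycle
        return r


unit = DegroupUnit(m=3)
unit.load(81 * 3 + 9 * 2 + 1)               # triplet (x, y, z) = (1, 2, 3)
print([unit.clock() for _ in range(3)])     # three cycles, one sample each
```

Three calls to `clock()` after one `load()` correspond to the three cycles needed for the three output samples of a codeword.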
The internal architecture of the SpADD is illustrated in Figure 7. It basically consists of four full adders and six half adders in a ripple-carry architecture. The signal c0 is the carry that represents the additional operand in mode 1.
The implemented circuit is nonpipelined. However, it can be easily pipelined by adding a register at every stage. Moreover, this architecture is simple and low cost while still meeting the high efficiency requirement.

Comparisons and Experimental Results
In this section, we describe the comparisons and experimental results for the proposed algorithm. The experiments cover the whole range of $A$ for all three modes, as illustrated in Figures 8, 9, and 10. They show the deviations of $q'$ with respect to $q$, and of $r'$ with respect to $r$. The deviation between the approximated results $q'$ and $r'$ and the real results $q$ and $r$ varies periodically. Some values of $q'$ and $r'$ are equal to $q$ and $r$, but others are not. For example, in mode 1, when the value of $r'$ is greater than 2, the value of $q'$ is less than $q$. When the value of $r'$ is less than 0, the value of $q'$ is greater than $q$. Every nonzero difference between $q'$ and $q$ is exactly equal to one. The comparisons between the standard algorithm and the proposed algorithm with its two schemes are given in Table 4. All the computation functions must have a minimum wordlength of 10 bits to cover the whole range of $A$. In addition, the architectural comparisons between the proposed design and some conventional techniques are shown in Table 5.
The proposed degrouping architecture is implemented as an IP, with the VLSI technical details summarized in Table 6. Owing to its regularity and modularity, our design needs only 527 gates in the applied technology. It can run at about 120 MHz, which is far above the low operating frequency of the 44.1 kHz audio sample rate. It also has the advantage of a fixed throughput of one clock cycle per sample.
To show our advantages in more detail, two reference designs with real implementation results are constructed and listed in Tables 7 and 8.
Figure 10: Experiment results of mode 3 for the deviation value of (a) $q'$ with respect to $q$ and (b) $r'$ with respect to $r$.
In Table 7, the total gate count is more than 3400, including the storage element and decoding circuit. Compared with Table 6, it takes almost seven times the gate count of our design. Another implementation result is listed in Table 8.
It is implemented on two versions of a popular general purpose processor, ARM7 and ARM9 [21]. The results show that these processors perform one degrouping iteration in 223 and 142 clock cycles, respectively. Our hardwired degrouping design consumes only 3 cycles to produce the 3 output samples of each iteration. Note that a programmable processor also needs space to store the program code; in these results, almost 2 KB of memory is needed. Based on the comparison results, our design achieves low complexity and high efficiency while using the least area.

Conclusions
Although it occupies little computation power in the whole decoding process, degrouping is an essential component in MPEG Layer II audio decoding, especially when a universal MPEG audio decoder is required. A straightforward design without thorough consideration of the algorithm gives an inefficient result. So far, no dedicated degrouping algorithm and architecture has been developed. We have proposed a novel degrouping algorithm that relies only on addition and subtraction instead of the division and modulo arithmetic operations of the standard algorithm. It maintains high efficiency without any loss of accuracy. The proposed design requires no multiplier, divider, or ROM table. In addition, a modified scheme based on data reordering is constructed to save one subtractor. Based on our algorithm, we propose a degrouping architecture that is simple and low cost and achieves high efficiency at a fixed throughput. Compared with general approaches such as direct table lookup or a direct programming-level solution, our method outperforms them in either physical gate count or throughput. It is easily applicable without any programming cost. The VLSI implementation result shows that only 527 gates are required. It is suitable for integration as a hard IP in the SOC design trend.