Elsevier

Signal Processing

Volume 88, Issue 6, June 2008, Pages 1313-1326
Signal Processing

Type-IV DCT, DST, and MDCT algorithms with reduced numbers of arithmetic operations

https://doi.org/10.1016/j.sigpro.2007.11.024Get rights and content

Abstract

We present algorithms for the type-IV discrete cosine transform (DCT-IV) and discrete sine transform (DST-IV), as well as for the modified discrete cosine transform (MDCT) and its inverse, that achieve a lower count of real multiplications and additions than previously published algorithms, without sacrificing numerical accuracy. Asymptotically, the operation count is reduced from 2Nlog2N+O(N) to 179Nlog2N+O(N) for a power-of-two transform size N, and the exact count is strictly lowered for all N8. These results are derived by considering the DCT to be a special case of a DFT of length 8N, with certain symmetries, and then pruning redundant operations from a recent improved fast Fourier transform algorithm (based on a recursive rescaling of the conjugate-pair split-radix algorithm). The improved algorithms for DST-IV and MDCT follow immediately from the improved count for the DCT-IV.

Introduction

In this paper, we present recursive algorithms for type-IV discrete cosine and sine transforms (DCT-IV and DST-IV) and modified discrete cosine transforms (MDCTs), of power-of-two sizes N, that require fewer total real additions and multiplications (herein called flops) than previously published algorithms (with an asymptotic reduction of about 6%), without sacrificing numerical accuracy. This work, extending our previous results for small fixed N [1], appears to be the first time in over 20 years that flop counts for the DCT-IV and MDCT have been reduced—although computation times are no longer generally determined by arithmetic counts [2], the question of the minimum number of flops remains of fundamental theoretical interest. Our fast algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) of size N to 349Nlog2N+O(N) [1] compared to the (previous best) split-radix algorithm's 4Nlog2N+O(N) [3], [4], [5], [6], [7]. Given the new FFT, we treat a DCT as an FFT of real-symmetric inputs and eliminate redundant operations to derive the new algorithm; in other work, we applied the same approach to derive improved algorithms for the type-II and type-III DCT and DST [8].

The algorithm for DCT-IV that we present has the same recursive structure as some previous DCT-IV algorithms, but the subtransforms are recursively rescaled in order to eliminate some of the multiplications. This approach reduces the flop count for the DCT-IV from the previous best of 2Nlog2N+N [9], [10], [11], [12], [13], [14], [15] to179Nlog2N+3127N+29(-1)log2Nlog2N-427(-1)log2N.The first savings occur for N=8, and are summarized in Table 1. In order to derive a DCT-IV algorithm from the new FFT algorithm, we simply consider the DCT-IV to be a special case of a DFT with real input of a certain symmetry, and discard the redundant operations. (This should not be confused with algorithms that employ an unmodified FFT combined with O(N) pre/post-processing steps to obtain the DCT.) This well-known technique [1], [2], [6], [7], [8], [16], [17] allows any improvements in the DFT to be immediately translated to the DCT-IV, is simple to derive, avoids cumbersome re-invention of the “same” algorithm for each new trigonometric transform, and (starting with a split-radix FFT) matches the best previous flop counts for every type of DCT and DST. The connection to a DFT of symmetric data can also be viewed as the basic reason why DCT flop counts had not improved for so long: as we review below, the old DCT flop counts can be derived from a split-radix algorithm [14], and the 1968 flop count of split radix was only recently improved upon [1], [18]. There have been many previously published DCT-IV algorithms derived by a variety of techniques, some achieving 2Nlog2N+N flops [9], [10], [11], [12], [13], [14], [15] and others obtaining larger or unreported flop counts [19], [20], [21], [22]. Furthermore, exactly the same flop count (1) is now obtained for the type-IV discrete sine transform (DST-IV), since a DST-IV can be obtained from a DCT-IV via flipping the sign of every other input (zero flops, since the sign changes can be absorbed by converting subsequent additions into subtractions or vice versa) and a permutation of the output (zero flops) [11]. Also, in many practical circumstances the output can be scaled by an arbitrary factor (since any scaling can be absorbed into a subsequent computation); in this case, similar to the well-known savings for a scaled-output size-8 DCT-II in JPEG compression [1], [8], [23], [24], we show that an additional N multiplications can be saved for a scaled-output (or scaled-input) DCT-IV.

Indeed, if we only wished to show that the asymptotic flop count for DCT-IV could be reduced to 179Nlog2N+O(N), we could simply apply known algorithms to express a DCT-IV in terms of a real-input DFT (e.g. by reducing it to the DCT-III [9] and thence to a real-input DFT [25]) to immediately apply the 179Nlog2N+O(N) flop count for a real-input DFT from our previous paper [1]. However, with FFT and DCT algorithms, there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations. Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the O(N) terms as well as the asymptotic constant.

An important transform closely related to the DCT-IV is an MDCT, which takes 2N inputs and produces N outputs, and is designed to be applied to 50%-overlapped blocks of data [26], [27]. Such a “lapped” transform reduces artifacts from block boundaries and is widely used in audio compression [28]. In fact, an MDCT is exactly equivalent to a DCT-IV of size N, where the 2N inputs have been pre-processed with N additions/subtractions [29], [30], [31], [32]. This means that the flop count for an MDCT is at most that of a DCT-IV plus N flops. Precisely this technique led to the best previous flop count for an MDCT, 2Nlog2N+2N [29], [30], [31], [32]. (There have also been several MDCT algorithms published with larger or unreported flop counts [33], [34], [35], [36], [37], [38], [39], [40], [41], [42].) It also means that our improved DCT-IV immediately produces an improved MDCT, with a flop count of Eq. (1) plus N. Similarly for the inverse MDCT (IMDCT), which takes N inputs to 2N outputs and is equivalent to a DCT-IV of size N plus N negations (which should not, we argue, be counted in the flops because they can be absorbed by converting subsequent additions into subtractions).

In the following sections, we first briefly review the new FFT algorithm, previously described in detail [1]. Then, we review how a DCT-IV may be expressed as a special case of a real DFT, and how the new DCT-IV algorithm may be derived by applying the new FFT algorithm and pruning the redundant operations. In doing so, we find it necessary to develop an algorithm for a DCT-III with scaled output. (Previously, we had derived a fast algorithm for a DCT-III based on the new FFT, but only for the case of scaled or unscaled input [8].) This DCT-III algorithm follows the same approach of eliminating redundant operations from our new scaled-output FFT (a subtransform of the new FFT) applied to appropriate real-symmetric inputs. We then analyze the flop counts for the DCT-III and DCT-IV algorithms. Finally, we show that this improved DCT-IV immediately leads to improved DST-IV, MDCT, and IMDCT algorithms. We close with some concluding remarks about future directions.

Section snippets

Review of the new FFT

To obtain the new FFT, we used as our starting point a variation called the conjugate-pair FFT of the well-known split-radix algorithm. Here, we first review the conjugate-pair FFT, and then briefly summarize how this was modified to reduce the number of flops.

Fast DCT-IV from new FFT

Various forms of discrete cosine transform have been defined, corresponding to different boundary conditions on the transform. The type-IV DCT is defined as a real, linear transformation by the formulaCkIV=n=0N-1xncosπNn+12k+12,for N inputs xn and N outputs CkIV. The transform can be made orthogonal (unitary) by multiplying with a normalization factor 2/N, but for our purposes the unnormalized form is more convenient (and has no effect on the number of operations). We will now derive an

DCT-III from new FFT

The type-III DCT (for a convenient choice of normalization) is defined by Eq. (10) for N inputs xn and N outputs CkIII. We will now follow a process similar to that in the previous section for the DCT-IV: we first express CkIII in terms of a larger DFT of length 4N, then apply the new FFT algorithm of Section 2.2, and finally discard the redundant operations to yield an efficient DCT-III algorithm. The resulting algorithm matches the DCT-III flop count of our previous publication [8], which

Flop counts for DCT-III/IV

First, we will show that Algorithm 3 gives the best previous flop count T(N)=2Nlog2N-N+1 for the DCT-III if the scaling factor s is set to 1. Inspection of Algorithm 2 yields a flop count 4N-2+2T(N/2) for the DCT-IV, and substituting T(N) gives the previous best flop count of 2Nlog2N+N for the DCT-IV. Then, we will analyze how many operations are saved when the scaling factors are included.

If s=1, we can see from Algorithm 3 that a DCT-III of size N is decomposed into a DCT-III of size N/2, a

DST-IV from DCT-IV

The (unnormalized) DST-IV is defined asSkIV=n=0N-1xnsinπNn+12k+12for k=0,,N-1. Although we could derive fast algorithms for SkIV directly by treating it as a DFT of length 8N with odd symmetry, interleaved with zeros, and discarding redundant operations similar to above, it turns out there is a simpler technique. The DST-IV is exactly equivalent to a DCT-IV in which the outputs are reversed and every other input is multiplied by -1 (or vice versa) [11]:SN-1-k=2n=0N-1(-1)nxncosπNn+12k+12for k=

MDCT from DCT-IV

In this section, we will present a new modified DCT (MDCT) algorithm in terms of our new DCT-IV algorithm with an improved flop count compared to the best previously published counts. The key fact is that the best previous flop count for an MDCT of 2N=2m inputs and N outputs was obtained by reducing the problem to a DCT-IV plus N extra additions [29], [30], [31], [32]. Therefore, our improved DCT-IV algorithm immediately gives an improved MDCT. Similarly for the IMDCT, except that in that case

Concluding remarks

We have derived new algorithms for the DCT-IV, DST-IV, MDCT, and IMDCT that reduce the flops for a size N=2m from 2Nlog2N+O(N) to 179Nlog2N+O(N), representing the first improvement in their flop counts for many years and stemming from similar developments for the DFT [1], [18]. We do not claim that these flop counts are the best possible, although we are not currently aware of any way to obtain further reductions in either the leading coefficient or in our O(N) terms. It is possible that

Acknowledgments

This work was supported in part by a grant from the MIT Undergraduate Research Opportunities Program. The authors are also grateful to M. Frigo, co-author of FFTW with S.G.J. [2], for many helpful discussions.

References (58)

  • P. Duhamel et al.

    Split-radix FFT algorithm

    Electron. Lett.

    (1984)
  • J.B. Martens

    Recursive cyclotomic factorization—a new algorithm for calculating the discrete Fourier transform

    IEEE Trans. Acoust. Speech Signal Process.

    (1984)
  • X. Shao, S.G. Johnson, Type-II/III DCT/DST algorithms with reduced number of arithmetic operations, Signal Processing,...
  • Z. Wang

    On computing the discrete Fourier and cosine transforms

    IEEE Trans. Acoust. Speech Signal Process.

    (1985)
  • N. Suehiro et al.

    Fast algorithms for the DFT and other sinusoidal transforms

    IEEE Trans. Acoust. Speech Signal Process.

    (1986)
  • S.C. Chan et al.

    Direct methods for computing discrete sinusoidal transforms

    IEE Proc. F

    (1990)
  • C.W. Kok

    Fast algorithm for computing discrete cosine transform

    IEEE Trans. Signal Process.

    (1997)
  • G. Plonka, M. Tasche, Split-radix algorithms for discrete trigonometric transforms, 2002, online at...
  • P. Duhamel

    Implementation of “split-radix” FFT algorithms for complex, real, and real-symmetric data

    IEEE Trans. Acoust. Speech Signal Process.

    (1986)
  • R. Vuduc, J. Demmel, Code generators for automatic tuning of numerical kernels: experiences with FFTW, in: Proceedings...
  • T. Lundy et al.

    A new matrix approach to real FFTs and convolutions of length 2k

    Computing

    (2007)
  • Z. Wang

    Fast algorithms for the discrete W transform and for the discrete Fourier transform

    IEEE Trans. Acoust. Speech Signal Process.

    (1984)
  • N.R. Murthy et al.

    On the algorithms for the computation of even discrete cosine transform-2 (EDCT-2) of real sequences

    IEEE Trans. Circuits Systems

    (1990)
  • M. Püschel et al.

    The algebraic approach to the discrete cosine and sine transforms and their fast algorithms

    SIAM J. Comput.

    (2003)
  • Y. Arai et al.

    A fast DCT-SQ scheme for images

    Trans. IEICE

    (1988)
  • W.B. Pennebaker et al.

    JPEG Still Image Data Compression Standard

    (1993)
  • M.J. Narasimha et al.

    On the computation of the discrete cosine transform

    IEEE Trans. Commun.

    (1978)
  • J.P. Princen et al.

    Analysis/synthesis filter bank design based on time domain aliasing cancellation

    IEEE Trans. Acoust. Speech Signal Process.

    (1986)
  • T. Painter et al.

    Perceptual coding of digital audio

    Proc. IEEE

    (2000)
  • Cited by (37)

    • Complexity reduction, self/completely recursive, radix-2 DCT I/IV algorithms

      2020, Journal of Computational and Applied Mathematics
      Citation Excerpt :

      mcos2(x,n) The authors of [31] established the total lowest flop counts by reducing the number of multiplications in calculating DCT-IV while leaving an open question: whether the DCT-IV algorithm in [31] leads to practical gains in performance on real computers. To compute DCT-IV, the authors of [31] used twiddle factors but these factors can be overlapped with other operations in the CPU [21].

    • A Wavelet OFDM receiver for baseband power line communications

      2016, Journal of the Franklin Institute
      Citation Excerpt :

      Central to the operation of this system is DCT4e as the basic transform block at both the receiver and the transmitter sides. This transform has played a key role in numerous applications, such as audio signals compression, and thus many fast computational structures can be found (see, e.g., [24–26]). Tables 1 and 2 include the computational cost of some efficient algorithms, and as it can be seen, the algorithm proposed in [26] provides the lower flop count and the lower number of multiplications.

    View all citing articles on Scopus
    View full text