Type-IV DCT, DST, and MDCT algorithms with reduced numbers of arithmetic operations
Introduction
In this paper, we present recursive algorithms for type-IV discrete cosine and sine transforms (DCT-IV and DST-IV) and modified discrete cosine transforms (MDCTs), of power-of-two sizes N, that require fewer total real additions and multiplications (herein called flops) than previously published algorithms (with an asymptotic reduction of about 6%), without sacrificing numerical accuracy. This work, extending our previous results for small fixed N [1], appears to be the first time in over 20 years that flop counts for the DCT-IV and MDCT have been reduced—although computation times are no longer generally determined by arithmetic counts [2], the question of the minimum number of flops remains of fundamental theoretical interest. Our fast algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) of size N to [1] compared to the (previous best) split-radix algorithm's [3], [4], [5], [6], [7]. Given the new FFT, we treat a DCT as an FFT of real-symmetric inputs and eliminate redundant operations to derive the new algorithm; in other work, we applied the same approach to derive improved algorithms for the type-II and type-III DCT and DST [8].
The algorithm for DCT-IV that we present has the same recursive structure as some previous DCT-IV algorithms, but the subtransforms are recursively rescaled in order to eliminate some of the multiplications. This approach reduces the flop count for the DCT-IV from the previous best of [9], [10], [11], [12], [13], [14], [15] toThe first savings occur for , and are summarized in Table 1. In order to derive a DCT-IV algorithm from the new FFT algorithm, we simply consider the DCT-IV to be a special case of a DFT with real input of a certain symmetry, and discard the redundant operations. (This should not be confused with algorithms that employ an unmodified FFT combined with pre/post-processing steps to obtain the DCT.) This well-known technique [1], [2], [6], [7], [8], [16], [17] allows any improvements in the DFT to be immediately translated to the DCT-IV, is simple to derive, avoids cumbersome re-invention of the “same” algorithm for each new trigonometric transform, and (starting with a split-radix FFT) matches the best previous flop counts for every type of DCT and DST. The connection to a DFT of symmetric data can also be viewed as the basic reason why DCT flop counts had not improved for so long: as we review below, the old DCT flop counts can be derived from a split-radix algorithm [14], and the 1968 flop count of split radix was only recently improved upon [1], [18]. There have been many previously published DCT-IV algorithms derived by a variety of techniques, some achieving flops [9], [10], [11], [12], [13], [14], [15] and others obtaining larger or unreported flop counts [19], [20], [21], [22]. Furthermore, exactly the same flop count (1) is now obtained for the type-IV discrete sine transform (DST-IV), since a DST-IV can be obtained from a DCT-IV via flipping the sign of every other input (zero flops, since the sign changes can be absorbed by converting subsequent additions into subtractions or vice versa) and a permutation of the output (zero flops) [11]. Also, in many practical circumstances the output can be scaled by an arbitrary factor (since any scaling can be absorbed into a subsequent computation); in this case, similar to the well-known savings for a scaled-output size-8 DCT-II in JPEG compression [1], [8], [23], [24], we show that an additional N multiplications can be saved for a scaled-output (or scaled-input) DCT-IV.
Indeed, if we only wished to show that the asymptotic flop count for DCT-IV could be reduced to , we could simply apply known algorithms to express a DCT-IV in terms of a real-input DFT (e.g. by reducing it to the DCT-III [9] and thence to a real-input DFT [25]) to immediately apply the flop count for a real-input DFT from our previous paper [1]. However, with FFT and DCT algorithms, there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations. Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the terms as well as the asymptotic constant.
An important transform closely related to the DCT-IV is an MDCT, which takes inputs and produces N outputs, and is designed to be applied to 50%-overlapped blocks of data [26], [27]. Such a “lapped” transform reduces artifacts from block boundaries and is widely used in audio compression [28]. In fact, an MDCT is exactly equivalent to a DCT-IV of size N, where the inputs have been pre-processed with N additions/subtractions [29], [30], [31], [32]. This means that the flop count for an MDCT is at most that of a DCT-IV plus N flops. Precisely this technique led to the best previous flop count for an MDCT, [29], [30], [31], [32]. (There have also been several MDCT algorithms published with larger or unreported flop counts [33], [34], [35], [36], [37], [38], [39], [40], [41], [42].) It also means that our improved DCT-IV immediately produces an improved MDCT, with a flop count of Eq. (1) plus N. Similarly for the inverse MDCT (IMDCT), which takes N inputs to outputs and is equivalent to a DCT-IV of size N plus N negations (which should not, we argue, be counted in the flops because they can be absorbed by converting subsequent additions into subtractions).
In the following sections, we first briefly review the new FFT algorithm, previously described in detail [1]. Then, we review how a DCT-IV may be expressed as a special case of a real DFT, and how the new DCT-IV algorithm may be derived by applying the new FFT algorithm and pruning the redundant operations. In doing so, we find it necessary to develop an algorithm for a DCT-III with scaled output. (Previously, we had derived a fast algorithm for a DCT-III based on the new FFT, but only for the case of scaled or unscaled input [8].) This DCT-III algorithm follows the same approach of eliminating redundant operations from our new scaled-output FFT (a subtransform of the new FFT) applied to appropriate real-symmetric inputs. We then analyze the flop counts for the DCT-III and DCT-IV algorithms. Finally, we show that this improved DCT-IV immediately leads to improved DST-IV, MDCT, and IMDCT algorithms. We close with some concluding remarks about future directions.
Section snippets
Review of the new FFT
To obtain the new FFT, we used as our starting point a variation called the conjugate-pair FFT of the well-known split-radix algorithm. Here, we first review the conjugate-pair FFT, and then briefly summarize how this was modified to reduce the number of flops.
Fast DCT-IV from new FFT
Various forms of discrete cosine transform have been defined, corresponding to different boundary conditions on the transform. The type-IV DCT is defined as a real, linear transformation by the formulafor N inputs and N outputs . The transform can be made orthogonal (unitary) by multiplying with a normalization factor , but for our purposes the unnormalized form is more convenient (and has no effect on the number of operations). We will now derive an
DCT-III from new FFT
The type-III DCT (for a convenient choice of normalization) is defined by Eq. (10) for N inputs and N outputs . We will now follow a process similar to that in the previous section for the DCT-IV: we first express in terms of a larger DFT of length , then apply the new FFT algorithm of Section 2.2, and finally discard the redundant operations to yield an efficient DCT-III algorithm. The resulting algorithm matches the DCT-III flop count of our previous publication [8], which
Flop counts for DCT-III/IV
First, we will show that Algorithm 3 gives the best previous flop count for the DCT-III if the scaling factor s is set to 1. Inspection of Algorithm 2 yields a flop count for the DCT-IV, and substituting gives the previous best flop count of for the DCT-IV. Then, we will analyze how many operations are saved when the scaling factors are included.
If , we can see from Algorithm 3 that a DCT-III of size N is decomposed into a DCT-III of size , a
DST-IV from DCT-IV
The (unnormalized) DST-IV is defined asfor . Although we could derive fast algorithms for directly by treating it as a DFT of length with odd symmetry, interleaved with zeros, and discarding redundant operations similar to above, it turns out there is a simpler technique. The DST-IV is exactly equivalent to a DCT-IV in which the outputs are reversed and every other input is multiplied by (or vice versa) [11]:for
MDCT from DCT-IV
In this section, we will present a new modified DCT (MDCT) algorithm in terms of our new DCT-IV algorithm with an improved flop count compared to the best previously published counts. The key fact is that the best previous flop count for an MDCT of inputs and N outputs was obtained by reducing the problem to a DCT-IV plus N extra additions [29], [30], [31], [32]. Therefore, our improved DCT-IV algorithm immediately gives an improved MDCT. Similarly for the IMDCT, except that in that case
Concluding remarks
We have derived new algorithms for the DCT-IV, DST-IV, MDCT, and IMDCT that reduce the flops for a size from to , representing the first improvement in their flop counts for many years and stemming from similar developments for the DFT [1], [18]. We do not claim that these flop counts are the best possible, although we are not currently aware of any way to obtain further reductions in either the leading coefficient or in our terms. It is possible that
Acknowledgments
This work was supported in part by a grant from the MIT Undergraduate Research Opportunities Program. The authors are also grateful to M. Frigo, co-author of FFTW with S.G.J. [2], for many helpful discussions.
References (58)
- et al.
Simple FFT and DCT algorithms with reduced number of operations
Signal Processing
(1984) - et al.
Fast Fourier transforms: a tutorial review and a state of the art
Signal Processing
(1990) - et al.
The fast generalized discrete Fourier transforms: a unified approach to the discrete sinusoidal transforms computation
Signal Processing
(1999) The fast DCT-IV/DST-IV computation via the MDCT
Signal Processing
(2003)- et al.
Fast and numerically stable algorithms for discrete cosine transforms
Linear Algebra Appl.
(2005) - et al.
Adaptive transform coding incorporating time domain aliasing cancellation
Speech Commun.
(1987) - et al.
A new fast algorithm for the unified forward and inverse MDCT/MDST computation
Signal Processing
(2002) - et al.
A modified split-radix FFT with fewer arithmetic operations
IEEE Trans. Signal Process.
(2007) - et al.
The design and implementation of FFTW3
Proc. IEEE
(2005) - R. Yavne, An economical method for calculating the discrete Fourier transform, in: Proceedings of the AFIPS Fall Joint...
Split-radix FFT algorithm
Electron. Lett.
Recursive cyclotomic factorization—a new algorithm for calculating the discrete Fourier transform
IEEE Trans. Acoust. Speech Signal Process.
On computing the discrete Fourier and cosine transforms
IEEE Trans. Acoust. Speech Signal Process.
Fast algorithms for the DFT and other sinusoidal transforms
IEEE Trans. Acoust. Speech Signal Process.
Direct methods for computing discrete sinusoidal transforms
IEE Proc. F
Fast algorithm for computing discrete cosine transform
IEEE Trans. Signal Process.
Implementation of “split-radix” FFT algorithms for complex, real, and real-symmetric data
IEEE Trans. Acoust. Speech Signal Process.
A new matrix approach to real FFTs and convolutions of length
Computing
Fast algorithms for the discrete transform and for the discrete Fourier transform
IEEE Trans. Acoust. Speech Signal Process.
On the algorithms for the computation of even discrete cosine transform-2 (EDCT-2) of real sequences
IEEE Trans. Circuits Systems
The algebraic approach to the discrete cosine and sine transforms and their fast algorithms
SIAM J. Comput.
A fast DCT-SQ scheme for images
Trans. IEICE
JPEG Still Image Data Compression Standard
On the computation of the discrete cosine transform
IEEE Trans. Commun.
Analysis/synthesis filter bank design based on time domain aliasing cancellation
IEEE Trans. Acoust. Speech Signal Process.
Perceptual coding of digital audio
Proc. IEEE
Cited by (37)
Complexity reduction, self/completely recursive, radix-2 DCT I/IV algorithms
2020, Journal of Computational and Applied MathematicsCitation Excerpt :mcos2(x,n) The authors of [31] established the total lowest flop counts by reducing the number of multiplications in calculating DCT-IV while leaving an open question: whether the DCT-IV algorithm in [31] leads to practical gains in performance on real computers. To compute DCT-IV, the authors of [31] used twiddle factors but these factors can be overlapped with other operations in the CPU [21].
Design of a filter bank multi-carrier system for broadband power line communications
2016, Signal ProcessingA Wavelet OFDM receiver for baseband power line communications
2016, Journal of the Franklin InstituteCitation Excerpt :Central to the operation of this system is DCT4e as the basic transform block at both the receiver and the transmitter sides. This transform has played a key role in numerous applications, such as audio signals compression, and thus many fast computational structures can be found (see, e.g., [24–26]). Tables 1 and 2 include the computational cost of some efficient algorithms, and as it can be seen, the algorithm proposed in [26] provides the lower flop count and the lower number of multiplications.
Generalized fast mixed-radix algorithm for the computation of forward and inverse MDCTs
2012, Signal Processing