On the efficiency of iterative ordered subset reconstruction algorithms for acceleration on GPUs

https://doi.org/10.1016/j.cmpb.2009.09.003

Abstract

Expectation Maximization (EM) and the Simultaneous Iterative Reconstruction Technique (SIRT) are two iterative computed tomography reconstruction algorithms often used when the data contain a high amount of statistical noise, have been acquired from a limited angular range, or have a limited number of views. A popular mechanism to increase the rate of convergence of these types of algorithms has been to perform the correctional updates within subsets of the projection data. This has given rise to the method of Ordered Subsets EM (OS-EM) and the Simultaneous Algebraic Reconstruction Technique (SART). Commodity graphics hardware (GPUs) has shown great promise in combating the high computational demands incurred by iterative reconstruction algorithms. However, we find that the special architecture and programming model of GPUs place extra constraints on the real-time performance of ordered subsets algorithms, counteracting the speedup benefits of smaller subsets observed on CPUs. This gives rise to new relationships governing the optimal number of subsets, as well as relaxation factor settings, for obtaining the smallest wall-clock time for reconstruction, a choice that is likely application-dependent. In this paper we study the generalization of SIRT into Ordered Subsets SIRT (OS-SIRT) and show that this generalization allows one to optimize the computational performance of GPU-accelerated iterative algebraic reconstruction methods.

Introduction

The rapid growth in speed and capabilities of programmable commodity graphics hardware boards (GPUs) has propelled high-performance computing to the desktop, spawning applications far beyond those used in interactive computer games. High-end graphics boards, such as the NVIDIA 8800 GTX and its successors, offering 500 GFlops and more, are now available for less than $500, and their performance is consistently growing at triple the rate of Moore's law, which governs the growth of CPUs. Speedups of 1–2 orders of magnitude have been reported by many researchers when mapping CPU-based algorithms onto the GPU, in a wide variety of domains [10], [18], including medical imaging [8], [11], [13], [16]. These impressive gains originate in the highly parallel Single Instruction Multiple Data (SIMD) architecture of the GPU and its high-bandwidth memory access. For example, the NVIDIA 8800 GTX has 128 such SIMD pipelines, while the most recent NVIDIA card, the GTX 295, has 2 × 240 processors, yielding a peak performance of 1.7 TFlops.

It is important to note, however, that the high speedup rates facilitated by GPUs do not come easily. They require one to carefully map the target algorithm from the single-threaded programming model of the CPU to the multi-threaded SIMD programming model of the GPU, where each thread is dedicated to computing one element of the (final or intermediate) result vector. Here, special attention must be paid to keeping all of these pipelines busy. While there are hundreds of SIMD processors on the GPU, many more threads need to be created to hide data fetch latencies. It is important to avoid both thread congestion (too many threads waiting for execution) and thread starvation (not enough threads available to hide latencies). In addition, contention for local registers and caches must be avoided, since it limits the overall number of threads permitted to run simultaneously. For example, in [13], it was shown that a careful mapping of Feldkamp's filtered backprojection algorithm to the GPU yielded a 20× speedup over an optimized CPU implementation, enabling cone-beam reconstructions of 512³ volumes from 360 projections at a rate of 52 projections/s, greatly exceeding the data production rates of modern flat-panel X-ray scanners that have become popular in fully 3D medical imaging.
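
To make the one-thread-per-output-element mapping concrete, the sketch below gives a minimal CUDA analogue of a single backprojection pass in which every thread updates exactly one voxel of a 2D slice. This is a hypothetical illustration under a simplified parallel-beam geometry, not the GLSL implementation used in this work; all names (backproject, dSlice, dCorr) are ours.

    // Hypothetical CUDA sketch of the one-thread-per-voxel mapping
    // (the actual implementation discussed in this paper used GLSL).
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cmath>

    __global__ void backproject(float* slice,            // N x N voxels, updated in place
                                const float* correction, // nBins per-ray correction terms
                                int N, int nBins,
                                float cosA, float sinA,  // view angle
                                float lambda)            // relaxation factor
    {
        // Each thread owns exactly one voxel of the result vector.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= N || y >= N) return;

        // Project the voxel centre onto the detector for this view.
        float cx = x - 0.5f * N + 0.5f;
        float cy = y - 0.5f * N + 0.5f;
        float t  = cx * cosA + cy * sinA + 0.5f * nBins;

        // Linear interpolation between the two nearest detector bins.
        int b = (int)floorf(t);
        if (b < 0 || b + 1 >= nBins) return;
        float w = t - (float)b;
        float corr = (1.0f - w) * correction[b] + w * correction[b + 1];

        slice[y * N + x] += lambda * corr;  // scaled correctional update
    }

    int main() {
        const int N = 512, nBins = 512;
        float *dSlice, *dCorr;
        cudaMalloc(&dSlice, N * N * sizeof(float));
        cudaMalloc(&dCorr, nBins * sizeof(float));
        cudaMemset(dSlice, 0, N * N * sizeof(float));
        cudaMemset(dCorr, 0, nBins * sizeof(float));

        // 16 x 16 threads per block -> 32 x 32 = 1024 blocks for a 512^2 slice:
        // far more threads than SIMD pipelines, which is what hides memory latency.
        dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
        backproject<<<grid, block>>>(dSlice, dCorr, N, nBins,
                                     cosf(0.3f), sinf(0.3f), 0.5f);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dSlice); cudaFree(dCorr);
        return 0;
    }

Note how the voxel grid is overdecomposed into many more threads than there are physical pipelines; the scheduling headroom this creates is precisely what the small-subset schemes discussed below erode.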

The compute-intensive nature of iterative reconstruction algorithms motivated their acceleration via commodity graphics hardware early on, first using graphics workstations [9] and later GPU boards [14]. We now refine these works by analyzing the acceleration of iterative reconstruction algorithms more closely in terms of the underlying GPU programming model and architecture. Iterative algorithms differ from analytical algorithms in that they require frequent synchronization, which interrupts the stream of data, forces context switches, and limits the number of threads available for execution. Iterative algorithms, such as Expectation Maximization (EM) [12] or the Simultaneous Iterative Reconstruction Technique (SIRT) [5], consist of three phases, executed in an iterative fashion: (1) projection of the object estimate, (2) correction factor computation (the updates), and (3) backprojection of the object estimate updates. Each phase requires a separate pass. Flexibility comes from the concept of ordered subsets, which was originally devised mainly because it accelerates convergence. The projection data is divided into groups, the subsets, and the data within each of these groups undergo the three phases simultaneously. Here, it was found that the larger the number of subsets (that is, the smaller the groups), the faster the convergence typically is, but also the higher the noise, since there is more potential for over-correction. In EM, the method of Ordered Subsets (OS-EM) has become widely popular. OS-EM conceptually allows for any number of subsets, but the limit imposed by noise was noted already in the original work by Hudson and Larkin [7]. For the algebraic scheme, embodied by SIRT, the Simultaneous Algebraic Reconstruction Technique (SART) [2] is also an OS scheme, but with each subset consisting of only a single projection. In SART, the over-correction noise is kept low by scaling the updates by a relaxation factor λ < 1. Block-iterative schemes for algebraic techniques have been proposed as well [3]. In fact, the original ART [6] is the algorithm with the smallest subset size possible: a single data point (that is, a ray or projection pixel). A common update equation unifying these schemes is given below.
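
In symbols (a standard formulation consistent with the OS-SIRT generalization studied in this paper; the notation below is ours), one update of the volume estimate v over subset OS_s reads:

\[
v_j^{(k+1)} = v_j^{(k)} + \lambda\,
\frac{\displaystyle\sum_{p_i \in OS_s} \frac{p_i - \sum_{l=1}^{N} w_{il}\, v_l^{(k)}}{\sum_{l=1}^{N} w_{il}}\, w_{ij}}
     {\displaystyle\sum_{p_i \in OS_s} w_{ij}}
\]

where p_i is the measured value of ray i, w_ij is the contribution of voxel j to ray i, and λ is the relaxation factor. A single subset containing all projections yields SIRT, one projection per subset yields SART, and one ray per subset yields the original ART.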

It is well known that SART converges much faster than SIRT, and a well-chosen λ can overcome the problems with streak artifacts and reconstruction noise, allowing it to produce good reconstruction results [1]. On the CPU, a faster rate of convergence translates directly into shorter reconstruction times. But, as we previously demonstrated [15], when it comes to acceleration on a streaming architecture such as the GPU, SART is not the fastest algorithm in terms of time performance. In fact, the time performance is inversely related to the number of subsets, making SIRT the faster scheme. This is due to the overhead incurred by the frequent context switching when repeatedly moving through the three iterative phases: projection, correction, and backprojection (see the schematic loop below). In the same work, we also demonstrated that specific subset sizes can optimize both reconstruction quality and performance. However, the influence of the relaxation factor λ had not been considered in these experiments.
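
The following host-side sketch (hypothetical CUDA-style code mirroring the multi-pass structure of a GLSL implementation; the kernel names and the osIteration helper are ours) shows where this overhead comes from: every subset ends with its own backprojection pass and a synchronization point, so an iteration over T subsets incurs T context switches, and with one projection per subset (SART) each pass also has the fewest threads to schedule.

    // Schematic ordered-subsets iteration; the phase kernels are empty stubs
    // standing in for the real projection/correction/backprojection passes.
    #include <cuda_runtime.h>

    __global__ void project(const float* vol, float* est) { /* phase 1 stub */ }
    __global__ void correct(const float* meas, float* est) { /* phase 2 stub */ }
    __global__ void backproject(float* vol, const float* est, float lambda) { /* phase 3 stub */ }

    void osIteration(float* dVol, float* dEst, float** dMeas,
                     int numSubsets, int projsPerSubset, float lambda)
    {
        for (int s = 0; s < numSubsets; ++s) {
            // Phases 1 and 2 run once per projection in the subset ...
            for (int p = 0; p < projsPerSubset; ++p) {
                int view = s * projsPerSubset + p;
                project<<<256, 256>>>(dVol, dEst);
                correct<<<256, 256>>>(dMeas[view], dEst);
            }
            // ... but phase 3 and the synchronization run once per subset,
            // so many small subsets mean many pipeline flushes per iteration.
            backproject<<<256, 256>>>(dVol, dEst, lambda);
            cudaDeviceSynchronize();
        }
    }

    int main() {
        const int nProj = 360;
        float *dVol, *dEst;
        cudaMalloc(&dVol, 512 * 512 * sizeof(float));
        cudaMalloc(&dEst, 512 * sizeof(float));
        float* dMeas[nProj];
        for (int i = 0; i < nProj; ++i) cudaMalloc(&dMeas[i], 512 * sizeof(float));

        osIteration(dVol, dEst, dMeas, 360, 1, 0.5f);   // SART: 360 subsets of 1 view
        osIteration(dVol, dEst, dMeas, 1, 360, 1.0f);   // SIRT: 1 subset of 360 views
        return 0;
    }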

In the research reported here we complete the study of [15] and also investigate the role of λ in GPU reconstruction speed, in relation to subset size. We find that an optimal choice of λ can have a great impact. Although SART is slower than SIRT per iteration, an optimized λ-setting can reduce the number of iterations required for convergence by more than enough to outweigh the extra per-iteration cost incurred by GPU data streaming and context switching. This reinstates SART and the small-subset variants of OS-SIRT as the fastest iterative CT reconstruction schemes on GPUs as well.

We note, however, that the optimal setting is likely application-dependent, which is not unique to the GPU setting alone. Here we make the reasonable assumption that a given application will always involve similar types of data, and thus an optimal parameter setting, once found, will likely be close to optimal for all data within that application setting. In that sense, our aim in this paper is not to provide optimal subset and relaxation factor settings for all types of data, but rather to raise awareness of this phenomenon and offer an explanation.

Our paper is structured as follows. First, in Section 2, we discuss iterative algorithms in the context of ordered subsets, present a generalization of SIRT to OS-SIRT, and describe their acceleration on the GPU. Then, in Section 3, we study the impact of both subset size and relaxation factor on GPU reconstruction performance and present our results. Finally, Section 4 ends with conclusions.

Section snippets

Iterative reconstruction and its acceleration on GPUs

In the following discussion we consider only algebraic reconstruction algorithms (SART, SIRT), but our arguments and conclusions readily extend to Expectation Maximization (EM) algorithms as well, since these are very similar with respect to their mapping to GPUs [14].

Experiments and results

All our experiments were conducted on an NVIDIA 8800 GTX GPU, programmed with GLSL. For the first set of experiments we used the 2D Barbara test image to evaluate the performance of the different reconstruction schemes described above. We chose this image, popular in the image processing literature, because it has several coherent regions with high-frequency detail, which provide a readily observable test of the fidelity of a given reconstruction scheme. The target 2D image is obtained by cropping

Conclusions

We have shown that iterative reconstruction methods used in medical imaging, such as EM or SIRT, have special properties when it comes to their acceleration on GPUs. While splitting the data used within each iterative update into a larger number of smaller subsets has long been known to offer faster convergence and computation speed on the CPU, it can be vastly slower on the GPU. This is a direct consequence of the thread fill rate in the projection and backprojection phases. Larger subsets

Conflict of interest

None declared.

Acknowledgements

This work was partially funded by NSF grant CCR-0306438, NIH grants R21 EB004099-01 and GM31627 (DAA), the Keck Foundation, and the Howard Hughes Medical Institute (DAA).


Cited by (41)

  • Scalable double regularization for 3D Nano-CT reconstruction

    2020, Journal of Petroleum Science and Engineering
    Citation Excerpt:

    Although they always demand more computational power, IR methods usually generate fewer artifacts than FBP. IR encompasses a variety of methods in the literature from an algebraic perspective, such as the algebraic reconstruction technique (ART) (Gordon et al., 1970) and its extensions (Andersen and Kak, 1984; Badea and Gordon, 2004), the simultaneous iterative reconstruction technique (SIRT) (Gilbert, 1972), the ordered subset simultaneous iterative reconstruction technique (OS-SIRT) (Xu et al., 2010), and iterative coordinate descent (Thibault et al., 2007). Another line of IR methods takes a statistical perspective based on maximum likelihood methods (Lange and Carson, 1984; Manglos et al., 1995).

  • A review of GPU-based medical image reconstruction

    2017, Physica Medica
    Citation Excerpt:

    IR methods are not as straightforward as analytical methods to adapt to the GPU programming model, and new strategies must be developed to achieve significant acceleration factors. Xu et al., for instance, found that strategies adopted to speed up calculations on the CPU, such as using a large number of small subsets in ordered subset algorithms, do not result in the same acceleration on the GPU [105]. Some algorithms are simply too sequential by design to benefit from GPU acceleration.

  • GPU-accelerated iterative reconstruction from Compton scattered data using a matched pair of conic projector and backprojector

    2016, Computer Methods and Programs in Biomedicine
    Citation Excerpt:

    The computation time spent on projection was 552 seconds for the CPU-based method and 3.0 seconds for the GPU-based methods. Note that the computation times of the GPU-based methods decrease as the number of subsets is decreased [19]. Note also that the computation time of GPU-AP1 is longer than that of GPU-AP2 because GPU-AP2 does not involve the approximated chord-length computation, which is the most time-consuming step in GPU-AP1.
