On the efficiency of iterative ordered subset reconstruction algorithms for acceleration on GPUs

https://doi.org/10.1016/j.cmpb.2009.09.003

Abstract

Expectation Maximization (EM) and the Simultaneous Iterative Reconstruction Technique (SIRT) are two iterative computed tomography reconstruction algorithms often used when the data contain a high amount of statistical noise, have been acquired from a limited angular range, or have a limited number of views. A popular mechanism to increase the rate of convergence of these types of algorithms has been to perform the correctional updates within subsets of the projection data. This has given rise to the method of Ordered Subsets EM (OS-EM) and the Simultaneous Algebraic Reconstruction Technique (SART). Commodity graphics hardware (GPUs) has shown great promise in combating the high computational demands incurred by iterative reconstruction algorithms. However, we find that the special architecture and programming model of GPUs place extra constraints on the real-time performance of ordered subsets algorithms, counteracting the speedup benefits of smaller subsets observed on CPUs. This gives rise to new relationships governing the optimal number of subsets, as well as relaxation factor settings, for obtaining the smallest wall-clock time for reconstruction, a choice that is likely application-dependent. In this paper we study the generalization of SIRT into Ordered Subsets SIRT (OS-SIRT) and show that this generalization allows one to optimize the computational performance of GPU-accelerated iterative algebraic reconstruction methods.

Introduction

The rapid growth in speed and capabilities of programmable commodity graphics hardware boards (GPUs) has propelled high-performance computing to the desktop, spawning applications far beyond those used in interactive computer games. High-end graphics boards, such as the NVIDIA 8800 GTX and its successors, offering 500 GFlops and more, are now available for less than $500, and their performance is consistently growing at triple the rate of Moore's law, which governs the growth of CPUs. Speedups of 1–2 orders of magnitude have been reported by many researchers when mapping CPU-based algorithms onto the GPU, in a wide variety of domains [10], [18], including medical imaging [8], [11], [13], [16]. These impressive gains originate in the highly parallel Single Instruction Multiple Data (SIMD) architecture of the GPU and its high-bandwidth memory access. For example, the NVIDIA 8800 GTX has 128 such SIMD pipelines, while the most recent NVIDIA card, the GTX 295, has 2 × 240 processors, yielding a peak performance of 1.7 TFlops.

It is important to note, however, that the high speedup rates facilitated by GPUs do not come easily. They require one to carefully map the target algorithm from the single-threaded programming model of the CPU to the multi-threaded SIMD programming model of the GPU, where each thread is dedicated to computing one element of the (final or intermediate) result vector. Here, special attention must be paid to keeping all of these pipelines busy. While there are hundreds of SIMD processors on the GPU, many more threads need to be created to hide data fetch latencies. It is important to avoid both thread congestion (too many threads waiting for execution) and thread starvation (not enough threads available to hide latencies). In addition, contention for local registers and caches must be avoided, since it limits the overall number of threads permitted to run simultaneously. For example, in [13], it was shown that a careful mapping of Feldkamp's filtered backprojection algorithm to the GPU yielded a 20× speedup over an optimized CPU implementation, enabling cone-beam reconstructions of 512³ volumes from 360 projections at a rate of 52 projections/s, greatly exceeding the data production rates of modern flat-panel X-ray scanners that have become popular in fully 3D medical imaging.
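
To make the one-thread-per-output-element mapping concrete, the sketch below gives a minimal CUDA analogue of a single backprojection pass in which every thread updates exactly one voxel of a 2D slice. This is a hypothetical illustration under a simplified parallel-beam geometry, not the GLSL implementation used in this work; all names (backproject, dSlice, dCorr) are ours.

    // Hypothetical CUDA sketch of the one-thread-per-voxel mapping
    // (the actual implementation discussed in this paper used GLSL).
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cmath>

    __global__ void backproject(float* slice,            // N x N voxels, updated in place
                                const float* correction, // nBins per-ray correction terms
                                int N, int nBins,
                                float cosA, float sinA,  // view angle
                                float lambda)            // relaxation factor
    {
        // Each thread owns exactly one voxel of the result vector.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= N || y >= N) return;

        // Project the voxel centre onto the detector for this view.
        float cx = x - 0.5f * N + 0.5f;
        float cy = y - 0.5f * N + 0.5f;
        float t  = cx * cosA + cy * sinA + 0.5f * nBins;

        // Linear interpolation between the two nearest detector bins.
        int b = (int)floorf(t);
        if (b < 0 || b + 1 >= nBins) return;
        float w = t - (float)b;
        float corr = (1.0f - w) * correction[b] + w * correction[b + 1];

        slice[y * N + x] += lambda * corr;  // scaled correctional update
    }

    int main() {
        const int N = 512, nBins = 512;
        float *dSlice, *dCorr;
        cudaMalloc(&dSlice, N * N * sizeof(float));
        cudaMalloc(&dCorr, nBins * sizeof(float));
        cudaMemset(dSlice, 0, N * N * sizeof(float));
        cudaMemset(dCorr, 0, nBins * sizeof(float));

        // 16 x 16 threads per block -> 32 x 32 = 1024 blocks for a 512^2 slice:
        // far more threads than SIMD pipelines, which is what hides memory latency.
        dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
        backproject<<<grid, block>>>(dSlice, dCorr, N, nBins,
                                     cosf(0.3f), sinf(0.3f), 0.5f);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dSlice); cudaFree(dCorr);
        return 0;
    }

Note how the voxel grid is overdecomposed into many more threads than there are physical pipelines; the scheduling headroom this creates is precisely what the small-subset schemes discussed below erode.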

The compute-intensive nature of iterative reconstruction algorithms motivated their acceleration via commodity graphics hardware early on, first using graphics workstations [9] and later GPU boards [14]. We now refine these works by analyzing the acceleration of iterative reconstruction algorithms more closely in terms of the underlying GPU programming model and architecture. Iterative algorithms differ from analytical algorithms in that they require frequent synchronization, which interrupts the stream of data, forces context switches, and limits the number of threads available for execution. Iterative algorithms, such as Expectation Maximization (EM) [12] or the Simultaneous Iterative Reconstruction Technique (SIRT) [5], consist of three phases, executed in an iterative fashion: (1) projection of the object estimate, (2) correction factor computation (the updates), and (3) backprojection of the object estimate updates. Each phase requires a separate pass. Flexibility comes from the concept of ordered subsets, which was originally devised mainly because it accelerates convergence. The projection data is divided into groups, the subsets, and the data within each of these groups undergo the three phases simultaneously. Here, it was found that the larger the number of subsets (that is, the smaller the groups), the faster the convergence typically is, but also the higher the noise, since there is more potential for over-correction. In EM, the method of Ordered Subsets (OS-EM) has become widely popular. OS-EM conceptually allows for any number of subsets, but the limit imposed by noise was noted already in the original work by Hudson and Larkin [7]. For the algebraic scheme, embodied by SIRT, the Simultaneous Algebraic Reconstruction Technique (SART) [2] is also an OS scheme, but with each subset consisting of only a single projection. In SART, the over-correction noise is kept low by scaling the updates by a relaxation factor λ < 1. Block-iterative schemes for algebraic techniques have been proposed as well [3]. In fact, the original ART [6] is the algorithm with the smallest subset size possible: a single data point (that is, a ray or projection pixel). A common update equation unifying these schemes is given below.
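
In symbols (a standard formulation consistent with the OS-SIRT generalization studied in this paper; the notation below is ours), one update of the volume estimate v over subset OS_s reads:

\[
v_j^{(k+1)} = v_j^{(k)} + \lambda\,
\frac{\displaystyle\sum_{p_i \in OS_s} \frac{p_i - \sum_{l=1}^{N} w_{il}\, v_l^{(k)}}{\sum_{l=1}^{N} w_{il}}\, w_{ij}}
     {\displaystyle\sum_{p_i \in OS_s} w_{ij}}
\]

where p_i is the measured value of ray i, w_ij is the contribution of voxel j to ray i, and λ is the relaxation factor. A single subset containing all projections yields SIRT, one projection per subset yields SART, and one ray per subset yields the original ART.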

It is well known that SART converges much faster than SIRT, and a well-chosen λ can overcome the problems with streak artifacts and reconstruction noise, allowing it to produce good reconstruction results [1]. On the CPU, a faster rate of convergence translates directly into shorter reconstruction times. But, as we previously demonstrated [15], when it comes to acceleration on a streaming architecture such as the GPU, SART is not the fastest algorithm in terms of time performance. In fact, the time performance is inversely related to the number of subsets, making SIRT the faster scheme. This is due to the overhead incurred by the frequent context switching when repeatedly moving through the three iterative phases: projection, correction, and backprojection (see the schematic loop below). In the same work, we also demonstrated that specific subset sizes can optimize both reconstruction quality and performance. However, the influence of the relaxation factor λ had not been considered in these experiments.
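
The following host-side sketch (hypothetical CUDA-style code mirroring the multi-pass structure of a GLSL implementation; the kernel names and the osIteration helper are ours) shows where this overhead comes from: every subset ends with its own backprojection pass and a synchronization point, so an iteration over T subsets incurs T context switches, and with one projection per subset (SART) each pass also has the fewest threads to schedule.

    // Schematic ordered-subsets iteration; the phase kernels are empty stubs
    // standing in for the real projection/correction/backprojection passes.
    #include <cuda_runtime.h>

    __global__ void project(const float* vol, float* est) { /* phase 1 stub */ }
    __global__ void correct(const float* meas, float* est) { /* phase 2 stub */ }
    __global__ void backproject(float* vol, const float* est, float lambda) { /* phase 3 stub */ }

    void osIteration(float* dVol, float* dEst, float** dMeas,
                     int numSubsets, int projsPerSubset, float lambda)
    {
        for (int s = 0; s < numSubsets; ++s) {
            // Phases 1 and 2 run once per projection in the subset ...
            for (int p = 0; p < projsPerSubset; ++p) {
                int view = s * projsPerSubset + p;
                project<<<256, 256>>>(dVol, dEst);
                correct<<<256, 256>>>(dMeas[view], dEst);
            }
            // ... but phase 3 and the synchronization run once per subset,
            // so many small subsets mean many pipeline flushes per iteration.
            backproject<<<256, 256>>>(dVol, dEst, lambda);
            cudaDeviceSynchronize();
        }
    }

    int main() {
        const int nProj = 360;
        float *dVol, *dEst;
        cudaMalloc(&dVol, 512 * 512 * sizeof(float));
        cudaMalloc(&dEst, 512 * sizeof(float));
        float* dMeas[nProj];
        for (int i = 0; i < nProj; ++i) cudaMalloc(&dMeas[i], 512 * sizeof(float));

        osIteration(dVol, dEst, dMeas, 360, 1, 0.5f);   // SART: 360 subsets of 1 view
        osIteration(dVol, dEst, dMeas, 1, 360, 1.0f);   // SIRT: 1 subset of 360 views
        return 0;
    }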

In the research reported here we complete the study of [15] and also investigate the role of λ in GPU reconstruction speed, in relation to subset size. We find that an optimal choice of λ can have a great impact. Although SART is slower than SIRT per iteration, an optimized λ-setting can reduce the number of iterations required for convergence by more than enough to outweigh the extra per-iteration cost incurred by GPU data streaming and context switching. This reinstates SART and the small-subset variants of OS-SIRT as the fastest iterative CT reconstruction schemes on GPUs as well.

We note, however, that the optimal setting is likely application-dependent, which is not unique to the GPU setting alone. Here we make the reasonable assumption that a given application will always involve similar types of data, and thus an optimal parameter setting, once found, will likely be close to optimal for all data within that application setting. In that sense, our aim in this paper is not to provide optimal subset and relaxation factor settings for all types of data, but rather to raise awareness of this phenomenon and offer an explanation.

Our paper is structured as follows. First, in Section 2, we discuss iterative algorithms in the context of ordered subsets, present a generalization of SIRT to OS-SIRT, and describe their acceleration on the GPU. Then, in Section 3, we study the impact of both subset size and relaxation factor on GPU reconstruction performance and present our results. Finally, Section 4 ends with conclusions.

Section snippets

Iterative reconstruction and its acceleration on GPUs

In the following discussion we consider only algebraic reconstruction algorithms (SART, SIRT), but our arguments and conclusions readily extend to Expectation Maximization (EM) algorithms as well, since these are very similar with respect to their mapping to GPUs [14].

Experiments and results

All our experiments were conducted on an NVIDIA 8800 GTX GPU, programmed with GLSL. For the first set of experiments we used the 2D Barbara test image to evaluate the performance of the different reconstruction schemes described above. We chose this image, popular in the image processing literature, because it has several coherent regions with high-frequency detail, which provide a readily observable test of the fidelity of a given reconstruction scheme. The target 2D image is obtained by cropping

Conclusions

We have shown that iterative reconstruction methods used in medical imaging, such as EM or SIRT, have special properties when it comes to their acceleration on GPUs. While splitting the data used within each iterative update into a larger number of smaller subsets has long been known to offer faster convergence and computation speed on the CPU, it can be vastly slower on the GPU. This is a direct consequence of the thread fill rate in the projection and backprojection phases. Larger subsets

Conflict of interest

None declared.

Acknowledgements

This work was partially funded by NSF grant CCR-0306438, NIH grants R21 EB004099-01 and GM31627 (DAA), the Keck Foundation, and the Howard Hughes Medical Institute (DAA).


Cited by (41)

  • Scalable double regularization for 3D Nano-CT reconstruction

    2020, Journal of Petroleum Science and Engineering
    Citation Excerpt:

    Although they always demand more computational power, IR methods usually generate fewer artifacts than FBP. IR encompasses a variety of methods in the literature from an algebraic perspective, such as the algebraic reconstruction technique (ART) (Gordon et al., 1970) and its extensions (Andersen and Kak, 1984; Badea and Gordon, 2004), the simultaneous iterative reconstruction technique (SIRT) (Gilbert, 1972), the ordered subset simultaneous iterative reconstruction technique (OS-SIRT) (Xu et al., 2010), and iterative coordinate descent (Thibault et al., 2007). Another line of IR methods takes a statistical perspective based on maximum likelihood methods (Lange and Carson, 1984; Manglos et al., 1995).

  • A review of GPU-based medical image reconstruction

    2017, Physica Medica
    Citation Excerpt:

    IR methods are not as straightforward as analytical methods to adapt to the GPU programming model, and new strategies must be developed to achieve significant acceleration factors. Xu et al., for instance, found that strategies adopted to speed up calculations on the CPU, such as using a large number of small subsets in ordered subset algorithms, do not result in the same acceleration on the GPU [105]. Some algorithms are simply too sequential by design to benefit from GPU acceleration.

  • GPU-accelerated iterative reconstruction from Compton scattered data using a matched pair of conic projector and backprojector

    2016, Computer Methods and Programs in Biomedicine
    Citation Excerpt:

    The computation time spent on projection was 552 seconds for the CPU-based method and 3.0 seconds for the GPU-based methods. Note that the computation times of the GPU-based methods decrease as the number of subsets is decreased [19]. Note also that the computation time of GPU-AP1 is longer than that of GPU-AP2 because GPU-AP2 does not involve the approximated chord-length computation, which is the most time-consuming step in GPU-AP1.
