The Compute Unified Device Architecture (CUDA), introduced by NVIDIA in 2007, is a recent programming model that exploits the unified shader design of the latest graphics processing units (GPUs). Its programming interface allows algorithms to be implemented in standard C with a few extensions, without requiring any knowledge of graphics programming with OpenGL, DirectX, or shading languages. We apply this technology to the Simultaneous Algebraic Reconstruction Technique (SART), an advanced iterative image reconstruction method for cone-beam CT. So far, the computational complexity of this algorithm has prohibited its use in most medical applications. However, because today's GPUs offer a high level of parallelism and are highly cost-efficient processors, they are well suited to performing iterative reconstruction within medical requirements. In this paper we present an efficient implementation of the most time-consuming parts of the iterative reconstruction algorithm: forward projection and back-projection. We also explain the strategy required to parallelize the algorithm for the CUDA 1.1 and CUDA 2.0 architectures. Furthermore, our implementation introduces an acceleration technique that speeds up the reconstruction relative to a standard SART implementation on the GPU using CUDA. The result is an implementation that can be used in a time-critical clinical environment. Finally, we compare our results to current applications on multi-core workstations with respect to both reconstruction speed and their respective advantages and disadvantages. Using hardware-accelerated texture interpolation, our implementation achieves a speed-up of more than a factor of 64 over a state-of-the-art CPU.

© (2009) COPYRIGHT SPIE--The International Society for Optical Engineering. Downloading of the abstract is permitted for personal use only.