GPGPU |
General-Purpose Computation Using Graphics Hardware
|
IntroductionGPGPU stands for General-Purpose computation on GPUs. With the increasing programmability of commodity graphics processing units (GPUs), these chips are capable of performing more than the specific graphics computations for which they were designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications. The goal of this page is to catalog the current and historical use of GPUs for general-purpose computation.
|
Concurrent Number Cruncher: a GPU implementation of a general sparse linear solver A wide class of numerical methods needs to solve a linear system, where the matrix pattern of non-zero coefficients can be arbitrary. These problems can greatly benefit from highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU dedicated APIs with high performance computing strategies (namely block compressed row storage, register blocking and vectorization), to implement a sparse general-purpose linear solver. This implementation of the Jacobi-preconditioned Conjugate
Gradient algorithm outperforms by up to a factor of 6.0x leading-edge CPU counterparts, making it attractive for applications which are content with single precision. (Concurrent number cruncher - A GPU implementation of a general sparse linear solver. Luc Buatois, Guillaume Caumon and Bruno Lévy. International Journal of Parallel, Emergent and Distributed Systems. To Appear.)
Posted: 16 Oct 2008 [GPGPU /Scientific Computing/Numerical Algorithms] # Abstract:
In a previous publication, we have examined the fundamental difference between computational precision and result accuracy in the context of the iterative solution of linear systems as they typically arise in the Finite Element discretization of Partial Differential Equations (PDEs). In particular, we evaluated mixed- and emulated-precision schemes on commodity graphics processors (GPUs), which at that time only supported computations in single precision. With the advent of graphics cards that natively provide double precision, this report updates our previous results. We demonstrate that with new co-processor hardware supporting native double precision, such as NVIDIA's G200 and T10 architectures, the situation does not change qualitatively for PDEs, and the previously introduced mixed precision schemes are still preferable to double precision alone. But the schemes achieve significant quantitative performance improvements with the more powerful hardware. In particular, we demonstrate that a Multigrid scheme can accurately solve a common test problem in Finite Element settings with one million unknowns in less than 0.1 seconds, which is truely outstanding performance. We support these conclusions by exploring the algorithmic design space enlarged by the availability of double precision directly in the hardware. (Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double Precision GPUs). Dominik Göddeke and Robert Strzodka. Technical Report, 2008.)Posted: 14 Jul 2008 [GPGPU /Scientific Computing] # PRACE award presented to young scientist at ISC’08 for GPGPU work From this article: "PRACE, Partnership for Advanced Computing in Europe, awarded a prize for the best scientific paper submitted to ISC’08 by a European student or young scientist on petascaling. The authors of the award winning paper are Stefan Turek, Dominik Göddeke, Christian Becker, Sven H.M. Buijssen and Hilmar Wobker from the Institute of Applied Mathematics, Dortmund University of Technology, Germany. Their work, UCHPC – UnConventional High Performance Computing for Finite Element Simulations, was selected by the ISC’08 Award Committee, headed by Michael Resch, High Performance Computing Center Stuttgart. Achim Bachem, Chairman of the Board Forschungszentrum Jülich and PRACE coordinator presented the PRACE Award at the ISC’08 opening ceremony in Dresden on Wednesday, 18 June. Dominik Göddeke, Ph.D. student in the team of Professor Stefan Turek will receive a sponsorship for the participation in a conference relevant to Petascale computing." Dominik has been an active GPGPU researcher for several years, and is one of the most active and helpful contributors to the GPGPU.org forums.
(PRACE award presented to young scientist at ISC’08)
Posted: 20 Jun 2008 [GPGPU /Scientific Computing] # Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with FEASTGPU FEAST is a hardware-oriented MPI-based Finite Element solver toolkit. With the extension FEASTGPU the authors have previously demonstrated that significant speed-ups in the solution of the scalar Poisson problem can be achieved by the addition of GPUs as scientific co-processors to a commodity based cluster. In this paper the authors put the more general claim to the test: Applications based on FEAST, that ran only on CPUs so far, can be successfully accelerated on a co-processor enhanced cluster without any code modifications. The chosen solid mechanics code has higher accuracy requirements and a more diverse CPU/co-processor interaction than the Poisson example, and is thus better suited to assess the practicability of the acceleration approach. The paper presents accuracy experiments, a scalability test and acceleration results for different elastic objects under load. In particular, it demonstrates in detail that the single precision execution of the co-processor does not affect the final accuracy. The paper establishes how the local acceleration gains of factors 5.5 to 9.0 translate into 1.6- to 2.6-fold total speed-up. Subsequent analysis reveals which measures will increase these factors further. (Dominik Göddeke, Hilmar Wobker, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Stefan Turek. Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with FEASTGPU. International Journal of Computational Science and Engineering (to appear).)
Posted: 06 Jun 2008 [GPGPU /Scientific Computing] # GPU acceleration of cutoff pair potentials for molecular modeling applications The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. The paper presents algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low computational efficiency, a new strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition. (C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, W. W. Hwu.,
GPU acceleration of cutoff pair potentials for molecular modeling applications. Proceedings of the 2008 Conference On Computing Frontiers, pp.273-282, 2008.) (http://www.ks.uiuc.edu/Research/gpu/)
Posted: 25 May 2008 [GPGPU /Scientific Computing] # Abstract: "The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory andwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications. (J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, "GPU Computing", Proceedings of the IEEE, vol.96, no.5, pp.879-899, May 2008)
Posted: 25 May 2008 [GPGPU /Scientific Computing] # CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two biological sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. This paper by Svetlin Manavski and Giorgio Valle describes SmithWaterman-CUDA, an open-source project to perform fast sequence alignment on the GPU. Although the software performs the optimal Smith-Waterman alignment it is faster than heuristics approaches like FASTA and BLAST. The tests on protein data banks show up to 30x speed up related to reference CPU implementations. (Svetlin A. Manavski, Giorgio Valle, CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics 2008, 9(Suppl 2):S10 (26 March 2008))
Posted: 02 Apr 2008 [GPGPU /Scientific Computing] # A SIMD interpreter for Genetic Programming on GPU Graphics Cards Abstract: Mackey-Glass chaotic time series prediction and nuclear protein classification show the feasibility of evaluating genetic programming populations directly on parallel consumer gaming graphics processing units. Using a Linux KDE computer equipped with an nVidia GeForce 8800 GTX graphics processing unit card the C++ SPMD interpretter evolves programs at Giga GP operations per second (895 million GPops). We use the RapidMind general processing on GPU (GPGPU) framework to evaluate an entire population of a quarter of a million individual programs on a non-trivial problem in 4 seconds. An efficient reverse polish notation (RPN) tree based GP is given. (A SIMD interpreter for Genetic Programming on GPU Graphics Cards. W.B. Langdon and W. Banzhaf. In M. Neill, L. Vanneschi, A.I. Esparcia Alcazar, S. Gustafson eds., EuroGP 2008, pp73-85. Springer, LNCS 4971, 26-28 March, Naples.)
Posted: 02 Apr 2008 [GPGPU /Scientific Computing] # Ivan Ufimtsev and Todd Martínez at the University of Illinois at Urbana-Champaign have implemented an efficient method of calculating two-electron repulsion integrals over Gaussian basis functions on the GPU. Virtually all modern quantum chemical calculations require evaluating millions to billions of these integrals. This problem turns out to be well-suited to the massively parallel architecture of GPUs by an appropriate partitioning of the problem. A benchmark test performed for the evaluation of approximately one million (ss|ss) integrals over contracted s-orbitals showed that a naïve algorithm implemented on the GPU achieves up to 130-fold speedup over a traditional CPU implementation on an AMD Opteron. Subsequent calculations on a 256-atom DNA strand show that the GPU advantage is maintained for basis sets including higher angular momentum functions.
(Quantum Chemistry on Graphical Processing Units. 1. Strategies for
Two-Electron Integral Evaluation, Ivan S. Ufimtsev and Todd J. Martínez, J. Chem. Theory Comput., 4 (2), 222 -231, 2008. doi:10.1021/ct700268q)
Posted: 01 Apr 2008 [GPGPU /Scientific Computing] # In this paper we describe a modification of a general purpose code for quantum mechanical calculations of molecular properties (Q-Chem) to use a graphical processing unit. We report a 4.3x speedup of the resolution-of-the-identity second-order Møller-Plesset perturbation theory execution time for single point energy calculation of linear alkanes. Furthermore, we obtain the correlation and total energy for n-octane conformers as the torsional angle of central bond is rotated to show that precision is not lost for these types of calculations. This code modification is accomplished using the NVIDIA CUDA Basic Linear Algebra Subprograms (CUBLAS) library for an NVIDIA Quadro FX 5600 graphics card. Finally, we anticipate further speedups of other matrix algebra based electronic structure calculations using a similar approach. (Accelerating Resolution-of-the-Identity Second-Order Møller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. Vogt, L., Olivares-Amaya, R., Kermes, S., Shao, Y., Amador-Bedolla, C., and Aspuru-Guzik, A. J. Phys. Chem. A, 2008, DOI: 10.1021/jp0776762)
Posted: 10 Feb 2008 [GPGPU /Scientific Computing] # High-throughput sequence alignment using Graphics Processing Units The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. by University of Maryland researchers Michael Schatz, Cole Trapnell, Art Delcher, and Amitabh Varshney describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on GPUs. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, despite the very low arithmetic intensity of the task. (High-throughput sequence alignment using Graphics Processing Units, Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007), BMC Bioinformatics 8:474.)
Posted: 10 Feb 2008 [GPGPU /Scientific Computing] # SIGGRAPH Poster: Extended-Precision Floating-Point Numbers for GPU Computation Using unevaluated sums of paired or quadrupled single-precision (f32) values,
double-float (df64) and quad-float (qf128) numeric types can be implemented
on current GPUs and used efficiently and effectively for extended-precision computation for real and complex arithmetic. These numeric types provide 48
and 96 bits of precision respectively at f32 exponent ranges for computer
raphics and general purpose (GPGPU) programming. Double- and quad-floats may
be useful not only for extending available precision but also for accurate computation by only partially IEEE compliant single-precision floats. The poster and demos presented at ACM SIGGRAPH 06 discussed the implementation and application of these numbers in the Cg language for real and complex GPU
programming. The df64 library includes math routines for exponential, log, and trigonometric functions. The poster can be downloaded from Andrew Thall's website. Technical details will be available shortly, and the code itself will be made
available for distribution given sufficient interest.
Posted: 24 Jan 2008 [GPGPU /Scientific Computing/Numerical Algorithms] # Abstract: "The widespread usage of the Discrete Wavelet Transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank (FBS) and Lifting (LS), and have always concluded that Lifting is the most efficient option. However, there is no such study on streaming processors such as modern
Graphic Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current generation GPUs. In our experiments, the actual FBS gains range between 10% and 140%, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future generation GPUs. (Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting. Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, and Francisco Tirado. IEEE Transactions on Parallel and Distributed Systems ,vol. 19, no. 3, pp. 299-310, March, 2008. )
Posted: 24 Jan 2008 [GPGPU /Scientific Computing] # Toward efficient GPU-accelerated N-body simulations Abstract: "N-body algorithms are applicable to a number of common problems in computational physics including gravitation, electrostatics, and fluid dynamics. Fast algorithms (those with better than O(N2) performance) exist, but have not been successfully implemented on GPU hardware for practical problems. In the present work, we introduce not only best-in-class performance for a multipole-accelerated treecode method, but a series of improvements that support implementation of this solver on highly-data-parallel graphics processing units (GPUs). The greatly reduced computation times suggest that this problem is ideally suited for the current and next generations of single and cluster CPU-GPU architectures. We believe that this is an ideal method for practical computation of largescale turbulent flows on future supercomputing hardware using parallel vortex particle methods. (Mark J. Stock and Adrin Gharakhani, "Toward efficient GPU-accelerated N-body simulations," in 46th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2008-608, January 2008, Reno, Nevada.)
Posted: 17 Jan 2008 [GPGPU /Scientific Computing] # Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware Abstract: "The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described. The motivation for such an effort is the impressive performance that GPUs offer: typically 10 times more floating point operations per second than a modern CPU, with over 100 processing cores and all at a very modest financial cost. Both codes were found to generate the same results on the GPU as the FORTRAN versions did on the CPU. The 2D solver ran up to 29 times quicker on the GPU than on the CPU; the 3D solver 16 times faster." (Tobias Brandvik and Graham Pullan, Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. 46th AIAA Aerospace Sciences Meeting and Exhibit. January, 2008.)
Posted: 17 Jan 2008 [GPGPU /Scientific Computing] # Interactive Simulation of Large Scale Agent-Based Models (ABMs) on the GPU This article by D’Souza et al. explores large scale Agent-Based Model(ABM) simulation on the GPU. Agent-based modeling is a technique which has become increasingly popular for simulating complex natural phenomena such as swarms and biological cell colonies. An ABM describes a dynamic system by representing it as a collection of communicating, concurrent objects. Current ABM simulation toolkits and algorithms use discrete event simulation techniques and are executed serially on a CPU. This limits the size of the models that can be handled efficiently. In this paper we present a series of efficient data-parallel algorithms for simulating ABMs. These include methods for handling environment updates, agent interactions and replication. Important techniques presented in this work include a novel stochastic allocator which enables parallel agent replication in O(1) average time and an iterative method to handle collision among agents in the spatial domain. These techniques have been implemented on a modern GPU (GeForce 8800GTX), resulting in a substantial performance increase. The authors believe that their system is the first completely GPU-based ABM simulation framework. (D’Souza R., Lysenko, M., Rahmani, K., SugarScape on steroids: simulating over a million agents at interactive rates. Proceedings of the Agent2007 conference, Chicago, IL. 2007.)
Posted: 16 Jan 2008 [GPGPU /Scientific Computing/Dynamics Simulation] # Exploring weak scalability for FEM calculations on a GPU-enhanced cluster The first part of this paper by Goeddeke et al. surveys co-processor
approaches for commodity based clusters in general, not only with
respect to raw performance, but also in view of their system integration
and power consumption. We then extend previous work on a small GPU
cluster by exploring the heterogeneous hardware approach for a
large-scale system with up to 160 nodes. Starting with a conventional
commodity based cluster we leverage the high bandwidth of graphics
processing units (GPUs) to increase the overall system bandwidth that is
the decisive performance factor in this scenario. Thus, even the
addition of low-end, out of date GPUs leads to improvements in both
performance- and power-related metrics.
(Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek.
Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)
Posted: 15 Nov 2007 [GPGPU /Scientific Computing] # Using GPUs to Improve Multigrid Solver Performance on a Cluster This article by Goeddeke et al. explores the coupling of coarse and
fine-grained parallelism for Finite Element simulations based on
efficient parallel multigrid solvers. The focus lies on both system
performance and a minimally invasive integration of hardware
acceleration into an existing software package, requiring no changes to
application code. Because of their excellent price performance ratio, we
demonstrate the viability of our approach by using commodity graphics
processors (GPUs) as efficient multigrid preconditioners. We address the
issue of limited precision on GPUs by applying a mixed precision,
iterative refinement technique. Other restrictions are also handled by a
close interplay between the GPU and CPU. From a software perspective, we
integrate the GPU solvers into the existing MPI-based Finite Element
package by implementing the same interfaces as the CPU solvers, so that
for the application programmer they are easily interchangeable. Our
results show that we do not compromise any software functionality and
gain speedups of two and more for large problems. Equipped with this
additional option of hardware acceleration we compare different choices
in increasing the performance of a conventional, commodity based cluster
by increasing the number of nodes, replacement of nodes by a newer
technology generation, and adding powerful graphics cards to the
existing nodes.
(Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek.
Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
Posted: 15 Nov 2007 [GPGPU /Scientific Computing] # Toward Acceleration of RSA Using 3D Graphics Hardware This paper
by Moss et. al shows an implementation of multi-precision arithmetic running on a 7800-GTX. The paper shows how to compute the modular
exponentiation of large integers (a central operation in the RSA cryptosystem) using the restricted control flow available on a DX9
card. Both the background number theory used to express the problem in a suitable way for a streaming architecture, and the program
transformation techniques used to generate the GLSL code are described in detail. Surprisingly (given the unusual nature of the problem for GPGPU) the GPU is capable of out-performing the CPU over a large enough dataset by a factor of 2x-3x depending on the CPU implementation. Unfortunately the immature state of the GLSL compiler prevents a further 2x improvement by allocating too many registers, and the large latency for setting the problem up means that over 800 exponentiations need to be performed to break-even against the CPU.
(Andrew Moss, Dan Page and Nigel Smart. Toward Acceleration of RSA Using 3D Graphics Hardware. In: LNCS 4887, pages 369--388. Springer, December 2007.)
Posted: 14 Nov 2007 [GPGPU /Scientific Computing/Numerical Algorithms] # This paper by Anderson et al at Caltech describes a method to use GPUs to accelerate Quantum Monte Carlo on a GPU. QMC is among the most
accurate (and expensive) methods in the quantum chemistry zoo. Primarily, this involves the investigation of tricks available to this
algorithm to speed up matrix multiplication. That is, as a statistical algorithm, the authors studied the performance enhancements available when
multiplying many matrices simultaneously. Additionally, the paper explores the Kahan Summation Formula to improve the accuracy of GPU
matrix multiplication. (Quantum Monte Carlo on Graphical Processing Units. Amos G. Anderson, William A Goddard III, Peter Schroder. Computer Physics Communications)
Posted: 10 Sep 2007 [GPGPU /Scientific Computing] # |
Categories
|