GPGPU
General-Purpose Computation Using Graphics Hardware
Introduction
GPGPU stands for General-Purpose computation on GPUs. With the increasing programmability of commodity graphics processing units (GPUs), these chips are capable of performing more than the specific graphics computations for which they were designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications. The goal of this page is to catalog the current and historical use of GPUs for general-purpose computation.
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two biological sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. This paper by Svetlin Manavski and Giorgio Valle describes SmithWaterman-CUDA, an open-source project to perform fast sequence alignment on the GPU. Although the software performs the optimal Smith-Waterman alignment, it is faster than heuristic approaches like FASTA and BLAST. Tests on protein data banks show up to a 30x speedup relative to reference CPU implementations. (Svetlin A. Manavski, Giorgio Valle, CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics 2008, 9(Suppl 2):S10 (26 March 2008))
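As a rough illustration of the dynamic programming recurrence described above, here is a minimal serial sketch in Python. It is not the SmithWaterman-CUDA implementation: the function name, the linear gap penalty, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for readability. The nested loop makes the cost proportional to the product of the two sequence lengths visible.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not those of the paper.
    """
    # Only the previous row is needed when we want the score alone.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Cells on the same anti-diagonal of the matrix do not depend on each other, and independent query sequences can be scored concurrently; both are common sources of parallelism for this algorithm.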
Posted: 02 Apr 2008 [GPGPU /Scientific Computing]
A SIMD interpreter for Genetic Programming on GPU Graphics Cards
Abstract: Mackey-Glass chaotic time series prediction and nuclear protein classification show the feasibility of evaluating genetic programming populations directly on parallel consumer gaming graphics processing units. Using a Linux KDE computer equipped with an nVidia GeForce 8800 GTX graphics processing unit card the C++ SPMD interpreter evolves programs at Giga GP operations per second (895 million GPops). We use the RapidMind general processing on GPU (GPGPU) framework to evaluate an entire population of a quarter of a million individual programs on a non-trivial problem in 4 seconds. An efficient reverse polish notation (RPN) tree based GP is given. (A SIMD interpreter for Genetic Programming on GPU Graphics Cards. W.B. Langdon and W. Banzhaf. In M. Neill, L. Vanneschi, A.I. Esparcia Alcazar, S. Gustafson eds., EuroGP 2008, pp. 73-85. Springer, LNCS 4971, 26-28 March, Naples.)
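To make the reverse polish notation idea concrete, the following is a small serial sketch of a stack-based RPN interpreter applied to a population of programs. It is only an illustration: the token set, the `OPS` table, and the function names are hypothetical, and the paper's interpreter runs on the GPU through RapidMind rather than in Python.

```python
import operator

# Hypothetical function set; a real GP system would define its own.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_rpn(program, x):
    """Evaluate one RPN program (a list of tokens) for input value x."""
    stack = []
    for token in program:
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token == "x":
            stack.append(x)
        else:
            stack.append(float(token))  # numeric constant
    return stack.pop()


def eval_population(population, x):
    """Run the same interpreter loop over every program in the population."""
    return [eval_rpn(program, x) for program in population]
```

For example, `eval_rpn(["x", "x", "*", "1", "+"], 3.0)` computes x*x + 1. Because every individual is processed by the same interpreter code, the population loop is a natural candidate for data-parallel execution.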
Posted: 02 Apr 2008 [GPGPU /Scientific Computing]
Ivan Ufimtsev and Todd Martínez at the University of Illinois at Urbana-Champaign have implemented an efficient method of calculating two-electron repulsion integrals over Gaussian basis functions on the GPU. Virtually all modern quantum chemical calculations require evaluating millions to billions of these integrals. This problem turns out to be well-suited to the massively parallel architecture of GPUs by an appropriate partitioning of the problem. A benchmark test performed for the evaluation of approximately one million (ss|ss) integrals over contracted s-orbitals showed that a naïve algorithm implemented on the GPU achieves up to 130-fold speedup over a traditional CPU implementation on an AMD Opteron. Subsequent calculations on a 256-atom DNA strand show that the GPU advantage is maintained for basis sets including higher angular momentum functions.
(Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation, Ivan S. Ufimtsev and Todd J. Martínez, J. Chem. Theory Comput., 4 (2), 222-231, 2008. doi:10.1021/ct700268q)
Posted: 01 Apr 2008 [GPGPU /Scientific Computing]
In this paper we describe a modification of a general purpose code for quantum mechanical calculations of molecular properties (Q-Chem) to use a graphical processing unit. We report a 4.3x speedup of the resolution-of-the-identity second-order Møller-Plesset perturbation theory execution time for single point energy calculation of linear alkanes. Furthermore, we obtain the correlation and total energy for n-octane conformers as the torsional angle of central bond is rotated to show that precision is not lost for these types of calculations. This code modification is accomplished using the NVIDIA CUDA Basic Linear Algebra Subprograms (CUBLAS) library for an NVIDIA Quadro FX 5600 graphics card. Finally, we anticipate further speedups of other matrix algebra based electronic structure calculations using a similar approach. (Accelerating Resolution-of-the-Identity Second-Order Møller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. Vogt, L., Olivares-Amaya, R., Kermes, S., Shao, Y., Amador-Bedolla, C., and Aspuru-Guzik, A. J. Phys. Chem. A, 2008, DOI: 10.1021/jp0776762)
Posted: 10 Feb 2008 [GPGPU /Scientific Computing]
High-throughput sequence alignment using Graphics Processing Units
The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. This paper by University of Maryland researchers Michael Schatz, Cole Trapnell, Art Delcher, and Amitabh Varshney describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on GPUs. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, despite the very low arithmetic intensity of the task. (High-throughput sequence alignment using Graphics Processing Units, Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007), BMC Bioinformatics 8:474.)
Posted: 10 Feb 2008 [GPGPU /Scientific Computing]
Abstract: "The widespread usage of the Discrete Wavelet Transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank (FBS) and Lifting (LS), and have always concluded that Lifting is the most efficient option. However, there is no such study on streaming processors such as modern Graphic Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current generation GPUs. In our experiments, the actual FBS gains range between 10% and 140%, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future generation GPUs." (Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting. Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, and Francisco Tirado. IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 3, pp. 299-310, March 2008.)
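The difference between the two schemes can be sketched on the simplest wavelet. The example below computes one level of a 1D Haar transform (unnormalized averages and differences) both ways. The Haar filter, the function names, and the even-length input are assumptions made for illustration; the paper studies 2D transforms with longer filters.

```python
def haar_filter_bank(x):
    """Filter bank scheme: each output is computed directly from the input."""
    half = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / 2 for k in range(half)]
    detail = [x[2 * k + 1] - x[2 * k] for k in range(half)]
    return approx, detail


def haar_lifting(x):
    """Lifting scheme: split, predict, then update using the predicted values."""
    even = list(x[0::2])
    odd = list(x[1::2])
    # Predict step: the detail is the error of predicting odd from even.
    detail = [o - e for e, o in zip(even, odd)]
    # Update step: correct the even samples so they become pairwise averages.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail
```

Both functions return the same coefficients. Lifting performs fewer arithmetic operations, but its update step depends on the result of the predict step, whereas every filter bank output depends only on the input samples.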
Posted: 24 Jan 2008 [GPGPU /Scientific Computing]
Toward efficient GPU-accelerated N-body simulations
Abstract: "N-body algorithms are applicable to a number of common problems in computational physics including gravitation, electrostatics, and fluid dynamics. Fast algorithms (those with better than O(N^2) performance) exist, but have not been successfully implemented on GPU hardware for practical problems. In the present work, we introduce not only best-in-class performance for a multipole-accelerated treecode method, but a series of improvements that support implementation of this solver on highly-data-parallel graphics processing units (GPUs). The greatly reduced computation times suggest that this problem is ideally suited for the current and next generations of single and cluster CPU-GPU architectures. We believe that this is an ideal method for practical computation of large-scale turbulent flows on future supercomputing hardware using parallel vortex particle methods." (Mark J. Stock and Adrin Gharakhani, "Toward efficient GPU-accelerated N-body simulations," in 46th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2008-608, January 2008, Reno, Nevada.)
Posted: 17 Jan 2008 [GPGPU /Scientific Computing]
Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware
Abstract: "The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described. The motivation for such an effort is the impressive performance that GPUs offer: typically 10 times more floating point operations per second than a modern CPU, with over 100 processing cores and all at a very modest financial cost. Both codes were found to generate the same results on the GPU as the FORTRAN versions did on the CPU. The 2D solver ran up to 29 times quicker on the GPU than on the CPU; the 3D solver 16 times faster." (Tobias Brandvik and Graham Pullan, Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. 46th AIAA Aerospace Sciences Meeting and Exhibit. January, 2008.)
Posted: 17 Jan 2008 [GPGPU /Scientific Computing]
Interactive Simulation of Large Scale Agent-Based Models (ABMs) on the GPU
This article by D’Souza et al. explores large scale Agent-Based Model (ABM) simulation on the GPU. Agent-based modeling is a technique which has become increasingly popular for simulating complex natural phenomena such as swarms and biological cell colonies. An ABM describes a dynamic system by representing it as a collection of communicating, concurrent objects. Current ABM simulation toolkits and algorithms use discrete event simulation techniques and are executed serially on a CPU. This limits the size of the models that can be handled efficiently. In this paper we present a series of efficient data-parallel algorithms for simulating ABMs. These include methods for handling environment updates, agent interactions and replication. Important techniques presented in this work include a novel stochastic allocator which enables parallel agent replication in O(1) average time and an iterative method to handle collision among agents in the spatial domain. These techniques have been implemented on a modern GPU (GeForce 8800GTX), resulting in a substantial performance increase. The authors believe that their system is the first completely GPU-based ABM simulation framework. (D’Souza R., Lysenko, M., Rahmani, K., SugarScape on steroids: simulating over a million agents at interactive rates. Proceedings of the Agent2007 conference, Chicago, IL. 2007.)
Posted: 16 Jan 2008 [GPGPU /Scientific Computing/Dynamics Simulation]
Exploring weak scalability for FEM calculations on a GPU-enhanced cluster
The first part of this paper by Goeddeke et al. surveys co-processor
approaches for commodity based clusters in general, not only with
respect to raw performance, but also in view of their system integration
and power consumption. We then extend previous work on a small GPU
cluster by exploring the heterogeneous hardware approach for a
large-scale system with up to 160 nodes. Starting with a conventional
commodity based cluster we leverage the high bandwidth of graphics
processing units (GPUs) to increase the overall system bandwidth that is
the decisive performance factor in this scenario. Thus, even the
addition of low-end, out of date GPUs leads to improvements in both
performance- and power-related metrics.
(Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek.
Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)
Posted: 15 Nov 2007 [GPGPU /Scientific Computing]
Using GPUs to Improve Multigrid Solver Performance on a Cluster
This article by Goeddeke et al. explores the coupling of coarse and
fine-grained parallelism for Finite Element simulations based on
efficient parallel multigrid solvers. The focus lies on both system
performance and a minimally invasive integration of hardware
acceleration into an existing software package, requiring no changes to
application code. We demonstrate the viability of our approach by using
commodity graphics processors (GPUs), chosen for their excellent
price-performance ratio, as efficient multigrid preconditioners. We address the
issue of limited precision on GPUs by applying a mixed precision,
iterative refinement technique. Other restrictions are also handled by a
close interplay between the GPU and CPU. From a software perspective, we
integrate the GPU solvers into the existing MPI-based Finite Element
package by implementing the same interfaces as the CPU solvers, so that
for the application programmer they are easily interchangeable. Our
results show that we do not compromise any software functionality and
gain speedups of two and more for large problems. Equipped with this
additional option of hardware acceleration we compare different choices
in increasing the performance of a conventional, commodity based cluster
by increasing the number of nodes, replacement of nodes by a newer
technology generation, and adding powerful graphics cards to the
existing nodes.
(Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek.
Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
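The mixed precision, iterative refinement idea mentioned above can be sketched as follows: residuals and the running solution are kept in double precision, while the correction is computed by a solver that works only in single precision. This is a hedged illustration, not the authors' code. The inner solver here is plain Gaussian elimination without pivoting (the article uses multigrid on the GPU), single precision is emulated on the CPU with `struct`, and all function names are hypothetical.

```python
import struct


def f32(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def solve_f32(A, b):
    """Low-precision inner solver: elimination with every result rounded to float32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, b, x):
    """Residual b - A x, computed in double precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=5):
    """Accumulate single-precision corrections into a double-precision solution."""
    x = [0.0] * len(b)
    for _ in range(iterations):
        r = residual(A, b, x)      # high precision
        d = solve_f32(A, r)        # low precision correction
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

For a well-conditioned system, each pass shrinks the error by roughly the accuracy of the inner solver, so a few iterations recover close to double-precision accuracy even though the expensive solve runs in single precision.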
Posted: 15 Nov 2007 [GPGPU /Scientific Computing]
Toward Acceleration of RSA Using 3D Graphics Hardware
This paper by Moss et al. shows an implementation of multi-precision arithmetic running on a 7800-GTX. The paper shows how to compute the modular
exponentiation of large integers (a central operation in the RSA cryptosystem) using the restricted control flow available on a DX9
card. Both the background number theory used to express the problem in a suitable way for a streaming architecture, and the program
transformation techniques used to generate the GLSL code are described in detail. Surprisingly (given the unusual nature of the problem for GPGPU), the GPU is capable of out-performing the CPU over a large enough dataset by a factor of 2x-3x, depending on the CPU implementation. Unfortunately, the immature state of the GLSL compiler prevents a further 2x improvement by allocating too many registers, and the large latency for setting the problem up means that over 800 exponentiations need to be performed to break even against the CPU.
(Andrew Moss, Dan Page and Nigel Smart. Toward Acceleration of RSA Using 3D Graphics Hardware. In: LNCS 4887, pages 369--388. Springer, December 2007.)
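For readers unfamiliar with the operation being accelerated, modular exponentiation can be sketched with the textbook square-and-multiply method below. This is an assumption-laden illustration only: it relies on Python's built-in arbitrary-precision integers, whereas the paper reformulates the arithmetic for a streaming architecture and generates GLSL. The function name is hypothetical.

```python
def mod_exp(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by binary square-and-multiply."""
    result = 1 % modulus          # handles modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:          # current exponent bit is set: multiply
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result
```

The loop runs once per bit of the exponent, so a 1024-bit exponent needs about a thousand modular squarings rather than an astronomically long chain of multiplications.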
Posted: 05 Nov 2007 [GPGPU /Scientific Computing/Numerical Algorithms]
This paper by Anderson et al. at Caltech describes a method to accelerate Quantum Monte Carlo (QMC) on a GPU. QMC is among the most
accurate (and expensive) methods in the quantum chemistry zoo. The work primarily investigates tricks available to this
algorithm to speed up matrix multiplication: because QMC is a statistical algorithm, the authors studied the performance enhancements available when
multiplying many matrices simultaneously. Additionally, the paper explores the Kahan Summation Formula to improve the accuracy of GPU
matrix multiplication. (Quantum Monte Carlo on Graphical Processing Units. Amos G. Anderson, William A Goddard III, Peter Schroder. Computer Physics Communications)
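The Kahan Summation Formula mentioned above carries a running compensation term that recovers the low-order bits lost in each addition. The sketch below shows it next to a naive loop; it operates on Python doubles for simplicity, whereas the paper applies the idea to single-precision GPU matrix multiplication. The function names are illustrative.

```python
def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: track the rounding error of each addition."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # apply the correction from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost (negated)
        total = t
    return total
```

Summing 1.0 followed by many terms of size 1e-16 shows the effect: the naive loop never leaves 1.0 because each term is below half a unit in the last place, while the compensated version accumulates the small terms.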
Posted: 10 Sep 2007 [GPGPU /Scientific Computing]
Graphic processors to speed-up simulations for the design of high performance solar receptors
This paper by Collange et al. at Université de Perpignan, France, describes a prototype to be
integrated into simulation codes that estimate temperature, velocity and
pressure to design next generation solar receptors. Such codes delegate to
GPUs the computation of heat transfer due to radiation. The authors use
Monte-Carlo line-by-line ray-tracing through finite volumes. This means
data-parallel arithmetic transformations on large data structures. Performance
results on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL)
show speedups higher than 400 compared to CPU implementations, leaving
most CPU computing resources available. As there were some questions
pending about the accuracy of the operators implemented in GPUs, the authors
start this report with a survey and some contributed tests on the various floating
point units available on GPUs. (Graphic
processors to speed-up simulations for the design of high performance solar
receptors. S. Collange, M. Daumas, D. Defour. Proceedings of the IEEE
18th International Conference on Application-specific Systems, Architectures
and Processors.)
Posted: 04 Sep 2007 [GPGPU /Scientific Computing]
Two-electron Integral Evaluation on the Graphics Processor Unit
Abstract: We propose the algorithm to evaluate the Coulomb potential in the ab initio density functional calculation on the graphics processor unit (GPU). The numerical accuracy required for the algorithm is investigated in detail. It is shown that GPU, which supports only the single-precision floating number natively, can take part in the major computational tasks. Because of the limited size of the working memory, the Gauss-Rys quadrature to evaluate the electron repulsion integrals (ERIs) is investigated in detail. The error analysis of the quadrature is performed. New interpolation formula of the roots and weights is presented, which is suitable for the processor of the single-instruction multiple-data type. It is proposed to calculate only small ERIs on GPU. ERIs can be classified efficiently with the upper-bound formula. The algorithm is implemented on NVIDIA GeForce 8800 GTX and the Gaussian 03 program suite. It is applied to the test molecules Taxol and Valinomycin. The total energies calculated are essentially the same as the reference ones. The preliminary results show the considerable speedup over the commodity microprocessor. (Two-electron integral evaluation on the graphics processor unit. Koji Yasuda. Journal of Computational Chemistry. July 5, 2007.)
Posted: 16 Aug 2007 [GPGPU /Scientific Computing]
Accelerating molecular modeling applications with graphics processors
In this paper, an overview of recent advances in programmable GPUs
is presented, with an emphasis on their application to molecular
mechanics simulations and the programming techniques required to obtain
optimal performance in these cases. We demonstrate the use of GPUs
for the calculation of long-range electrostatics and nonbonded forces
for molecular dynamics simulations. The application of GPU acceleration
to biomolecular simulation is also demonstrated through the use of
GPU-accelerated Coulomb-based ion placement and calculation of time-averaged
potentials from molecular dynamics trajectories. A novel approximation to
Coulomb potential calculation, the multilevel summation method, is introduced
and compared to direct Coulomb summation. In light of the performance
obtained for this set of calculations, future applications of graphics
processors to molecular dynamics simulations are discussed.
(Accelerating molecular modeling applications with graphics processors,
John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy,
Leonardo G. Trabuco, and Klaus Schulten.
Journal of Computational Chemistry (In press))
Posted: 10 Aug 2007 [GPGPU /Scientific Computing]
Graphic-Card Cluster for Astrophysics (GraCCA) -- Performance Tests
Abstract: "In this paper, we describe the architecture and performance
of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations.
It consists of 16 nodes, with each node equipped with 2 modern graphic cards,
the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical
performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics
computation, we have implemented a parallel direct N-body simulation program
with shared time-step algorithm in this system. Our system achieves a measured
performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a
globular cluster of 1024K particles. In comparing with the GRAPE-6A cluster at
RIT (Rochester Institute of Technology), the GraCCA system achieves a more than
twice higher measured speed and an even higher performance-per-dollar ratio.
Moreover, our system can handle up to 320M particles and can serve as a
general-purpose computing cluster for a wide range of astrophysics problems."
(Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh.
Graphic-Card Cluster for Astrophysics (GraCCA) -- Performance Tests.
submitted to New Astronomy, 20 July, 2007.)
Posted: 27 Jul 2007 [GPGPU /Scientific Computing]
Abstract: "We present the results of gravitational direct N-body simulations using the Graphics Processing Unit (GPU)
on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is
implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to speed-up the calculations. We tested the
implementation on three different N-body codes: two direct N-body integration codes, using the 4th order
predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order
leapfrog integration scheme. The integration of the equations of motions for all codes is performed on the host CPU.
We find that for N > 512 particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is
accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude
that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step
criterion, the total energy of the N-body system was conserved better than to one in 10^6 on the GPU, only about an
order of magnitude worse than obtained with GRAPE-6Af. For N ≳ 10^5 the 8800GTX outperforms the host CPU by a factor
of about 100 and runs at about the same speed as the GRAPE-6Af."
(Robert G. Belleman, Jeroen Bedorf, Simon Portegies Zwart. High Performance
Direct Gravitational N-body Simulations on Graphics Processing Units -- II: An implementation in CUDA.
Accepted for publication in New Astronomy.)
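The force evaluation that the abstract describes offloading to the GPU can be sketched as a direct all-pairs sum with softening. The version below is a serial Python illustration, not the authors' CUDA kernel: the function name, the Plummer-style softening form, and the default values of `softening` and `G` are assumptions.

```python
import math


def accelerations(pos, mass, softening=1e-2, G=1.0):
    """Direct O(N^2) gravitational accelerations with softened distances.

    pos is a list of [x, y, z] positions, mass a list of masses.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    eps2 = softening * softening
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            # Softening keeps the force finite for very close encounters.
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc
```

The outer loop over `i` has no dependencies between iterations, which is why one particle per thread is a common mapping for this computation. In the setup described above, the time integration itself stays on the host CPU.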
Posted: 27 Jul 2007 [GPGPU /Scientific Computing]
A Fast Implementation of the Octagon Abstract Domain on Graphics Hardware
This paper by Banterle
and Giacobazzi at Università degli Studi di Verona presents an efficient
implementation of the Octagon Abstract Domain (OAD) on graphics hardware.
OAD is a relational numerical abstract domain which approximates
invariants as conjunctions of constraints of the form +/- x +/- y <= c,
where x and y are program variables and c is a constant which can be an
integer, rational or real. OAD has been used with success in the aerospace
industry for analyzing C programs such as the flight control software for
the Airbus A340 fly-by-wire system.
(A Fast Implementation of the Octagon Abstract Domain on Graphics Hardware.
Francesco Banterle and Roberto Giacobazzi. Proceedings of the 14th
International Static Analysis Symposium (SAS). 2007)
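To make the constraint form concrete, the sketch below represents an octagon as a list of constraints +/- x +/- y <= c and checks whether a concrete program state satisfies all of them. The tuple encoding and the function name are hypothetical, and this shows only what the domain expresses, not the paper's GPU implementation or the operations an analyzer performs on the domain.

```python
def satisfies(constraints, env):
    """Check a state against octagonal constraints.

    Each constraint is (sx, x, sy, y, c) with sx, sy in {+1, -1},
    meaning sx * env[x] + sy * env[y] <= c.
    """
    return all(sx * env[x] + sy * env[y] <= c
               for (sx, x, sy, y, c) in constraints)


# Example invariant: x - y <= 2, x + y <= 10, -x - y <= 0
invariant = [
    (1, "x", -1, "y", 2),
    (1, "x", 1, "y", 10),
    (-1, "x", -1, "y", 0),
]
```

With this invariant, the state `{"x": 3, "y": 2}` is accepted, while `{"x": 6, "y": 1}` is rejected because x - y = 5 exceeds 2.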
Posted: 14 Jul 2007 [GPGPU /Scientific Computing/Mathematics]
Lattice QCD as a video game (GPGPU for quantum field theory)
This paper outlines how GPGPU techniques can be used for Monte
Carlo simulations of quantum field theories such as QCD. The speedup
is around a factor of 4-10 depending on the GPU model relative to
SSE optimized code on a Pentium 4. Sample code is also given.
(Lattice QCD as a video game)
Posted: 14 Jul 2007 [GPGPU /Scientific Computing]