GPGPU
General-Purpose Computation Using Graphics Hardware
Introduction
GPGPU stands for General-Purpose computation on GPUs. With the increasing programmability of commodity graphics processing units (GPUs), these chips are capable of performing more than the specific graphics computations for which they were designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications. The goal of this page is to catalog the current and historical use of GPUs for general-purpose computation.
Abstract:
In a previous publication, we have examined the fundamental difference between computational precision and result accuracy in the context of the iterative solution of linear systems as they typically arise in the Finite Element discretization of Partial Differential Equations (PDEs). In particular, we evaluated mixed- and emulated-precision schemes on commodity graphics processors (GPUs), which at that time only supported computations in single precision. With the advent of graphics cards that natively provide double precision, this report updates our previous results. We demonstrate that with new co-processor hardware supporting native double precision, such as NVIDIA's G200 and T10 architectures, the situation does not change qualitatively for PDEs, and the previously introduced mixed precision schemes are still preferable to double precision alone. But the schemes achieve significant quantitative performance improvements with the more powerful hardware. In particular, we demonstrate that a Multigrid scheme can accurately solve a common test problem in Finite Element settings with one million unknowns in less than 0.1 seconds, which is truly outstanding performance. We support these conclusions by exploring the algorithmic design space enlarged by the availability of double precision directly in the hardware. (Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double Precision GPUs). Dominik Göddeke and Robert Strzodka. Technical Report, 2008.)
Posted: 14 Jul 2008 [GPGPU /Scientific Computing] # PRACE award presented to young scientist at ISC’08 for GPGPU work From this article: "PRACE, Partnership for Advanced Computing in Europe, awarded a prize for the best scientific paper submitted to ISC’08 by a European student or young scientist on petascaling. The authors of the award winning paper are Stefan Turek, Dominik Göddeke, Christian Becker, Sven H.M. Buijssen and Hilmar Wobker from the Institute of Applied Mathematics, Dortmund University of Technology, Germany. Their work, UCHPC – UnConventional High Performance Computing for Finite Element Simulations, was selected by the ISC’08 Award Committee, headed by Michael Resch, High Performance Computing Center Stuttgart. Achim Bachem, Chairman of the Board Forschungszentrum Jülich and PRACE coordinator presented the PRACE Award at the ISC’08 opening ceremony in Dresden on Wednesday, 18 June. Dominik Göddeke, Ph.D. student in the team of Professor Stefan Turek will receive a sponsorship for the participation in a conference relevant to Petascale computing." Dominik has been an active GPGPU researcher for several years, and is one of the most active and helpful contributors to the GPGPU.org forums.
(PRACE award presented to young scientist at ISC’08)
Posted: 20 Jun 2008 [GPGPU /Scientific Computing] # Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with FEASTGPU FEAST is a hardware-oriented MPI-based Finite Element solver toolkit. With the extension FEASTGPU the authors have previously demonstrated that significant speed-ups in the solution of the scalar Poisson problem can be achieved by the addition of GPUs as scientific co-processors to a commodity based cluster. In this paper the authors put the more general claim to the test: Applications based on FEAST, that ran only on CPUs so far, can be successfully accelerated on a co-processor enhanced cluster without any code modifications. The chosen solid mechanics code has higher accuracy requirements and a more diverse CPU/co-processor interaction than the Poisson example, and is thus better suited to assess the practicability of the acceleration approach. The paper presents accuracy experiments, a scalability test and acceleration results for different elastic objects under load. In particular, it demonstrates in detail that the single precision execution of the co-processor does not affect the final accuracy. The paper establishes how the local acceleration gains of factors 5.5 to 9.0 translate into 1.6- to 2.6-fold total speed-up. Subsequent analysis reveals which measures will increase these factors further. (Dominik Göddeke, Hilmar Wobker, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Stefan Turek. Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with FEASTGPU. International Journal of Computational Science and Engineering (to appear).)
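One common way to reason about how a local acceleration factor turns into a smaller total speed-up is Amdahl's law. The sketch below is an illustration under that assumed model only; the paper's own analysis may use a different breakdown, and the function names `total_speedup` and `implied_fraction` are hypothetical.

```python
def total_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: overall speed-up when only a fraction of the
    runtime is accelerated by the given local factor."""
    serial_part = 1.0 - accelerated_fraction
    accelerated_part = accelerated_fraction / local_speedup
    return 1.0 / (serial_part + accelerated_part)


def implied_fraction(total, local_speedup):
    """Invert Amdahl's law: which fraction of the original runtime
    must have been accelerated to observe this total speed-up?"""
    return (1.0 - 1.0 / total) / (1.0 - 1.0 / local_speedup)


if __name__ == "__main__":
    # Even a large local factor yields a modest total speed-up
    # when only part of the runtime is offloaded.
    for fraction in (0.25, 0.5, 0.75):
        print(fraction, round(total_speedup(fraction, 9.0), 3))
```

Under this model, the unaccelerated part of the runtime bounds the total gain, which is consistent with the observation that further measures are needed to raise the overall factors.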
Posted: 06 Jun 2008 [GPGPU /Scientific Computing] # GPU acceleration of cutoff pair potentials for molecular modeling applications The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. The paper presents algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low computational efficiency, a new strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition. (C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, W. W. Hwu, GPU acceleration of cutoff pair potentials for molecular modeling applications. Proceedings of the 2008 Conference On Computing Frontiers, pp.273-282, 2008.) (http://www.ks.uiuc.edu/Research/gpu/)
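The idea of uniform grid spatial decomposition for a cutoff potential can be sketched on the CPU as follows. This is a minimal serial illustration, not the paper's CUDA algorithm: it assumes a bare truncated Coulomb term `q / r` with no smoothing or physical constants, a cell edge equal to the cutoff, and hypothetical helper names `build_cells` and `potential_at`.

```python
import math
from collections import defaultdict


def build_cells(atoms, cutoff):
    """Bin atoms (x, y, z, charge) into cubic cells with edge length = cutoff."""
    cells = defaultdict(list)
    for x, y, z, q in atoms:
        key = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        cells[key].append((x, y, z, q))
    return cells


def potential_at(point, cells, cutoff):
    """Sum q / r over atoms closer than the cutoff, visiting only the
    27 cells around the point instead of every atom."""
    cx, cy, cz = (math.floor(c / cutoff) for c in point)
    total = 0.0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for x, y, z, q in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    r = math.dist(point, (x, y, z))
                    if 0.0 < r < cutoff:
                        total += q / r
    return total


if __name__ == "__main__":
    atoms = [(0.5, 0.5, 0.5, 1.0), (3.0, 0.5, 0.5, -1.0), (10.0, 10.0, 10.0, 2.0)]
    cells = build_cells(atoms, cutoff=4.0)
    print(potential_at((1.5, 0.5, 0.5), cells, cutoff=4.0))
```

Because any atom within the cutoff differs from the query point by less than one cell edge along each axis, scanning the neighbouring 27 cells is sufficient; distant atoms are never touched.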
Posted: 25 May 2008 [GPGPU /Scientific Computing] # Abstract: "The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications." (J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, "GPU Computing", Proceedings of the IEEE, vol.96, no.5, pp.879-899, May 2008)
Posted: 25 May 2008 [GPGPU /Scientific Computing] # CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two biological sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. This paper by Svetlin Manavski and Giorgio Valle describes SmithWaterman-CUDA, an open-source project to perform fast sequence alignment on the GPU. Although the software performs the optimal Smith-Waterman alignment, it is faster than heuristic approaches like FASTA and BLAST. The tests on protein data banks show up to 30x speedup relative to reference CPU implementations. (Svetlin A. Manavski, Giorgio Valle, CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics 2008, 9(Suppl 2):S10 (26 March 2008))
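The dynamic programming recurrence behind Smith-Waterman, and why its cost grows with the product of the two sequence lengths, can be seen in a short serial sketch. The scoring values (match 2, mismatch -1, linear gap -1) are illustrative assumptions; real tools such as the one described above typically use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # Clamping at zero is what makes the alignment local.
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman("TTACGTT", "GGACGGG"))
```

The two nested loops fill one matrix cell per pair of positions, which is the quadratic cost the post refers to. Cells on the same anti-diagonal do not depend on each other, which is one reason the problem lends itself to parallel hardware.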
Posted: 02 Apr 2008 [GPGPU /Scientific Computing] # A SIMD interpreter for Genetic Programming on GPU Graphics Cards Abstract: Mackey-Glass chaotic time series prediction and nuclear protein classification show the feasibility of evaluating genetic programming populations directly on parallel consumer gaming graphics processing units. Using a Linux KDE computer equipped with an nVidia GeForce 8800 GTX graphics processing unit card the C++ SPMD interpreter evolves programs at Giga GP operations per second (895 million GPops). We use the RapidMind general processing on GPU (GPGPU) framework to evaluate an entire population of a quarter of a million individual programs on a non-trivial problem in 4 seconds. An efficient reverse polish notation (RPN) tree based GP is given. (A SIMD interpreter for Genetic Programming on GPU Graphics Cards. W.B. Langdon and W. Banzhaf. In M. Neill, L. Vanneschi, A.I. Esparcia Alcazar, S. Gustafson eds., EuroGP 2008, pp73-85. Springer, LNCS 4971, 26-28 March, Naples.)
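A reverse polish notation program can be evaluated with a single stack and no recursion, which is part of what makes it attractive for an interpreter running on data-parallel hardware. The sketch below is a serial illustration only; the token set, the single input variable `x`, and the function names are assumptions, not the interpreter from the paper.

```python
import operator

# Binary operators understood by this toy interpreter (an assumed function set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_rpn(program, x):
    """Evaluate one RPN program (a list of tokens) for input value x."""
    stack = []
    for token in program:
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token == "x":
            stack.append(x)
        else:
            stack.append(float(token))
    return stack.pop()


def eval_population(population, inputs):
    """Evaluate every program on every input. On a GPU this double loop
    is the part that would be spread across parallel threads."""
    return [[eval_rpn(program, x) for x in inputs] for program in population]


if __name__ == "__main__":
    # x * x + 1 written in RPN
    print(eval_rpn(["x", "x", "*", "1", "+"], 3.0))
```

Every program is interpreted by the same loop over tokens, so many individuals can share one instruction stream while operating on different data.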
Posted: 02 Apr 2008 [GPGPU /Scientific Computing] # Ivan Ufimtsev and Todd Martínez at the University of Illinois at Urbana-Champaign have implemented an efficient method of calculating two-electron repulsion integrals over Gaussian basis functions on the GPU. Virtually all modern quantum chemical calculations require evaluating millions to billions of these integrals. This problem turns out to be well-suited to the massively parallel architecture of GPUs by an appropriate partitioning of the problem. A benchmark test performed for the evaluation of approximately one million (ss|ss) integrals over contracted s-orbitals showed that a naïve algorithm implemented on the GPU achieves up to 130-fold speedup over a traditional CPU implementation on an AMD Opteron. Subsequent calculations on a 256-atom DNA strand show that the GPU advantage is maintained for basis sets including higher angular momentum functions.
(Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation, Ivan S. Ufimtsev and Todd J. Martínez, J. Chem. Theory Comput., 4 (2), 222-231, 2008. doi:10.1021/ct700268q)
Posted: 01 Apr 2008 [GPGPU /Scientific Computing] # In this paper we describe a modification of a general purpose code for quantum mechanical calculations of molecular properties (Q-Chem) to use a graphical processing unit. We report a 4.3x speedup of the resolution-of-the-identity second-order Møller-Plesset perturbation theory execution time for single point energy calculation of linear alkanes. Furthermore, we obtain the correlation and total energy for n-octane conformers as the torsional angle of central bond is rotated to show that precision is not lost for these types of calculations. This code modification is accomplished using the NVIDIA CUDA Basic Linear Algebra Subprograms (CUBLAS) library for an NVIDIA Quadro FX 5600 graphics card. Finally, we anticipate further speedups of other matrix algebra based electronic structure calculations using a similar approach. (Accelerating Resolution-of-the-Identity Second-Order Møller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. Vogt, L., Olivares-Amaya, R., Kermes, S., Shao, Y., Amador-Bedolla, C., and Aspuru-Guzik, A. J. Phys. Chem. A, 2008, DOI: 10.1021/jp0776762)
Posted: 10 Feb 2008 [GPGPU /Scientific Computing] # High-throughput sequence alignment using Graphics Processing Units The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. This paper by University of Maryland researchers Michael Schatz, Cole Trapnell, Art Delcher, and Amitabh Varshney describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on GPUs. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, despite the very low arithmetic intensity of the task. (High-throughput sequence alignment using Graphics Processing Units, Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007), BMC Bioinformatics 8:474.)
Posted: 10 Feb 2008 [GPGPU /Scientific Computing] # Abstract: "The widespread usage of the Discrete Wavelet Transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank (FBS) and Lifting (LS), and have always concluded that Lifting is the most efficient option. However, there is no such study on streaming processors such as modern Graphic Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current generation GPUs. In our experiments, the actual FBS gains range between 10% and 140%, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future generation GPUs." (Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting. Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, and Francisco Tirado. IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 3, pp. 299-310, March, 2008.)
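The difference between the two schemes can be illustrated on the simplest wavelet. The sketch below computes one level of an unnormalized 1D Haar transform both ways; it is an illustration under assumptions (1D rather than 2D, Haar rather than the longer filters studied in the paper, even-length input), and the function names are hypothetical.

```python
def haar_filter_bank(signal):
    """Filter bank form: each output is computed directly from the inputs,
    so all outputs are independent of one another."""
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    detail = [signal[i + 1] - signal[i] for i in pairs]
    return approx, detail


def haar_lifting(signal):
    """Lifting form: split, predict, update. The update step reuses the
    detail values, saving arithmetic but adding a dependency between steps."""
    even = signal[0::2]
    odd = signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]         # predict
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update
    return approx, detail


if __name__ == "__main__":
    data = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    print(haar_filter_bank(data))
    print(haar_lifting(data))
```

Both functions produce the same coefficients. Lifting performs fewer operations per sample, while the filter bank form has no dependency between its steps, which is the kind of trade-off that can change the ranking on stream processors.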
Posted: 24 Jan 2008 [GPGPU /Scientific Computing] # Toward efficient GPU-accelerated N-body simulations Abstract: "N-body algorithms are applicable to a number of common problems in computational physics including gravitation, electrostatics, and fluid dynamics. Fast algorithms (those with better than O(N^2) performance) exist, but have not been successfully implemented on GPU hardware for practical problems. In the present work, we introduce not only best-in-class performance for a multipole-accelerated treecode method, but a series of improvements that support implementation of this solver on highly-data-parallel graphics processing units (GPUs). The greatly reduced computation times suggest that this problem is ideally suited for the current and next generations of single and cluster CPU-GPU architectures. We believe that this is an ideal method for practical computation of large-scale turbulent flows on future supercomputing hardware using parallel vortex particle methods." (Mark J. Stock and Adrin Gharakhani, "Toward efficient GPU-accelerated N-body simulations," in 46th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2008-608, January 2008, Reno, Nevada.)
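For contrast with the treecode described above, the quadratic baseline that fast methods improve on is a direct sum over all pairs. The sketch below is that baseline only, for the gravitational case, assuming units with G = 1 and an optional softening length; it is not the multipole-accelerated method from the paper, and the function name is hypothetical.

```python
def direct_sum_accelerations(positions, masses, softening=0.0):
    """Direct O(N^2) gravitational accelerations with G = 1.
    positions: list of (x, y, z); masses: list of floats."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Vector from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + softening ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += masses[j] * d[k] * inv_r3
    return acc


if __name__ == "__main__":
    print(direct_sum_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
```

Every body visits every other body, so doubling N quadruples the work. A treecode replaces the inner loop over distant bodies with a small number of interactions with aggregated clusters.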
Posted: 17 Jan 2008 [GPGPU /Scientific Computing] # Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware Abstract: "The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described. The motivation for such an effort is the impressive performance that GPUs offer: typically 10 times more floating point operations per second than a modern CPU, with over 100 processing cores and all at a very modest financial cost. Both codes were found to generate the same results on the GPU as the FORTRAN versions did on the CPU. The 2D solver ran up to 29 times quicker on the GPU than on the CPU; the 3D solver 16 times faster." (Tobias Brandvik and Graham Pullan, Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. 46th AIAA Aerospace Sciences Meeting and Exhibit. January, 2008.)
Posted: 17 Jan 2008 [GPGPU /Scientific Computing] # Interactive Simulation of Large Scale Agent-Based Models (ABMs) on the GPU This article by D’Souza et al. explores large scale Agent-Based Model (ABM) simulation on the GPU. Agent-based modeling is a technique which has become increasingly popular for simulating complex natural phenomena such as swarms and biological cell colonies. An ABM describes a dynamic system by representing it as a collection of communicating, concurrent objects. Current ABM simulation toolkits and algorithms use discrete event simulation techniques and are executed serially on a CPU. This limits the size of the models that can be handled efficiently. In this paper we present a series of efficient data-parallel algorithms for simulating ABMs. These include methods for handling environment updates, agent interactions and replication. Important techniques presented in this work include a novel stochastic allocator which enables parallel agent replication in O(1) average time and an iterative method to handle collision among agents in the spatial domain. These techniques have been implemented on a modern GPU (GeForce 8800GTX), resulting in a substantial performance increase. The authors believe that their system is the first completely GPU-based ABM simulation framework. (D’Souza R., Lysenko, M., Rahmani, K., SugarScape on steroids: simulating over a million agents at interactive rates. Proceedings of the Agent2007 conference, Chicago, IL. 2007.)
Posted: 16 Jan 2008 [GPGPU /Scientific Computing/Dynamics Simulation] # Exploring weak scalability for FEM calculations on a GPU-enhanced cluster The first part of this paper by Goeddeke et al. surveys co-processor
approaches for commodity based clusters in general, not only with
respect to raw performance, but also in view of their system integration
and power consumption. We then extend previous work on a small GPU
cluster by exploring the heterogeneous hardware approach for a
large-scale system with up to 160 nodes. Starting with a conventional
commodity based cluster we leverage the high bandwidth of graphics
processing units (GPUs) to increase the overall system bandwidth that is
the decisive performance factor in this scenario. Thus, even the
addition of low-end, out of date GPUs leads to improvements in both
performance- and power-related metrics.
(Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek.
Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33:10-11. pp. 685-699. 2007.)
Posted: 15 Nov 2007 [GPGPU /Scientific Computing] # Using GPUs to Improve Multigrid Solver Performance on a Cluster This article by Goeddeke et al. explores the coupling of coarse and
fine-grained parallelism for Finite Element simulations based on
efficient parallel multigrid solvers. The focus lies on both system
performance and a minimally invasive integration of hardware
acceleration into an existing software package, requiring no changes to
application code. Because of their excellent price performance ratio, we
demonstrate the viability of our approach by using commodity graphics
processors (GPUs) as efficient multigrid preconditioners. We address the
issue of limited precision on GPUs by applying a mixed precision,
iterative refinement technique. Other restrictions are also handled by a
close interplay between the GPU and CPU. From a software perspective, we
integrate the GPU solvers into the existing MPI-based Finite Element
package by implementing the same interfaces as the CPU solvers, so that
for the application programmer they are easily interchangeable. Our
results show that we do not compromise any software functionality and
gain speedups of two and more for large problems. Equipped with this
additional option of hardware acceleration we compare different choices
in increasing the performance of a conventional, commodity based cluster
by increasing the number of nodes, replacement of nodes by a newer
technology generation, and adding powerful graphics cards to the
existing nodes.
(Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker and Stefan Turek.
Using GPUs to Improve Multigrid Solver Performance on a Cluster. Accepted for publication in the International Journal of Computational Science and Engineering.)
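The mixed precision, iterative refinement idea mentioned above can be sketched in a few lines: a low-precision solver produces corrections, while residuals and the running solution are kept in high precision. The sketch below is a CPU illustration under stated assumptions: single precision is emulated by rounding through `struct`, and a small Gaussian elimination stands in for the GPU multigrid solver used in the paper. All function names are hypothetical.

```python
import struct


def to_single(value):
    """Round a Python float (double) to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding every
    intermediate result to single precision (stand-in for the GPU solver)."""
    n = len(b)
    M = [[to_single(A[i][j]) for j in range(n)] + [to_single(b[i])] for i in range(n)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):
            factor = to_single(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_single(M[i][j] - to_single(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_single(s - to_single(M[i][j] * x[j]))
        x[i] = to_single(s / M[i][i])
    return x


def mixed_precision_solve(A, b, tol=1e-12, max_iter=20):
    """Iterative refinement: double precision residual and update,
    low-precision inner solve for the correction."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        residual = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(r) for r in residual) <= tol:
            break
        correction = solve_low_precision(A, residual)
        x = [x[i] + correction[i] for i in range(n)]
    return x


if __name__ == "__main__":
    # 1D Poisson-like tridiagonal system whose exact solution is all ones.
    A = [[2.0, -1.0, 0.0, 0.0], [-1.0, 2.0, -1.0, 0.0],
         [0.0, -1.0, 2.0, -1.0], [0.0, 0.0, -1.0, 2.0]]
    b = [1.0, 0.0, 0.0, 1.0]
    print(mixed_precision_solve(A, b))
```

Each outer iteration only needs the inner solver to gain a few correct digits, so the expensive part of the work can run in the faster, lower precision while the final result still reaches double precision accuracy for reasonably conditioned systems.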
Posted: 15 Nov 2007 [GPGPU /Scientific Computing] # Toward Acceleration of RSA Using 3D Graphics Hardware This paper by Moss et al. shows an implementation of multi-precision arithmetic running on a 7800-GTX. The paper shows how to compute the modular exponentiation of large integers (a central operation in the RSA cryptosystem) using the restricted control flow available on a DX9 card. Both the background number theory used to express the problem in a suitable way for a streaming architecture, and the program transformation techniques used to generate the GLSL code are described in detail. Surprisingly (given the unusual nature of the problem for GPGPU) the GPU is capable of out-performing the CPU over a large enough dataset by a factor of 2x-3x depending on the CPU implementation. Unfortunately the immature state of the GLSL compiler prevents a further 2x improvement by allocating too many registers, and the large latency for setting the problem up means that over 800 exponentiations need to be performed to break even against the CPU.
(Andrew Moss, Dan Page and Nigel Smart. Toward Acceleration of RSA Using 3D Graphics Hardware. In: LNCS 4887, pages 369--388. Springer, December 2007.)
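For readers unfamiliar with the operation, modular exponentiation is usually introduced through the textbook square-and-multiply method. The sketch below shows that method on ordinary Python integers; it is not the streaming formulation developed in the paper, and the function name `mod_exp` is an assumption for illustration.

```python
def mod_exp(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by binary square-and-multiply,
    scanning the exponent from its least significant bit."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # This bit of the exponent is set: fold the current power in.
            result = (result * base) % modulus
        # Square for the next bit.
        base = (base * base) % modulus
        exponent >>= 1
    return result


if __name__ == "__main__":
    # Toy RSA round trip with small textbook-sized parameters (not secure).
    n, e, d = 3233, 17, 2753
    message = 65
    cipher = mod_exp(message, e, n)
    print(cipher, mod_exp(cipher, d, n))
```

The loop runs once per exponent bit, and its branch depends on the data, which hints at why restricted control flow on older graphics hardware makes the problem awkward. Python's built-in `pow(base, exponent, modulus)` computes the same value and is the better choice in practice.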
Posted: 05 Nov 2007 [GPGPU /Scientific Computing/Numerical Algorithms] # This paper by Anderson et al. at Caltech describes a method for accelerating Quantum Monte Carlo (QMC) on a GPU. QMC is among the most accurate (and expensive) methods in the quantum chemistry zoo. The work primarily investigates tricks available to this algorithm for speeding up matrix multiplication: because QMC is a statistical algorithm, the authors studied the performance enhancements available when multiplying many matrices simultaneously. Additionally, the paper explores the Kahan Summation Formula to improve the accuracy of GPU matrix multiplication. (Quantum Monte Carlo on Graphical Processing Units. Amos G. Anderson, William A Goddard III, Peter Schroder. Computer Physics Communications)
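Kahan summation carries a small correction term alongside the running total so that low-order bits lost in each addition are fed back into the next one. The sketch below shows the formula on Python floats (double precision); the paper applies the idea to single precision GPU matrix multiplication, which is not reproduced here, and the function names are illustrative.

```python
def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation (Kahan): track the rounding error of each
    addition and subtract it from the next input."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # apply the stored correction
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


if __name__ == "__main__":
    # Many tiny terms added to 1.0 vanish entirely in the naive loop.
    values = [1.0] + [1e-16] * 1000
    print(naive_sum(values), kahan_sum(values))
```

In a matrix product, each output element is a long dot product, so the same compensation can be applied to the inner accumulation at the cost of a few extra additions per term.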
Posted: 10 Sep 2007 [GPGPU /Scientific Computing] # Graphic processors to speed-up simulations for the design of high performance solar receptors This paper by Collange et al. at Université de Perpignan, France, describes a prototype to be integrated into simulation codes that estimate temperature, velocity and pressure to design next generation solar receptors. Such codes delegate to GPUs the computation of heat transfer due to radiation. The authors use Monte-Carlo line-by-line ray-tracing through finite volumes, which amounts to data-parallel arithmetic transformations on large data structures. Performance on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL) shows speedups higher than 400 compared to CPU implementations, leaving most CPU computing resources available. Because there were open questions about the accuracy of the operators implemented in GPUs, the authors start this report with a survey and some contributed tests on the various floating point units available on GPUs.
(Graphic processors to speed-up simulations for the design of high performance solar receptors. S. Collange, M. Daumas, D. Defour. Proceedings of the IEEE 18th International Conference on Application-specific Systems, Architectures and Processors.)
Posted: 04 Sep 2007 [GPGPU /Scientific Computing] # Two-electron Integral Evaluation on the Graphics Processor Unit Abstract: We propose the algorithm to evaluate the Coulomb potential in the ab initio density functional calculation on the graphics processor unit (GPU). The numerical accuracy required for the algorithm is investigated in detail. It is shown that GPU, which supports only the single-precision floating number natively, can take part in the major computational tasks. Because of the limited size of the working memory, the Gauss-Rys quadrature to evaluate the electron repulsion integrals (ERIs) is investigated in detail. The error analysis of the quadrature is performed. New interpolation formula of the roots and weights is presented, which is suitable for the processor of the single-instruction multiple-data type. It is proposed to calculate only small ERIs on GPU. ERIs can be classified efficiently with the upper-bound formula. The algorithm is implemented on NVIDIA GeForce 8800 GTX and the Gaussian 03 program suite. It is applied to the test molecules Taxol and Valinomycin. The total energies calculated are essentially the same as the reference ones. The preliminary results show the considerable speedup over the commodity microprocessor. (Two-electron integral evaluation on the graphics processor unit. Koji Yasuda. Journal of Computational Chemistry. July 5, 2007.)
Posted: 16 Aug 2007 [GPGPU /Scientific Computing] #