GPGPU
General-Purpose Computation Using Graphics Hardware
Introduction
GPGPU stands for General-Purpose computation on GPUs. With the increasing programmability of commodity graphics processing units (GPUs), these chips are capable of performing more than the specific graphics computations for which they were designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications. The goal of this page is to catalog the current and historical use of GPUs for general-purpose computation.

NVIDIA to Host Full-Day CUDA Tutorial at SC08
NVIDIA will host a full-day tutorial on "High Performance Computing with CUDA" (Module M02) at Supercomputing 2008 at the Austin Convention Center in Austin, Texas, on Monday, Nov. 17, 2008, from 8:30 a.m. to 5:00 p.m. SC08 is the international conference on high performance computing, networking, storage and analysis. The tutorial is designed to give attendees a thorough introduction to the CUDA programming model. NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering applications. The morning session will introduce CUDA programming and the CUDA execution and memory models, and explore the uses of CUDA with many brief examples from diverse HPC domains. The afternoon session will cover more advanced topics and include real-world case studies from domain scientists using CUDA for computational biology, computational fluid dynamics and seismic imaging. Joining NVIDIA in this tutorial will be John Stone and Jim Phillips from the University of Illinois at Urbana-Champaign and Scott Morton from Hess Corporation, a leading global energy company engaged in the exploration and production of crude oil and natural gas. For more information on the CUDA tutorial visit here, and if you would like to attend, please register here.
Posted: 16 Oct 2008 [GPGPU /Conferences]

CUDA.NET version 2.0 is now available for download. Changes from CUDA.NET 1.1 include full support for the CUDA 2.0 API, support for double precision data types, the latest BLAS routines from CUDA 2.0, and some minor bug fixes. (CUDA.NET)
Posted: 16 Oct 2008 [GPGPU /Tools]

A Simple Compressive Sensing Algorithm for Parallel Many-Core Architectures
This paper considers the l1-compressive sensing problem. It proposes an algorithm specifically designed to take advantage of shared memory, vectorized, parallel and many-core microprocessors such as the Cell processor, new generation Graphics Processing Units (GPUs) and standard vectorized multi-core processors (e.g. quad core CPUs). The paper also gives evidence of the efficiency of its approach and compares the algorithm on the three platforms, exhibiting pros and cons for each of them. (A Simple Compressive Sensing Algorithm for Parallel Many-Core Architectures. Alexandre Borghi, Jerome Darbon, Sylvain Peyronnet, Tony F. Chan and Stanley Osher. UCLA Computational and Applied Mathematics Technical Report. September 2008.)
Posted: 16 Oct 2008 [GPGPU /Image And Volume Processing]

Practical Symmetric Key Cryptography on Modern Graphics Hardware
This paper presents an application-oriented approach to block cipher processing on GPUs. A new block-based conventional implementation of AES on an NVIDIA G80 is shown with 4-10x speed improvements over CPU implementations and a 2-4x speed increase over the previous fastest AES GPU implementation. The paper also presents a general-purpose data structure for representing cryptographic client requests that is suitable for execution on a GPU, and explores the issues involved in mapping this structure to the GPU. Finally, it presents the first analysis of the main encryption modes of operation on a GPU, showing the performance and behavioural implications of executing these modes under the outlined general-purpose data model. (Practical Symmetric Key Cryptography on Modern Graphics Hardware. Owen Harrison and John Waldron. 17th USENIX Security Symposium. 2008.)
Posted: 16 Oct 2008 [GPGPU ]

NCSA to add 62 teraflops of compute power with new heterogeneous system
The following is excerpted from an NVIDIA press release.
Installation has begun on a new computational resource at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. Lincoln will deliver peak performance of 62.3 teraflops and is designed to push the envelope in the use of heterogeneous processors for scientific computing. The system is expected to be online in October, bringing NCSA's total computational resources to nearly 170 teraflops. Lincoln will consist of 192 compute nodes (Dell PowerEdge 1950 III dual-socket nodes with quad-core Intel Harpertown 2.33GHz processors and 16GB of memory) and 96 NVIDIA Tesla S1070 accelerator units. Each Tesla unit provides 500 gigaflops of double-precision performance and 16GB of memory. Lincoln's InfiniBand interconnect fabric will be linked to the interconnect fabric of Abe, the 89-teraflop cluster that is currently NCSA's largest resource. This will enable certain applications to run across the entire complex, providing a peak "Abe Lincoln" performance of 152 teraflops. (Press Release)
Posted: 16 Oct 2008 [GPGPU /Press]

Sliding-Windows for Rapid Object Class Localization: a Parallel Technique
This paper by Wojek et al. from TU Darmstadt presents a fast object class localization framework implemented on a data-parallel architecture available in recent computers. The case study, an implementation of Histograms of Oriented Gradients (HOG) descriptors, shows that this programming model alone speeds up an original CPU-only implementation by a factor of 34 (with disk I/O) or 109 (processing only), making it unnecessary to use early rejection cascades that sacrifice classification performance, even in real-time conditions. Programming the Graphics Processing Unit (GPU) with recent techniques allows the method to scale to the latest hardware as well as to future hardware improvements. (Sliding-Windows for Rapid Object Class Localization: a Parallel Technique. C. Wojek, G. Dorko, A. Schulz, B. Schiele. 30th DAGM Symposium (DAGM 2008), pp. 71-81, Munich, Germany.)
Posted: 16 Oct 2008 [GPGPU /Image And Volume Processing/Computer Vision]

A MapReduce Framework on Graphics Processors
This paper describes the design and implementation of Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google to ease the development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but can be harder to program because their architectures are designed as special-purpose co-processors and they have only recently introduced non-graphics programming interfaces. The authors developed Mars on an NVIDIA G80 GPU, which contains 128 processors, and evaluated it in comparison with Phoenix, the state-of-the-art MapReduce framework on multi-core CPUs. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine. Additionally, the authors propose a MapReduce framework with coprocessing between the GPU and the CPU for further performance improvement. Mars is developed by Bingsheng He (HKUST) and Wenbin Fang (HKUST) under the supervision of Naga K. Govindaraju (Microsoft Corp.), Qiong Luo (HKUST), and Tuyong Wang (Sina.com). Source code of Mars can be downloaded from the authors' website. (A MapReduce Framework on Graphics Processors. Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. To appear in PACT 2008.)
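The MapReduce interface mentioned in this abstract can be illustrated with a minimal, sequential Python sketch. This is not Mars code and does not reflect its API: the names `map_reduce`, `word_count_map`, and `word_count_reduce` are hypothetical, and everything runs on the CPU. It only shows the map, group-by-key, and reduce phases that such a framework would parallelize.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run the map, group-by-key, and reduce phases one after another.

    Sequential CPU illustration only; a real framework would run
    the map and reduce calls in parallel.
    """
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: combine all values that share a key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    """Hypothetical map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    """Hypothetical reduce function: sum the counts emitted for one word."""
    return sum(counts)


if __name__ == "__main__":
    pages = ["the GPU is fast", "the CPU is general"]
    print(map_reduce(pages, word_count_map, word_count_reduce))
```

The user supplies only the two small functions; the grouping and scheduling stay inside `map_reduce`, which is the part a GPU framework would replace with parallel execution.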
Posted: 16 Oct 2008 [GPGPU /Data Parallel Algorithms]

Concurrent Number Cruncher: a GPU implementation of a general sparse linear solver
A wide class of numerical methods needs to solve a linear system in which the pattern of non-zero matrix coefficients can be arbitrary. These problems can benefit greatly from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CUBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high performance computing strategies (namely block compressed row storage, register blocking and vectorization) to implement a sparse general-purpose linear solver. This implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0, making it attractive for applications that are content with single precision. (Concurrent number cruncher - A GPU implementation of a general sparse linear solver. Luc Buatois, Guillaume Caumon and Bruno Lévy. International Journal of Parallel, Emergent and Distributed Systems. To appear.)
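As a rough illustration of the solver named in this abstract, the following pure-Python sketch runs a Jacobi-preconditioned conjugate gradient on a matrix held in compressed row storage. It differs from the paper in several assumed simplifications: plain scalar compressed row storage instead of block compressed row storage, no register blocking or vectorization, CPU-only execution, and Python's double-precision floats. The function names are hypothetical, and the matrix is assumed to be symmetric positive definite with a nonzero diagonal.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for A stored in compressed row storage (CSR)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b with conjugate gradient and a Jacobi preconditioner.

    Assumes A is symmetric positive definite (an assumption of this sketch).
    """
    n = len(b)
    # Jacobi preconditioner: the inverse of the diagonal of A.
    inv_diag = [0.0] * n
    for row in range(n):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            if col_idx[k] == row:
                inv_diag[row] = 1.0 / values[k]

    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    # A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]] in CSR form.
    values = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
    col_idx = [0, 1, 0, 1, 2, 1, 2]
    row_ptr = [0, 2, 5, 7]
    print(jacobi_pcg(values, col_idx, row_ptr, [6.0, 10.0, 8.0]))
```

Each iteration is dominated by one sparse matrix-vector product plus a few vector updates and dot products, which is why the storage format and memory bandwidth matter so much for this kind of solver.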
Posted: 16 Oct 2008 [GPGPU /Scientific Computing/Numerical Algorithms]

SPRAT: Runtime Processor Selection for Energy-aware Computing
This paper by Takizawa et al. at Tohoku University describes a programming framework named Stream Programming with Runtime Auto-Tuning (SPRAT) that combines a high-level programming language with runtime processor selection. Today, a commodity PC can be seen as a hybrid computing system equipped with two different kinds of processors, i.e. CPU and GPU. Because the performance and power-efficiency advantages of GPUs depend strongly on the system configuration and on the data size determined at run time, a programmer cannot always know which processor should be used to execute a given kernel. The SPRAT framework therefore dynamically selects an appropriate processor so as to improve energy efficiency. The evaluation results indicate that selecting the processor at run time, for each kernel execution and its given data streams, is promising for energy-aware computing on a hybrid computing system. (SPRAT: Runtime Processor Selection for Energy-aware Computing. Hiroyuki Takizawa, Katsuto Sato, and Hiroaki Kobayashi. To appear in Proceedings of IEEE Cluster 2008 (the 3rd international workshop on automatic performance tuning).)
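The idea of choosing a processor at run time from the observed data size can be sketched as follows. This is not SPRAT's algorithm or API: the classes `Processor` and `RuntimeSelector`, the size-bucketing rule, and the energy estimate (an assumed constant power draw multiplied by measured time) are all hypothetical choices made for illustration, and the "processors" here are ordinary Python callables.

```python
import time


class Processor:
    """A hypothetical execution target with an assumed constant power draw."""

    def __init__(self, name, run, watts):
        self.name = name
        self.run = run      # callable that executes the kernel on some data
        self.watts = watts  # assumed average power in watts (illustrative)


def measure_energy(processor, data, timer=time.perf_counter):
    """Estimate energy in joules as assumed power times measured run time."""
    start = timer()
    processor.run(data)
    return processor.watts * (timer() - start)


class RuntimeSelector:
    """Pick, per data-size bucket, the processor with the lowest energy."""

    def __init__(self, processors, measure=measure_energy):
        self.processors = processors
        self.measure = measure
        self.best = {}  # size bucket -> chosen Processor

    def choose(self, data):
        # Sizes within the same power-of-two range share one decision.
        bucket = len(data).bit_length()
        if bucket not in self.best:
            # First time this size range is seen: profile every candidate.
            self.best[bucket] = min(
                self.processors, key=lambda p: self.measure(p, data)
            )
        return self.best[bucket]

    def execute(self, data):
        return self.choose(data).run(data)


if __name__ == "__main__":
    cpu = Processor("cpu", lambda xs: [x * x for x in xs], watts=50.0)
    gpu = Processor("gpu-stand-in", lambda xs: list(map(lambda x: x * x, xs)), watts=150.0)
    selector = RuntimeSelector([cpu, gpu])
    print(selector.choose(list(range(1000))).name)
```

Caching the decision per size bucket keeps the profiling cost to one trial run per candidate and size range; a real system would also need to account for data transfer costs and changing system load.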
Posted: 16 Oct 2008 [GPGPU /High-Level Languages]

GPU4Vision is a project founded by the Institute for Computer Graphics and Vision, Graz University of Technology, dealing with fast computer vision algorithms for tasks like basic image processing, segmentation, motion, stereo etc. On the GPU4Vision website you can take a look at the project's latest scientific publications, watch demo videos of algorithms and even download and evaluate some of them on your own PC. (GPU4Vision - Website)
Posted: 16 Oct 2008 [GPGPU /Image And Volume Processing/Computer Vision]