(Spurzem et al. 2009) (Part of the content of this page was presented in the hardware part of the talk by R. Spurzem on Wed., Feb. 17, 14:00.)

  • Introduction

Why special hardware? Because the cost of direct N-Body integration scales with N^2, and even Moore's Law does not help us if we want to increase particle numbers to realistic values. D. Sugimoto therefore argued in a Nature article in 1990 that special hardware is needed to detect gravothermal oscillations in N-Body models of globular clusters, a result eventually published by Makino (1996) using GRAPE-4.
Another issue: parallelization requires very high bandwidth and low latency, because gravity cannot be shielded, and the entire system, no matter how large N is, is in physical interaction with itself.
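To make the N^2 scaling concrete, here is a minimal sketch of a direct-summation force loop in plain Python. It is not taken from NBODY6 or any GRAPE library; the function name, the softening parameter `eps` and the choice G = 1 are assumptions made for illustration only.

```python
import math

def direct_accelerations(pos, mass, G=1.0, eps=0.0):
    """Direct-summation accelerations; returns (accelerations, pair evaluations).

    pos  : list of [x, y, z] positions
    mass : list of masses
    eps  : optional softening length (assumption, 0 means pure Newtonian)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    pairs = 0
    for i in range(n):            # every particle ...
        for j in range(n):        # ... feels every other particle
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
            pairs += 1
    return acc, pairs

def grid_positions(n):
    """Hypothetical helper: n particles on a line, only used to count pairs."""
    return [[float(i), 0.0, 0.0] for i in range(n)]

acc2, pairs2 = direct_accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 2.0])
_, pairs4 = direct_accelerations(grid_positions(4), [1.0] * 4)
_, pairs8 = direct_accelerations(grid_positions(8), [1.0] * 8)
```

The inner loop runs N(N-1) times per force evaluation, so doubling N from 4 to 8 raises the number of pair evaluations from 12 to 56. This inner loop is the part that GRAPE, GPU and FPGA boards take over from the host.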

  • GRAPE = Gravity Pipe

Many things still need to be added here, at the very least links to the Japanese GRAPE project page and to the more recent GRAPE-DR project webpage.

Another link points to a recent collaborative (Amsterdam, Heidelberg, Rochester) profiling and timing study of the simple parallel N-Body code phiGRAPE on parallel GRAPE clusters in Heidelberg and Rochester: Harfst, Gualandris, Merritt, Spurzem, Portegies Zwart, Berczik, 2007. Note in passing: the massively parallel NBODY6++ code (see the Manual by E. Khalisi & R. Spurzem) is derived from NBODY6, and NO, it is NOT written in C++ but in Fortran / MPI. It has not yet been ported to GRAPE or GPU clusters; this is work in progress. See also Sverre's N-Body Webpage and the very plain NBODY6 Download Page (please look at the README first).

  • GPU = Graphics Processing Units

GPUs generally perform as well as GRAPE, sometimes better, at a lower price. In addition they are programmable, recently with the high-level language CUDA provided by NVIDIA. CUDA and GPUs are becoming widely used in nearly every branch of computational science and data processing. See e.g. the press release on the first 40 node GPU cluster at the University of Heidelberg. Two other useful starting points for further information are the 2007 Princeton AstroGPU workshop and the series of books provided by NVIDIA titled GPU Gems - Programming Techniques on Graphics Cards (also provided online by NVIDIA). In particular the reader might be interested in the Section on Fast N-Body Simulation with CUDA (Nyland, Harris, Prins) in GPU Gems 3 (see also the list of citations therein, in particular the references to Hamada and Portegies Zwart). All of these implementations were single precision, but an emulation of double precision without much loss of efficiency is possible. This has been nicely presented by Keigo Nitadori in his talk at the Turku N-Body Workshop (see Chapter 2). Several other science communities have been using GPUs as well, e.g. the Lattice QCD people (Egri et al. 2007) and, using another algorithm that emulates double precision with the help of the host, the Finite Element community (Göddeke, Strzodka & Turek 2005).
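One generic way to emulate higher precision on single-precision hardware is to store each number as a pair of singles (high part plus low-order correction) and add them with an error-free transformation (Knuth's two-sum). The sketch below illustrates that idea on the CPU in plain Python. It is not claimed to be the scheme used by Nitadori, Egri et al. or Göddeke et al., and all function names are assumptions.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Error-free single-precision addition: a + b == s + e exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ds_add(x, y):
    """Add two 'double-single' numbers, each a (high, low) pair of singles."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)                 # renormalise so that |low| stays small
    lo = f32(e - f32(hi - s))
    return hi, lo

small = f32(1.0e-8)                 # below half an ulp of 1.0 in single precision
steps = 1000

naive = 1.0
acc = (1.0, 0.0)
for _ in range(steps):
    naive = f32(naive + small)      # plain single precision: increment is lost
    acc = ds_add(acc, (small, 0.0)) # double-single: increment is kept in 'low'

emulated = acc[0] + acc[1]
reference = 1.0 + steps * small     # evaluated in native double precision
```

In this example the plain single-precision sum never moves away from 1.0, while the pair representation follows the double-precision reference closely. Small increments added to a large accumulated value are exactly the situation that occurs when many weak force contributions are summed.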
Another prototypical application suitable for GPUs is the Fast Fourier Transform (FFT), used in data processing but also frequently in N-body simulations (cosmology, galaxies) that use an approximate mesh-based potential (see Particle Mesh Codes in general, e.g. the example of the Superbox code used for simulations of tidal tails of clusters). FFT implementations on GPUs are described generally in GPU Gems 2, Chapter 48. One science paper on some improvements, found more or less at random, is Romero et al. 2007; an FFT implementation using a double precision emulation similar to the one used for N-Body can be found in one of the posters presented at the Clusters09 conference (Mangete 2009). Last, but not least, there are two examples of papers using GPUs in Computational Chemistry (Yang, Wang & Chen 2007) and Molecular Dynamics (Yasuda, 2007) (see also the computational chemistry page of NVIDIA). The latter applications limited themselves to single precision.
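To show where the FFT enters a mesh-based potential, here is a minimal one-dimensional sketch: the density on a periodic mesh is transformed, each Fourier mode is divided by -k^2 (times 4 pi G), and the result is transformed back. This is a plain CPU illustration, not the Superbox algorithm or any GPU implementation; the function names, the periodic 1D setup and G = 1 are assumptions.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    The inverse transform is returned unnormalised (caller divides by n)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def mesh_potential(rho, box_length, G=1.0):
    """Solve d^2 phi / dx^2 = 4 pi G rho on a periodic 1D mesh via FFT."""
    n = len(rho)
    rho_k = fft([complex(r) for r in rho])
    phi_k = [0j] * n   # k = 0 mode left at zero: the mean density is dropped
    for m in range(1, n):
        freq = m if m <= n // 2 else m - n        # signed mode number
        k = 2.0 * math.pi * freq / box_length
        phi_k[m] = -4.0 * math.pi * G * rho_k[m] / (k * k)
    phi = fft(phi_k, inverse=True)
    return [p.real / n for p in phi]

n_mesh = 16
rho = [math.cos(2.0 * math.pi * i / n_mesh) for i in range(n_mesh)]
phi = mesh_potential(rho, box_length=1.0)
```

For this single cosine mode in a box of length 1 the analytic answer is phi = -rho / pi, which the sketch reproduces to rounding error. A production code would use a 3D transform from a tuned library and assign particle masses to the mesh first.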
Figure: left: NVIDIA GeForce 8800 GTX, with 128 Stream Processors (organized in 16 multi-processors, each of which can handle 8 threads) and 768 MB of onboard memory; right: NVIDIA Tesla C1060, as used in our 40 node kolob cluster in Heidelberg (bottom left of box), inside one of the computing nodes.

  • FPGA = Field Programmable Gate Array

GPUs are much more flexible than GRAPE, but they have two problems: first, programming them in an optimal way requires special knowledge about the memory and hardware structure, even when done in CUDA; second, they consume much more energy per Teraflop/s than present-day supercomputers (e.g. IBM Blue Gene). Therefore we advocate the new paradigm of special green supercomputing, using FPGA chips. At the University of Heidelberg we have designed the MPRACE-1 and MPRACE-2 boards based on FPGAs, which consume much less power and are completely freely programmable chips.

Here you see sketches of MPRACE-1 and MPRACE-2, designed by the FPGA research group of the Computer Engineering Department of the Univ. of Heidelberg and used to accelerate astrophysical N-Body and SPH algorithms in the GRACE project. Due to some delay in production and commissioning, MPRACE-2 is not yet available for benchmarks as planned. If we simply scale up the technical data of MPRACE-2 from MPRACE-1 (larger chip area, higher clock rate) using the benchmarks in the paper cited below (Spurzem et al. 2009), MPRACE-2 could be faster for SPH computations than a GPU - but this still has to be confirmed by a real test simulation.
Left: MPRACE-1 board at bottom right, then in clockwise direction: design sketch, color map of used silicon for one SPH pipeline, sketch of subunits on the FPGA. Right: prototype board of MPRACE-2.

We have implemented gravitational forces for N-Body and SPH on both FPGA and GPU boards. Our results were presented at the 2007 European SPHERIC Workshop in Madrid (SPHERIC = SPH European Research Interest Group) in Berczik et al. 2007 and Marcus et al. 2007 (these papers can be downloaded as part of the full workshop book, a pdf file). An overview of all our results on N-Body and SPH on GPU and FPGA has been submitted to the International Supercomputing Conference 2009 - the paper is still in the refereeing process and is provided here as a draft "as is", without warranty, and subject to change (Spurzem et al. 2009). Our main result is that the computational efficiency of FPGA hardware is much better than that of GPUs for complicated pipelines like SPH, and still reasonable for gravity pipelines. Efficiency here means both performance relative to peak performance (roughly a measure of how well the silicon is used) and performance per Watt of power supplied. At this moment, for gravitational forces in very large N systems, we do not know of any hardware that is cheaper and faster than a GPU, but this may soon change with the advent of new generations of power-saving and fast FPGA chips and boards.
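The two notions of efficiency used above can be written down in a few lines. The sketch below is only a worked illustration of the definitions; the function name is an assumption, and the numbers are placeholders that do NOT come from the cited papers or from any measurement.

```python
def efficiency(sustained_gflops, peak_gflops, power_watts):
    """Return (fraction of peak performance, Gflop/s per Watt) for one board."""
    if peak_gflops <= 0 or power_watts <= 0:
        raise ValueError("peak performance and power must be positive")
    return sustained_gflops / peak_gflops, sustained_gflops / power_watts

# Placeholder values for illustration only; not measured data.
fraction_of_peak, gflops_per_watt = efficiency(
    sustained_gflops=50.0, peak_gflops=200.0, power_watts=25.0
)
```

With these made-up inputs the board would reach a quarter of its peak and deliver 2 Gflop/s per Watt. A board can score poorly on the first measure and well on the second, or vice versa, which is why both are quoted.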

So what is the problem with FPGAs, and why are they not more widely used? Well, nobody has created a fine high-level programming interface like CUDA for them - programming has to be done at a very low level, in a hardware description language called VHDL (VHSIC Hardware Description Language). A small number of groups worldwide, including the Univ. of Heidelberg, are involved in present and future projects to create such a new language. See also the Progrape Webpage of T. Hamada.

  • How to program these guys?

  1. APIs - Programmer Interfaces. GRAPE users are used to linking against program libraries; opening and closing the special hardware board and computing the forces are done by replacing lines in your code with calls to these library functions, which are documented to some degree (see e.g. the GRAPE6 software reference library). People have been working to provide completely equivalent libraries (same usage) which actually run on GPUs (see the kirin library supplied by the Amsterdam group of Simon Portegies Zwart, and see also Tsuyoshi Hamada's slightly more fancy (sorry, Simon...) Chamomile / CUNBODY-1 library, all single precision). The GRACE project (funded by the Volkswagen foundation) at Heidelberg will publish this year the interface routines racegrav (gravity on GPU and FPGA), raceSPH (SPH on FPGA) and cudaSPH (SPH on GPU). Interested early adopters, please contact Rainer Spurzem for access to testing nodes. These routines should be interchangeable between FPGA and GPU hardware boards, but we are not quite there yet.
  2. High-level language extensions, quickly digestible for experienced C or Fortran programmers, which describe the use of special hardware devices in a generic way, independent of hardware and vendors. GPGPU and BrookGPU belong to this class, and also the nowadays well publicised and (NVIDIA) supported CUDA language (here is a Wikipedia CUDA link independent of NVIDIA). Recently it seems that OpenCL is where all this might converge (again a Wikipedia OpenCL link) - but these things are evolving quickly at the time of writing (2009). From personal experience (RS), CUDA is well supported for a hands-on approach, ready to be taught to students and collaborators, and runs fairly well. OpenCL does not yet have a well-documented reference installation and is quite new.
  3. Pipeline Generators. This is what we all would like in the future. Use some drag-and-drop graphical interface in which you can combine computational elements (vector-matrix multipliers, 1/r^x for forces between particles and so on), specify the required precision (word length), combine them into a pipeline and then just let it be compiled and executed automatically on any hardware, be it GPU or FPGA or just a normal CPU. Well, there are some approaches, e.g. the very nice paper by Hamada & Fukushige 2005 (see the list of provided functions by Hamada) - but a nice graphical interface is not yet there, as far as the author of these lines knows. In the GRACE project we have been developing pipeline generators for floating point computations of variable precision - a glimpse of that can be found in this Lienhart et al. 2006 paper and in the following graphical view of two generated pipelines. These pipelines are translated into VHDL (for FPGA) or CUDA (for GPU), transparently to the user. This is work in progress.
Figure: Pipelines generated from high-level language elements. Left: example of language and graphical display for a simple path-time-law computation. Right: graphical display of one of the two SPH pipelines (pressure force; note that the finest resolution of operations is not displayed, otherwise you would see even less...). Unpublished work from the GRACE project pipeline generator by Gerhard Lienhart and collaborators of Computer Engineering at the University of Heidelberg, department Computer Science V.
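The pipeline generator idea from item 3 can be sketched in a few lines: elementary operations are chained into named stages, a word length is chosen, and the same description is then evaluated. The toy below evaluates the path-time law s = s0 + v0*t + a*t^2/2 on the CPU only. It does not generate VHDL or CUDA and is not the GRACE generator; the class, method and stage names are all hypothetical.

```python
import struct

def round_to(value, precision):
    """Round to the requested word length: 'single' (32 bit) or 'double' (64 bit)."""
    if precision == "single":
        return struct.unpack("f", struct.pack("f", value))[0]
    return float(value)

class Pipeline:
    """Chain of named arithmetic stages evaluated at a fixed precision (hypothetical)."""

    def __init__(self, precision="double"):
        self.precision = precision
        self.stages = []            # list of (output name, function, input names)

    def stage(self, out, func, *inputs):
        self.stages.append((out, func, inputs))
        return self                 # returning self allows chaining

    def run(self, **values):
        env = {k: round_to(v, self.precision) for k, v in values.items()}
        for out, func, inputs in self.stages:
            result = func(*(env[name] for name in inputs))
            env[out] = round_to(result, self.precision)   # round after every stage
        return env

def path_time_law(precision):
    """s = s0 + v0*t + a*t^2/2 built from elementary multiply/add stages."""
    mul = lambda x, y: x * y
    add = lambda x, y: x + y
    half = lambda x: 0.5 * x
    return (Pipeline(precision)
            .stage("v0t", mul, "v0", "t")
            .stage("t2", mul, "t", "t")
            .stage("at2", mul, "a", "t2")
            .stage("half_at2", half, "at2")
            .stage("s_lin", add, "s0", "v0t")
            .stage("s", add, "s_lin", "half_at2"))

s_double = path_time_law("double").run(s0=1.0, v0=2.0, a=3.0, t=4.0)["s"]
s_single = path_time_law("single").run(s0=1.0, v0=2.0, a=3.0, t=4.0)["s"]
```

Because the stage list is plain data, the same description could in principle be handed to different back ends. That separation between describing the pipeline and choosing the hardware is the point of a pipeline generator.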

  • How to use this in the Grid?

Recently many funding agencies and governments have been supporting e-science and grid infrastructures. Well developed projects, in my personal opinion, are e.g. the Japanese NAREGI and the US TeraGrid. The pan-European grid EGEE is catching up, and there are numerous national grids such as Astrogrid-UK or the Ukrainian Grid (just to give examples, the list is not exhaustive). We have contributed NBODY6 as a use case for the German Astrogrid-D. In currently ongoing projects, N-Body job scripts are translated into a portable XML-based language (JSDL) and submitted through a submit script and the GridWay Metascheduler (part of the Globus Open Source Project) to other resources. GRAPE and GPU clusters are included in our grid, and we are currently working to allow the specification of accelerator hardware in the grid job submission. If you are curious and take a careful look at the list of resources available for job submission in Astrogrid-D, you will find our two GRAPE clusters in Heidelberg (titan.ari.uni-heidelberg.de) and Kiev (golowood.mao.kiev.ua). This has been achieved through a memorandum of understanding between the Main Astronomical Observatory in Kiev (as P.I. of the Ukrainian grid) and the Astrogrid-D consortium - we consider this a prototype for future international standardized grid collaboration.
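For readers who have not seen JSDL, the following sketch builds a minimal job description with the Python standard library. The element names and namespaces follow the JSDL 1.0 specification as far as we recall it. The job name, executable and file names are made-up examples, and this is not the Astrogrid-D translation script. No accelerator request is included, because that part is still being worked out.

```python
import xml.etree.ElementTree as ET

# Namespaces of JSDL 1.0 and its POSIX application extension.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

def nbody_job(job_name, executable, input_file, output_file):
    """Return a minimal JSDL document (as a string) for one N-body run."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-posix", POSIX)
    job = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
    ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(ident, f"{{{JSDL}}}JobName").text = job_name
    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
    ET.SubElement(posix, f"{{{POSIX}}}Executable").text = executable
    ET.SubElement(posix, f"{{{POSIX}}}Input").text = input_file    # stdin
    ET.SubElement(posix, f"{{{POSIX}}}Output").text = output_file  # stdout
    return ET.tostring(job, encoding="unicode")

# Hypothetical names, chosen only for this example.
jsdl_text = nbody_job("nbody6-test", "./nbody6", "input.dat", "output.log")
```

The resulting string can be written to a file and handed to whatever submission tool the grid provides; how GridWay is invoked on a given installation is outside the scope of this sketch.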