Brook

From GPGPU.org Wiki



The Brook Language

Brook is an extension of the C language designed to make efficient use of GPUs for arithmetic-intensive computation. The language is built around the concept of streams: arrays of elements that can be updated independently of one another, and therefore in parallel.

Essential Concepts

Brook Structural Hierarchy

Here is a guide to the compilation process for brook source files.

Language Definition

Differences Between C and Brook

Keyword Extensions from C

stream definition <>
kernel functions

Functions declared with the kernel keyword are kernel functions, which map input stream elements to output stream elements. The equivalent in low-level GPGPU programming is a fragment shader. The standard syntax of a kernel function is:

kernel void MyKernel([float,...] inputStream<>, out [float, ...] outputStream<>)
{
    // this kernel function creates a copy of the input
    // multiplied by 5
    
    outputStream = inputStream * 5;
}

Most kernel functions have void return type, and these may be called from non-kernel code:

float a<SIZE>;
float b<SIZE>;
...
MyKernel( a, b ); // b[i] = a[i] * 5;

The effect of calling a kernel is that the body of the kernel is executed once for each element of the output stream (sometimes called an implicit for loop).
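The implicit for loop can be made explicit in a small CPU model. The following is a plain Python sketch, not Brook code; `my_kernel` and `call_kernel` are hypothetical names chosen to mirror the MyKernel example above.

```python
# Plain-Python model of a Brook kernel call; not Brook code.
def my_kernel(input_element):
    """Kernel body: sees a single element, like MyKernel above."""
    return input_element * 5


def call_kernel(kernel, input_stream):
    """Model of the implicit for loop: run the body once per output element."""
    output_stream = [0.0] * len(input_stream)
    for i in range(len(output_stream)):
        output_stream[i] = kernel(input_stream[i])
    return output_stream


a = [1.0, 2.0, 3.0]
b = call_kernel(my_kernel, a)
print(b)  # [5.0, 10.0, 15.0]
```

On a GPU the iterations run in parallel and in no particular order, which is why the kernel body may not depend on any other output element.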

Although not often documented, kernels may have non-void return types (cf. the getW kernel in the fft.br brook example apps). Such kernels may be used as callable subroutines by other kernels, but cannot be called directly from non-kernel code.

Within the body of a kernel function it is illegal to access global variables or call any non-kernel functions.

input stream arguments

Stream arguments with the <> brackets specify kernel input streams. Some documentation says that stream dimensionality and extents should be specified within the brackets, such as:

kernel void doSomething( float a<W,H>, float b<W/2,H/2>, out float c<W,H> ) ...

but the current BrookGPU implementation ignores these annotations, so the empty <> brackets are currently preferred. It should be noted, though, that the effects of calling a kernel with input/output streams of differing rank (dimensionality) are undefined.

It is possible to call a kernel with input streams of varying extents so long as their extent in each dimension is an integer multiple or divisor of the output stream size. Input streams are aligned to the output stream size by either repeating or skipping their elements:

kernel void copy( float x<>, out float y<> ) { y = x; }
...
float a<10>;
float b<30>;
float c<5>;
...
copy( a, b ); // b[i] = a[ floor(i/3) ];
copy( b, c ); // c[i] = b[ i*6 ];
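The repeat and skip behaviour can be sketched as a CPU model for the one-dimensional case. This is plain Python, not Brook code; `align_index` and `copy_aligned` are hypothetical helpers, and the index mapping is chosen to match the comments in the example above rather than taken from the runtime source.

```python
def align_index(i, in_extent, out_extent):
    """Map output index i to the input index it reads (1D model)."""
    if in_extent % out_extent != 0 and out_extent % in_extent != 0:
        raise ValueError("extents must be integer multiples of each other")
    # Integer arithmetic covers both cases: elements repeat when the
    # input is smaller, and are skipped when the input is larger.
    return i * in_extent // out_extent


def copy_aligned(in_stream, out_extent):
    """Model of copy( in, out ) with an output stream of out_extent elements."""
    return [in_stream[align_index(i, len(in_stream), out_extent)]
            for i in range(out_extent)]


a = list(range(10))
b = copy_aligned(a, 30)  # b[i] = a[i // 3]
c = copy_aligned(b, 5)   # c[i] = b[i * 6]
print(c)  # [0, 2, 4, 6, 8]
```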

output stream arguments

A kernel function must have one or more output stream arguments, identified with the out keyword. It is an error to have an output argument that is not a stream (a constant or gather "output"). When a kernel function is called, the rank and extents of the output stream determine the number of times the kernel body is executed. As such, for kernels with more than one output stream, the rank and extents of all output streams must match.

In the current brook system it is an error for a kernel function to fail to write to its output arguments, and the contents of any output stream element that is not written to should be considered undefined.

In addition, the brook team has never clearly stated whether in-place modification of streams (i.e. using a stream as both an input and an output argument in a kernel call) is allowed. Many current brook applications depend on this functionality to avoid having to ping-pong between two buffers, but the current implementation does have some holes. The most prominent of these is that using a single stream as both an output stream argument and a gather stream argument will usually fail (although the particle_cloth demo application does just this).

gather stream arguments

Kernel function arguments declared with the C array [ ] brackets are gather stream arguments. These arguments may be indexed in the kernel body using the C array-indexing syntax:

kernel void simpleGather( float index<>, float g[], out float result<> ) {
   result = g[ index ];
}
...
float indices<INDEX_COUNT>;
float data<DATA_COUNT>;
float result<INDEX_COUNT>;
...
simpleGather( indices, data, result ); // result[i] = data[indices[i]];

Gather stream arguments are not subject to alignment (as input stream arguments are), and thus they can have arbitrary rank and extents relative to the output stream(s). As with input stream arguments, constant extents may be placed within the [ ] brackets, but will be ignored. Unlike input stream arguments, however, the rank of a gather stream argument must be declared explicitly using [ ] for one-dimensional and [ ][ ] for two-dimensional streams.

Because integers are not supported as a stream element type at present, the gather operation for a one-dimensional stream takes a float argument. No guarantees are made about the element that will be returned from a gather with a non-integral index, however. The current policy in the GPU runtimes is to round to the nearest integer (so that g[0.99] becomes g[1]), to deal with possible precision loss.
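The round-to-nearest policy can be modelled on the CPU as follows. This is a plain Python sketch, not Brook code; `gather` and `simple_gather` are hypothetical names, and rounding exact ties upward is an assumption, since the text above only says "nearest".

```python
import math


def gather(g, index):
    """Model of g[index] with a float index, rounded to the nearest integer.

    Ties (x.5) are rounded up here; that tie rule is an assumption.
    """
    return g[int(math.floor(index + 0.5))]


def simple_gather(indices, data):
    """Model of simpleGather: result[i] = data[indices[i]]."""
    return [gather(data, index) for index in indices]


data = [10.0, 20.0, 30.0]
print(simple_gather([0.99, 1.01, 2.0], data))  # [20.0, 20.0, 30.0]
```

Both 0.99 and 1.01 land on element 1, which is how a slightly inexact index still fetches the intended element.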

Two-dimensional gather streams should be indexed with a single float2 value, as:

float result = g[ float2( xIndex, yIndex ) ]; // similar to array[yIndex][xIndex] in C

Indexing with two different float values as:

float result = g[ yIndex ][ xIndex ];

should be supported, but appears to be incorrectly implemented in the CPU runtime. Note that the order of the indices (here xIndex and yIndex) is reversed between the C-like multidimensional indexing and the new float2 indexing scheme (see the FAQ for more on this).
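The index order can be checked against a row-major array in a small CPU model. This is plain Python, not Brook code; `gather2d` is a hypothetical helper and indices are assumed to be exact integers stored as floats.

```python
def gather2d(g, xy):
    """Model of g[ float2( x, y ) ] on a stream declared as g<Y, X>."""
    x, y = xy
    return g[int(y)][int(x)]  # row-major: the y index selects the row


# A stream declared as float g<2, 3>: 2 rows (Y), 3 columns (X).
g = [[0.0, 1.0, 2.0],
     [10.0, 11.0, 12.0]]

print(gather2d(g, (2.0, 1.0)))  # 12.0, the same element as g[1][2]
```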

The brook language allows gather stream arguments to be indexed an arbitrary number of times in a kernel, although hardware targets may impose limits on the total number of fetches allowed or the number of texture indirections supported. Writing to a gather stream argument (a so-called scatter operation) is unsupported:

g[ index ] = value; // ERROR

constant arguments

Arguments without any special Brook keywords are constant arguments. This name comes from the fact that these arguments will have the same value each time the kernel body is executed for a given kernel function call. Currently only the float, float2, float3 and float4 types are supported for constant arguments. More complex types should be passed in as multiple constant arguments.

One important feature to note is that when a kernel body calls another kernel, it is possible to supply constant arguments as the actual parameters for formal stream parameters, and vice versa:

// formal parameters are a stream and a constant
kernel float subKernel( float a<>, float b ) {
   return a + b;
}
kernel void mainKernel( float u, float v<>, out float w<> ) {
   // even though the formal parameters expect a
   // stream and a constant, it is valid to pass
   // a constant and a stream
   w = subKernel( u, v );
}

The simplest way to rationalize this is to think that once you are inside a kernel body (inside the implicit for loop of the top-level kernel call), both constant and stream arguments are just single values.

iterator arguments

Iterator arguments look like input stream arguments with the additional iter keyword:

kernel void iteratorGather( iter float index<>, float data[], out float result<> ) {
   result = data[ index ];
}

Conceptually, iterator arguments should behave exactly like input stream arguments. In practice certain features (such as the automatic stride and repeat supported by input stream alignment) do not always work as consistently for iterators.

It is possible to pass an iterator in a parameter position where an input stream argument is expected. In these cases a temporary stream is created and filled in (on the CPU) with the values of the iterator before being passed to the kernel function. This is a compatibility option only and should not be used in performance-critical code.

reduce functions

A function declared with the reduce keyword is a reduce function:

reduce void sum( float value<>, reduce float result<> )
{
   result += value;
}

Reduce functions should have only two arguments: an input-only stream argument and a reduce argument to which it has read-write access. The arguments can appear in any order. It should be possible to pass in additional constant or gather arguments, although this is not a typical usage and may not be well supported in the runtime. The indexof operator should not be used inside of a reduction.

A reduce function can be called from non-kernel code to collapse a stream down to a single value:

float data<COUNT>;
float sumOfData;
...
sum( data, sumOfData ); // sumOfData = data[0] + data[1] + ... + data[COUNT-1]

or it can be used to reduce a stream down to one of a smaller size by combining blocks of elements:

float data<COUNT*3>;
float reducedData<COUNT>;
...
sum( data, reducedData ); // reducedData[i] = data[3*i] + data[3*i+1] + data[3*i+2]

In the stream-to-stream reduction case, the dimensions of the target stream should evenly divide those of the source stream.
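Both forms of reduction can be sketched as a one-dimensional CPU model. This is plain Python, not Brook code; `reduce_to_value` and `reduce_to_stream` are hypothetical names, and the left-to-right evaluation order used here is only one of the orders a Brook runtime might choose.

```python
from functools import reduce as fold
from operator import add


def reduce_to_value(op, stream):
    """Collapse a whole stream to a single value."""
    return fold(op, stream)


def reduce_to_stream(op, stream, out_extent):
    """Collapse consecutive blocks of a stream into out_extent elements."""
    if len(stream) % out_extent != 0:
        raise ValueError("target extent must evenly divide the source extent")
    block = len(stream) // out_extent
    return [fold(op, stream[i * block:(i + 1) * block])
            for i in range(out_extent)]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(reduce_to_value(add, data))      # 21.0
print(reduce_to_stream(add, data, 2))  # [6.0, 15.0]
```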

For a stream of elements e0, e1, ... eN, reduction by an arbitrary operator '^' will produce a result equivalent to some parenthesization of e0 ^ e1 ^ ... ^ eN, but no guarantee is made as to which one. This means that a portable reduction operation should be associative. In the case of 2D reductions there is no guarantee of which dimension will be reduced first, so in such cases the operation should also be commutative. Most reductions will be sum, product, minimum or maximum operations and will thus satisfy these requirements (although it should be noted that floating-point sum and product are not exactly associative, so their results may differ in the last bits depending on the evaluation order).
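The floating-point caveat is easy to reproduce on the CPU. The snippet below is plain Python using double precision (GPU runtimes typically use single precision, where the effect is larger); it shows two parenthesizations of the same three-element sum.

```python
# Two parenthesizations of e0 + e1 + e2 with the same operands.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)       # False
print(abs(left - right))   # a difference in the last bit only
```

A reduction whose result is compared for exact equality across runtimes should therefore allow a small tolerance.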

NOTE: The current implementation of reductions is not orthogonal with the rest of the Brook feature set. Reductions of streams with struct element types are not allowed, and reductions of domains are not implemented. It is possible, however, for users to implement their own reduction approach using the high-level language constructs (specifically kernel functions and stream domains).

Iterators

Iterators behave like streams in many regards, but are bound at declaration time to a particular sequence of values. For example the simple iterator declaration:

iter float it<25> = iter(0.0f,25.0f);

produces a stream containing the elements 0, 1, ..., 24.

The values of the iterator are determined by its extents and its limits, as:

iter float a<N> = iter(L0,L1); // a[i] = L0 + i*(L1-L0)/N

It can be seen that the iterator values are evenly spaced along the interval, but do not include the final value specified. It is not required that the values of the iterator be integral or increasing.
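The formula above can be evaluated directly. This is a plain Python sketch, not Brook code; `iter_values` is a hypothetical helper that lists the values an iterator with the given extent and limits would contain.

```python
def iter_values(extent, l0, l1):
    """Values of: iter float a<extent> = iter(l0, l1)."""
    return [l0 + i * (l1 - l0) / extent for i in range(extent)]


print(iter_values(25, 0.0, 25.0)[:3])  # [0.0, 1.0, 2.0]
print(iter_values(4, 1.0, 0.0))        # [1.0, 0.75, 0.5, 0.25]
```

The second call shows a decreasing, non-integral iterator; in both cases the upper limit itself is never produced.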

One-dimensional iterators can be defined for any of the "floatN" types, and will effectively behave as N distinct one-dimensional iterators. It is also possible to define two-dimensional iterators, but these are limited to be of float2 type:

iter float2 a<Y,X> = iter(float2(x0,y0),float2(x1,y1));
// a[float2(i,j)] = a[j][i] = float2( x0 + i*(x1-x0)/X, y0 + j*(y1-y0)/Y )
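The two-dimensional case follows the same formula in each component. This is a plain Python sketch, not Brook code; `iter2d_values` is a hypothetical helper, and the result is stored row-major so that `a[j][i]` matches the comment above.

```python
def iter2d_values(y_extent, x_extent, p0, p1):
    """Values of: iter float2 a<Y,X> = iter(float2(x0,y0), float2(x1,y1))."""
    x0, y0 = p0
    x1, y1 = p1
    return [[(x0 + i * (x1 - x0) / x_extent, y0 + j * (y1 - y0) / y_extent)
             for i in range(x_extent)]
            for j in range(y_extent)]


# Declared extents <2, 4> (Y first), limits given as (x, y) pairs.
a = iter2d_values(2, 4, (0.0, 0.0), (4.0, 2.0))
print(a[1][3])  # (3.0, 1.0)
```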

Because iterators are implemented in terms of texture coordinate interpolants on the GPU, it has been found that their values inside of a kernel body are sometimes inexact (so that an iterator that should produce 24.0 as its last value might produce 23.99 or 24.01). For this reason it is not recommended that the values of iterators be used directly for computation (unless they can be suitably rounded to exact values); they should instead be used only for indexing of gather streams (where the tolerance of gather operations should mask some of their imprecision).

streamScatterOp

A function used to modify elements of an existing stream by combining them with elements from another (usually smaller) stream. The elements of the original stream to modify are selected by the indices contained in the index stream argument.

Usage:

streamScatterOp(s, index_stream, newData, scatterOp = {STREAM_SCATTER_ASSIGN,STREAM_SCATTER_ADD,STREAM_SCATTER_MUL});


s - the existing stream to modify
index_stream - indices of the elements of stream s to change
newData - data to combine with the existing stream s
scatterOp - either one of the enumerated values above, which selects how the new data is combined with the old, or a user-defined reduction function

NOTE: This function is implemented using a costly CPU fallback in the current Brook GPU runtimes. While it should be possible to implement an efficient STREAM_SCATTER_ASSIGN using vertex texture or render-to-vertex-array, this has not yet been added to the Brook implementation.
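The intended semantics can be modelled on the CPU as a simple loop. This is a plain Python sketch, not the Brook runtime's fallback code; the names mirror the usage line above, and applying repeated indices in stream order is an assumption.

```python
# Python stand-ins for the enumerated scatter operations.
STREAM_SCATTER_ASSIGN = lambda old, new: new
STREAM_SCATTER_ADD = lambda old, new: old + new
STREAM_SCATTER_MUL = lambda old, new: old * new


def stream_scatter_op(s, index_stream, new_data, scatter_op):
    """Combine new_data[k] into s[index_stream[k]], modifying s in place."""
    for index, value in zip(index_stream, new_data):
        k = int(index)  # indices are stored as floats in Brook streams
        s[k] = scatter_op(s[k], value)


s = [1.0, 2.0, 3.0, 4.0]
stream_scatter_op(s, [0.0, 2.0], [10.0, 20.0], STREAM_SCATTER_ADD)
print(s)  # [11.0, 2.0, 23.0, 4.0]
```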

Gathering Data

streamGatherOp doesn't seem to work. Use the following for one dimensional streams:

kernel void myGather(out float s_out<>, float indices<>,float src[]) {
  s_out=src[indices];
}

which takes a stream of indices as input. See also domain below.

float2, float3, float4, int2, int3, int4

Structures used to contain 2, 3, and 4 floating point elements.

typedef struct float2 {
  float2(float _x, float _y) { x = _x; y = _y; }
  float2(void) {}

  float x,y;
} float2;

typedef struct float3 {
  float3(float _x, float _y, float _z) { x = _x; y = _y; z = _z; }
  float3(void) {}

  float x,y,z;
} float3;

typedef struct float4 {
  float4(float _x, float _y, float _z, float _w) {
     x = _x; y = _y; z = _z; w = _w;
  }
  float4(void) {}

  float x,y,z,w;
} float4;

An equivalent set of definitions is available for ints.

Stream Operators

domain

The domain operator is meant to be applied to streams as they are being passed to a kernel function, as in:

kernel void copy( float input<>, out float output<> ) { output = input; }
...
float a<100>;
float b<20>;
copy( b.domain( 5, 15 ), a.domain( 20, 30 ) ); // a[i+20] = b[i+5] for 0 <= i < 10

Domains of streams can be passed to kernel functions as input stream arguments, output stream arguments and gather stream arguments. The operator can be applied to 1, 2, 3, and 4 dimensional streams, using int, int2, int3, and int4 input arguments.
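For the one-dimensional case, the copy between domains shown above can be modelled on the CPU. This is a plain Python sketch, not Brook code; `copy_domain` is a hypothetical helper and only handles domains of equal length.

```python
def copy_domain(src, src_range, dst, dst_range):
    """Model of copy( src.domain(s0, s1), dst.domain(d0, d1) )."""
    (s0, s1), (d0, d1) = src_range, dst_range
    if s1 - s0 != d1 - d0:
        raise ValueError("this sketch only handles domains of equal length")
    for i in range(d1 - d0):
        dst[d0 + i] = src[s0 + i]


a = [0.0] * 100
b = [float(i) for i in range(20)]
copy_domain(b, (5, 15), a, (20, 30))  # a[i+20] = b[i+5] for 0 <= i < 10
print(a[20:30] == b[5:15])  # True
```

Elements of a outside the domain, such as a[19] and a[30], are left untouched.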

While domains were originally intended to only work at a kernel call site, they can actually be copied to other stream references (often using the brook C++ runtime API) to create persistent 'views' of a stream:

brook::stream a = brook::stream::create<float>( 100 );
brook::stream b = a.domain( 20, 30 );
...
copy( data, b ); // writes to both b and a

Writing to any view of a stream (including the original stream variable) will make changes visible to all other views of that stream.
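Python's memoryview slices share storage in the same way, which makes them a convenient CPU analogy for persistent views. This is a plain Python sketch, not the brook C++ runtime API.

```python
from array import array

# 'a' plays the role of the original stream, 'b' the view a.domain( 20, 30 ).
a = memoryview(array('d', [0.0] * 100))
b = a[20:30]  # a slice of a memoryview shares storage with the original

b[3] = 7.0    # write through the view ...
a[25] = 9.0   # ... and through the original

print(a[23], b[5])  # 7.0 9.0
```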

NOTE: Domain operations cannot be applied within a kernel body.

NOTE: Reductions of domains are currently unimplemented.

NOTE: The domain is specified as a half-open interval:

t = s.domain( 9, 15 ); // returns elements [9,15) of s, that is s[9],s[10], ... s[14]

NOTE: When using domain on multidimensional streams, remember that the order of dimensions in the stream declaration and the gather statements is reversed:

float s< Y_EXTENT, X_EXTENT >; // order: y, x
...
t = s.domain( int2(x_min,y_min), int2(x_max,y_max) ); // order: x, y
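The reversed order can be checked against a row-major array in a CPU model. This is plain Python, not Brook code; `domain2d` is a hypothetical helper, and unlike a real domain it returns a copy rather than a view.

```python
def domain2d(s, xy_min, xy_max):
    """Model of s.domain( int2(x_min,y_min), int2(x_max,y_max) )."""
    (x_min, y_min), (x_max, y_max) = xy_min, xy_max
    # s is stored row-major, so y selects rows and x selects columns.
    return [row[x_min:x_max] for row in s[y_min:y_max]]


# A stream declared as float s<3, 4>: Y_EXTENT = 3, X_EXTENT = 4.
s = [[10 * y + x for x in range(4)] for y in range(3)]
t = domain2d(s, (1, 0), (3, 2))  # x in [1,3), y in [0,2)
print(t)  # [[1, 2], [11, 12]]
```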

indexof

When used inside a kernel function, this returns the index of the current element in a given stream.

[int, int2, int3, int4] indexof(stream);

The return value type is determined by the dimension of the input stream.

streamSize

Returns the length of each dimension of a stream as a float4.

float4 streamSize(stream);

If the stream has fewer than four dimensions, the lengths appear in x, in x and y, or in x, y and z.

streamRead

Read the contents of a C array into a stream.

streamRead(stream, const void* input);

streamWrite

Write the contents of a stream to a C array.

streamWrite(stream, void* output);

streamSwap

Swaps the contents of two streams.

streamSwap(stream1, stream2);

Restrictions Imposed by Brook

no global variables
no recursion

Installing Brook

Brook can be downloaded from here. It's best to follow the instructions provided in the file called QUICK_START.txt that gets placed in the project's root folder. The online guide to getting started is out of date.

Installation FAQs

Q: I get the following error when I try to build Brook from the command line with Cygwin installed:

... cannot create link ...

A: Move the directory for Microsoft C++ link.exe ahead of the cygwin\bin directory.

Using Brook

Using Brook with Visual Studio 7.0

What I did follows.

Once you've added the Microsoft Visual C++ Toolkit 2003, you'll need to update the path information that the IDE uses to run the compiler and obtain common include and lib files. Just go to Tools/Options/Projects/VC++ Directories and insert the path to the new cl.exe at the front of the list for "executable files". Do the same for the 2003 include and lib directories. Also add the path to the cg compiler to the "executable files" list.

I created an empty console-application project in Visual Studio. I have a single .br file to start with. I compiled the brook file via brcc at the command line. This creates a .cpp file, which I include along with the .br file in the empty project. Select the properties for the .br file, and enter a custom build step. Put the full path to the brook compiler in the "Command line" field, and the name of the output .cpp file in the "Outputs" field. Add the Brook include path to your project's "Additional Include Directories".

In project properties under Linker/General, add the path to the bin directory of the brook project, and the path to the DirectX lib directory. Add the following to Linker/Inputs/Additional Dependencies: brook.lib d3d9.lib d3dx9.lib opengl32.lib

This worked for me.

Future Directions

Support for struct input arguments to kernel functions (currently use float4).

FAQs

A list of FAQs is provided in the most recently published Brook Specification.

Q: What is up with specifying stream dimensions like:

float myStream<Y, X>;

and then gathering from it like:

myStream[ float2(x,y) ];

shouldn't the order of arguments be the same?

A: This decision has caused no end of confusion among Brook users, and even the developers mess this up from time to time. The logic for this decision comes from the fact that streams are conceptually packed in row-major order and thus match C arrays. In addition, most graphics APIs (and graphics programmers) store images and textures so that the most rapidly varying dimension is the width/X dimension.

Q: My code compiled and linked fine. It even worked when I set the BRT_RUNTIME env variable to cpu. When I switch it to dx9, and run my program I get the error

Brook Runtime (gpu) - gpukernel.cpp(381): No appropriate map technique found Assertion failed: false

A: One possibility is that the streams copied to the GPU are so large that the lookup of an address-translation version of the current kernel function fails. (see: this posting)

Q: I want to make streams more than 2 dimensions, or with one dimension larger than my GPU's maximum texture size. Should I use the automatic address-translation capabilities of the brook compiler and runtime?

A: No. Because of the design of the brook tool chain, the compiler is overly conservative when generating address-translation versions of kernels. This can greatly increase the instruction count and constant register usage of kernels. In addition, the code for handling 3- and 4-dimensional streams has a number of issues. These features should be considered deprecated.

External References
