CUDPP CTA-Level API


Classes

class  ScanTraits< T, oper, backward, exclusive, multiRow, sums, fullBlock >
 Template class containing compile-time parameters to the scan functions. More...
class  SegmentedScanTraits< T, oper, backward, exclusivity, unroll, fullBlock, sums >
 Template class containing compile-time parameters to the segmented scan functions. More...

Sort Functions

typedef ScanTraits< unsigned
int, CUDPP_ADD, false, true,
false, false, true > 
SplitScanTraits
 Traits class used to specify options to the CTA-level scan used by split().
template<class T>
__device__ void isNotSorted (int *s_notSorted, T *s_data, T *s_temp)
 Tests if input s_data is sorted. Requires equally-sized s_temp. Does not alter s_data.
template<class T>
__device__ void reduce_sum (T *s_data)
 Parallel Sum Reduction. Adds up all elements in s_data.
template<typename T>
__device__ void setFlagBit (unsigned int *s_flags, const T *s_in, unsigned int bit, bool both=true)
 Outputs 1 for each element whose bit is set, 0 for each element whose bit is not set.
template<class T>
__device__ void split (T *s_data, unsigned int *s_addr)
 Splits s_data in place based on flags in s_addr. Data corresponding to a false flag is packed to the left of the output, s_data corresponding to a true flag is packed to the right.
template<class T>
__device__ void mergeW (T *d_out, T *s_buffer_wa, T *s_buffer_wb, unsigned int remaining)
 Merges 2 sorted sequences in s_buffer_wa and s_buffer_wb into d_out.

Scan Functions

template<class T, class traits>
__device__ void loadSharedChunkFromMem4 (T *s_out, T threadScan0[4], T threadScan1[4], const T *d_in, int numElements, int iDataOffset, int &ai, int &bi, int &aiDev, int &biDev)
 Handles loading input s_data from global memory to shared memory (vec4 version).
template<class T, class traits>
__device__ void storeSharedChunkToMem4 (T *d_out, T threadScan0[4], T threadScan1[4], T *s_in, int numElements, int oDataOffset, int ai, int bi, int aiDev, int biDev)
 Handles storing result s_data from shared memory to global memory (vec4 version).
template<class T, class traits, int maxlevel>
__device__ T warpscan (T val, volatile T *s_data)
 Scan all warps of a CTA without synchronization.
template<class T, class traits>
__device__ void scanWarps (T x, T y, T *s_data)
 Perform a full CTA scan using the warp-scan algorithm.
template<class T, class traits>
__device__ void scanCTA (T *s_data, T *d_blockSums, unsigned int blockSumIndex)
 CTA-level scan routine; scans s_data in shared memory in each thread block.
#define __SYNC
 Macro to insert necessary __syncthreads() in device emulation mode.
#define DISALLOW_LOADSTORE_OVERLAP   1

Segmented scan Functions

template<class T, typename traits>
__device__ void loadForSegmentedScanSharedChunkFromMem4 (T *s_odata, T threadScan0[4], T threadScan1[4], unsigned int &threadFlag, unsigned int *s_oflags, unsigned int *s_otflags, unsigned int *s_oindices, const T *d_idata, const unsigned int *d_iflags, int numElements, int iDataOffset, int &ai, int &bi, int &aiDev, int &biDev, int &bankOffsetA, int &bankOffsetB)
 Handles loading input s_data from global memory to shared memory (vec4 version).
template<class T, unsigned int ctasize, class traits>
__device__ unsigned int buildSegmentedSumTree (T *s_data, unsigned int *s_indices, unsigned int *s_tflags)
 Down-sweep (aka reduce) phase of CTA-level segmented scan in shared memory.
template<class T, class traits>
__device__ void clearLastElementAndStoreSumFlagIndex (T *s_data, unsigned int *s_tflags, unsigned int *s_indices, T *d_blockSums, unsigned int *d_blockFlags, unsigned int *d_blockIndices)
 Intermediate phase between down- and up-sweep phases of CTA-level segmented scan.
template<class T, unsigned int ctasize, class traits>
__device__ void segmentedScanRootToLeaves (T *s_data, unsigned int *s_flags, unsigned int *s_tflags, unsigned int stride)
 Up-sweep phase of CTA-level segmented scan in shared memory.
template<class T, class traits>
__device__ void storeForSegmentedScanSharedChunkToMem4 (T *d_odata, T threadScan0[4], T threadScan1[4], unsigned int threadFlag, T *s_idata, unsigned int numElements, int oDataOffset, int ai, int bi, int aiDev, int biDev, int bankOffsetA, int bankOffsetB)
 Handles storing result s_data from shared memory to global memory (vec4 version).
template<class T, class traits>
__device__ void segmentedScanCTA (T *s_data, unsigned int *s_flags, unsigned int *s_tflags, unsigned int *s_indices, T *d_blockSums=0, unsigned int *d_blockFlags=0, unsigned int *d_blockIndices=0)
 CTA-level segmented scan routine;.
#define _SUM_SEGMENTED_STEP_NOSYNC(d)
#define _SUM_SEGMENTED_STEP_SYNC(d)
#define _SUM_SEGMENTED_STEP   _SUM_SEGMENTED_STEP_NOSYNC
#define SUM_SEGMENTED_STEP_SYNC(d)
#define SUM_SEGMENTED_STEP(d)
#define _SCAN_SEGMENTED_STEP_NOSYNC(d)
#define _SCAN_SEGMENTED_STEP_SYNC(d)
#define _SCAN_SEGMENTED_STEP   _SCAN_SEGMENTED_STEP_NOSYNC
#define SCAN_SEGMENTED_STEP_SYNC(d)
#define SCAN_SEGMENTED_STEP(d)

Detailed Description

The CUDPP CTA-Level API contains functions that run on the GPU device. These are CUDA __device__ functions that are called from within other CUDA device functions (typically CUDPP Kernel-Level API functions). They are called CTA-level functions because they typically process s_data "owned" by each CTA within shared memory, and are agnostic of any other CTAs that may be running (or how many CTAs are running), other than to compute appropriate global memory addresses.

Define Documentation

#define DISALLOW_LOADSTORE_OVERLAP   1

This is used to insert syncthreads to avoid perf loss caused by 128-bit load overlap that happens on G80. This gives about a 15% boost on scans on G80.

Todo:
Parameterize this in case this perf detail changes on future GPUs.


Typedef Documentation

typedef ScanTraits<unsigned int, CUDPP_ADD, false, true, false, false, true> SplitScanTraits

Traits class used to specify options to the CTA-level scan used by split().

The split scan is an enumerate scan. In other words it is a forward, exclusive, single-row, single-block (actually a separate scan per CTA) sum scan, and all blocks are full (i.e. all threads are active).


Function Documentation

template<class T, class traits>
__device__ void loadSharedChunkFromMem4 ( T *  s_out,
threadScan0[4],
threadScan1[4],
const T *  d_in,
int  numElements,
int  iDataOffset,
int &  ai,
int &  bi,
int &  aiDev,
int &  biDev 
) [inline]

Handles loading input s_data from global memory to shared memory (vec4 version).

Load a chunk of 8*blockDim.x elements from global memory into a shared memory array. Each thread loads two T4 elements (where T4 is, e.g. int4 or float4), computes the scan of those two vec4s in thread local arrays (in registers), and writes the two total sums of the vec4s into shared memory, where they will be cooperatively scanned with the other partial sums by all threads in the CTA.

Parameters:
[out] s_out The output (shared) memory array
[out] threadScan0 Intermediate per-thread partial sums array 1
[out] threadScan1 Intermediate per-thread partial sums array 2
[in] d_in The input (device) memory array
[in] numElements The number of elements in the array being scanned
[in] iDataOffset the offset of the input array in global memory for this thread block
[out] ai The shared memory address for the thread's first element (returned for reuse)
[out] bi The shared memory address for the thread's second element (returned for reuse)
[out] aiDev The device memory address for this thread's first element (returned for reuse)
[out] biDev The device memory address for this thread's second element (returned for reuse)

template<class T, class traits>
__device__ void storeSharedChunkToMem4 ( T *  d_out,
threadScan0[4],
threadScan1[4],
T *  s_in,
int  numElements,
int  oDataOffset,
int  ai,
int  bi,
int  aiDev,
int  biDev 
) [inline]

Handles storing result s_data from shared memory to global memory (vec4 version).

Store a chunk of SCAN_ELTS_PER_THREAD*blockDim.x elements from shared memory into a device memory array. Each thread stores reads two elements from shared memory, adds them to the intermediate sums computed in loadSharedChunkFromMem4(), and writes two T4 elements (where T4 is, e.g. int4 or float4) to global memory.

Parameters:
[out] d_out The output (device) memory array
[in] threadScan0 Intermediate per-thread partial sums array 1 (contents computed in loadSharedChunkFromMem4())
[in] threadScan1 Intermediate per-thread partial sums array 2 (contents computed in loadSharedChunkFromMem4())
[in] s_in The input (shared) memory array
[in] numElements The number of elements in the array being scanned
[in] oDataOffset the offset of the output array in global memory for this thread block
[in] ai The shared memory address for the thread's first element (computed in loadSharedChunkFromMem4())
[in] bi The shared memory address for the thread's second element (computed in loadSharedChunkFromMem4())
[in] aiDev The device memory address for this thread's first element (computed in loadSharedChunkFromMem4())
[in] biDev The device memory address for this thread's second element (computed in loadSharedChunkFromMem4())

template<class T, class traits, int maxlevel>
__device__ T warpscan ( val,
volatile T *  s_data 
) [inline]

Scan all warps of a CTA without synchronization.

The warp-scan algorithm breaks a block of data into warp-sized chunks, and scans the chunks independently with a warp of threads each. Because warps execute instructions in SIMD fashion, there is no need to synchronize in order to share data within a warp (only across warps). Also, in SIMD the most efficient algorithm is a step-efficient algorithm. Therefore, within each warp we use a Hillis-and-Steele-style scan that takes log2(N) steps to scan the warp [Daniel Hillis and Guy Steele 1986], rather than the work-efficient tree-based algorithm described by Guy Blelloch [1990] that takes 2 * log(N) steps and is in general more complex to implement. Previous versions of CUDPP used the Blelloch algorithm. For current GPUs, the warp size is 32, so this takes five steps per warp.

Each thread is responsible for a single element of the array to be scanned. Each thread inputs a single value to the scan via val and returns its own scanned result element. The threads of each warp cooperate via the shared memory array s_data to scan WARP_SIZE elements.

Template parameter maxlevel allows this warpscan to be performed on partial warps. For example, if only the first 8 elements of each warp need to be scanned, then warpscan only performs log2(8)=3 steps rather than 5.

The computation uses 2 * WARP_SIZE elements of shared memory per warp to enable warps to offset beyond their input data and receive the identity element without using any branch instructions.

Note:
s_data is declared volatile here to prevent the compiler from optimizing away writes to shared memory, and ensure correct intrawarp communication in the absence of __syncthreads.
Returns:
The result of the warp scan for the current thread
Parameters:
[in] val The current threads's input to the scan
[in,out] s_data A pointer to a temporary shared array of 2*CTA_SIZE elements used to compute the warp scans

template<class T, class traits>
__device__ void scanWarps ( x,
y,
T *  s_data 
) [inline]

Perform a full CTA scan using the warp-scan algorithm.

As described in the comment for warpscan(), the warp-scan algorithm breaks a block of data into warp-sized chunks, and scans the chunks independently with a warp of threads each. To complete the scan, each warp j then writes its last element to element j of a temporary shared array. Then a single warp exclusive-scans these "warp sums". Finally, each thread adds the result of the warp sum scan to the result of the scan from the first pass.

Because we scan 2*CTA_SIZE elements per thread, we have to call warpscan twice.

Parameters:
x The first input value for the current thread
y The second input value for the current thread
s_data Temporary shared memory space of 2*CTA_SIZE elements for performing the scan

template<class T, class traits>
__device__ void scanCTA ( T *  s_data,
T *  d_blockSums,
unsigned int  blockSumIndex 
) [inline]

CTA-level scan routine; scans s_data in shared memory in each thread block.

This function is the main CTA-level scan function. It may be called by other CUDA __global__ or __device__ functions. This function scans 2 * CTA_SIZE elements. Each thread is responsible for one element in each half of the input array.

Note:
This code is intended to be run on a CTA of 128 threads. Other sizes are untested.
Parameters:
[in] s_data The array to be scanned in shared memory
[out] d_blockSums Array of per-block sums
[in] blockSumIndex Location in d_blockSums to which to write this block's sum

template<class T, typename traits>
__device__ void loadForSegmentedScanSharedChunkFromMem4 ( T *  s_odata,
threadScan0[4],
threadScan1[4],
unsigned int &  threadFlag,
unsigned int *  s_oflags,
unsigned int *  s_otflags,
unsigned int *  s_oindices,
const T *  d_idata,
const unsigned int *  d_iflags,
int  numElements,
int  iDataOffset,
int &  ai,
int &  bi,
int &  aiDev,
int &  biDev,
int &  bankOffsetA,
int &  bankOffsetB 
) [inline]

Handles loading input s_data from global memory to shared memory (vec4 version).

Load a chunk of 8*blockDim.x elements from global memory into a shared memory array. Each thread loads two T4 elements (where T4 is, e.g. int4 or float4), computes the segmented scan of those two vec4s in thread local arrays (in registers), and writes the two total sums of the vec4s into shared memory, where they will be cooperatively scanned with the other partial sums by all threads in the CTA.

Parameters:
[out] s_odata The output (shared) memory array
[out] threadScan0 Intermediate per-thread partial sums array 1
[out] threadScan1 Intermediate per-thread partial sums array 2
[out] threadFlag Intermediate array which holds 8 flags as follows Temporary register threadFlag0[4] - the flags for the first 4 elements read Temporary register threadFlag1[4] - the flags for the second 4 elements read Temporary register threadScanFlag0[4] - the inclusive OR-scan for the flags in threadFlag0[4] Temporary register threadScanFlag1[4] - the inclusive OR-scan for the flags in threadFlag1[4] We storing the 16 flags 32 bits of threadFlag Bits 0...3 contains threadFlag0[0]...threadFlag0[3] Bits 4...7 contains threadFlag1[0]...threadFlag1[3] Bits 8...11 contains threadScanFlag0[0]...threadScanFlag0[3] Bits 11...15 contains threadScanFlag1[0]...threadScanFlag1[3]
[out] s_oflags Output (shared) memory array of segment head flags
[out] s_otflags Output (shared) temporary array of modifiable head flags
[out] s_oindices Output (shared) memory array of indices. If a flag for a position (1-based) is set then index for that position is the position, 0 otherwise.
[in] d_idata The input (device) memory array
[in] d_iflags The input (device) memory array of segment head flags
[in] numElements The number of elements in the array being scanned
[in] iDataOffset the offset of the input array in global memory for this thread block
[out] ai The shared memory address for the thread's first element (returned for reuse)
[out] bi The shared memory address for the thread's second element (returned for reuse)
[out] aiDev The device memory address for this thread's first element (returned for reuse)
[out] biDev The device memory address for this thread's second element (returned for reuse)
[out] bankOffsetA the offset to ai used to alleviate bank conflicts (returned for reuse)
[out] bankOffsetB the offset to bi used to alleviate bank conflicts (returned for reuse)

template<class T, unsigned int ctasize, class traits>
__device__ unsigned int buildSegmentedSumTree ( T *  s_data,
unsigned int *  s_indices,
unsigned int *  s_tflags 
) [inline]

Down-sweep (aka reduce) phase of CTA-level segmented scan in shared memory.

This is the first phase of the balanced tree scan operation.

Parameters:
[in] s_data Shared memory array in which to build the sum tree.
[in] s_indices Shared memory array in which to build the index tree. The index tree is only used in this phase to calculate a min-reduction of the indices.
[in] s_tflags Shared memory array in which to build the flags tree.
Returns:
The stride between elements at the root of the tree

template<class T, class traits>
__device__ void clearLastElementAndStoreSumFlagIndex ( T *  s_data,
unsigned int *  s_tflags,
unsigned int *  s_indices,
T *  d_blockSums,
unsigned int *  d_blockFlags,
unsigned int *  d_blockIndices 
) [inline]

Intermediate phase between down- and up-sweep phases of CTA-level segmented scan.

Copy the last element of a shared memory array into the device memory array d_blockSums at blockIndex, and then set the last element of the shared memory array to the identity element.

Parameters:
[in] s_data The shared memory array.
[in] s_tflags The shared memory read-write flags array.
[in] s_indices The shared memory index array.
[out] d_blockSums The array of block sums to which the last element (the sum) is written.
[out] d_blockFlags The array of block flags to which the last element (the OR-reduction) is written.
[out] d_blockIndices The array of block indices to which the last element (the min-reduction) is written.

template<class T, unsigned int ctasize, class traits>
__device__ void segmentedScanRootToLeaves ( T *  s_data,
unsigned int *  s_flags,
unsigned int *  s_tflags,
unsigned int  stride 
) [inline]

Up-sweep phase of CTA-level segmented scan in shared memory.

This is the second phase of the balanced tree scan operation.

Parameters:
[in] s_data The shared memory array to scan.
[in] s_flags The shared memory array of read-only flags.
[in] s_tflags The shared memory array of read-write flags.
[in] stride between elements at the starting level of the tree (returned by buildSegmentedSumTree())

template<class T, class traits>
__device__ void storeForSegmentedScanSharedChunkToMem4 ( T *  d_odata,
threadScan0[4],
threadScan1[4],
unsigned int  threadFlag,
T *  s_idata,
unsigned int  numElements,
int  oDataOffset,
int  ai,
int  bi,
int  aiDev,
int  biDev,
int  bankOffsetA,
int  bankOffsetB 
) [inline]

Handles storing result s_data from shared memory to global memory (vec4 version).

Store a chunk of 8*blockDim.x elements from shared memory into a device memory array. Each thread stores reads two elements from shared memory, adds them while respecting segment bouldaries, to the intermediate sums computed in loadForSegmentedScanSharedChunkFromMem4(), and writes two T4 elements (where T4 is, e.g. int4 or float4) to global memory.

Parameters:
[out] d_odata The output (device) memory array
[out] threadScan0 Intermediate per-thread partial sums array 1 (contents computed in loadForSegmentedScanSharedChunkFromMem4())
[in] threadScan1 Intermediate per-thread partial sums array 2 (contents computed in loadForSegmentedScanSharedChunkFromMem4())
[in] threadFlag Various flags that loadForSegmentedScanSharedChunkFromMem4() needs to pass
[in] s_idata The input (shared) memory array
[in] numElements The number of elements in the array being scanned
[in] oDataOffset the offset of the output array in global memory for this thread block
[in] ai The shared memory address for the thread's first element (computed in loadForSegmentedScanSharedChunkFromMem4())
[in] bi The shared memory address for the thread's second element (computed in loadForSegmentedScanSharedChunkFromMem4())
[in] aiDev The device memory address for this thread's first element (computed in loadForSegmentedScanSharedChunkFromMem4())
[in] biDev The device memory address for this thread's second element (computed in loadForSegmentedScanSharedChunkFromMem4())
[in] bankOffsetA the offset to ai used to alleviate bank conflits (computed in loadForSegmentedScanSharedChunkFromMem4())
[in] bankOffsetB the offset to bi used to alleviate bank conflits (computed in loadForSegmentedScanSharedChunkFromMem4())

template<class T, class traits>
__device__ void segmentedScanCTA ( T *  s_data,
unsigned int *  s_flags,
unsigned int *  s_tflags,
unsigned int *  s_indices,
T *  d_blockSums = 0,
unsigned int *  d_blockFlags = 0,
unsigned int *  d_blockIndices = 0 
) [inline]

CTA-level segmented scan routine;.

Performs segmented scan on s_data in shared memory in each thread block with head flags in s_flags (s_tflags is a read-write copy of the head flags which are modified).

This function is the main CTA-level segmented scan function. It may be called by other CUDA __global__ or __device__ functions.

Note:
This code is intended to be run on a CTA of 128 threads. Other sizes are untested.
Parameters:
[in] s_data Array to be scanned in shared memory
[in] s_flags Read-only version of flags in shared memory
[in] s_tflags Read-write (temporary) version of flags in shared memory
[in] s_indices Temporary read-write indices array
[out] d_blockSums Array of per-block sums
[out] d_blockFlags Array of per-block OR-reduction of flags
[out] d_blockIndices Array of per-block min-reduction of indices

template<class T>
__device__ void isNotSorted ( int *  s_notSorted,
T *  s_data,
T *  s_temp 
) [inline]

Tests if input s_data is sorted. Requires equally-sized s_temp. Does not alter s_data.

Compares each element to its neighbor, generating a 1 for out-of-order, 0 for in-order; this is input to a sum tree generating the final s_notSorted. Should really use an and-tree.

Parameters:
[in] s_data Data to be tested; will not be changed.
[in] s_temp Temporary buffer, same size as s_data.
[out] s_notSorted Contains return value
Returns:
Return value in s_notSorted, 0 if sorted, >0 if not

template<class T>
__device__ void reduce_sum ( T *  s_data  )  [inline]

Parallel Sum Reduction. Adds up all elements in s_data.

Todo:
[MJH] obviously reductions deserve to be first class citizens of CUDPP.
Assumes block is fully populated.

Parameters:
[in,out] s_data Contains both input and output
Returns:
Return value is in s_data[0]

template<typename T>
__device__ void setFlagBit ( unsigned int *  s_flags,
const T *  s_in,
unsigned int  bit,
bool  both = true 
) [inline]

Outputs 1 for each element whose bit is set, 0 for each element whose bit is not set.

Processes two elements per thread. Note that it will first process one element per thread for all threads, then the second element per thread.

Parameters:
[out] s_flags Output flag per element: 1 if bit is set in input element, 0 if bit is not set
[in] s_in Input s_data
[in] bit Which bit is checked (one-hot)
[in] both Set two flag bits if checked, otherwise just one
See also:
radixGlobalSetup_kernel

template<class T>
__device__ void split ( T *  s_data,
unsigned int *  s_addr 
) [inline]

Splits s_data in place based on flags in s_addr. Data corresponding to a false flag is packed to the left of the output, s_data corresponding to a true flag is packed to the right.

split() uses an input flag per element (s_addr) to divide the elements in the input into two parts, all the elements with "false" flags in the left part and all the elements with "true" flags in the right part. New head flags are set and new segments are created.

Example: input: [f t t f f t f t f t f] output: [f f f f f f][t t t t t]

Algorithm: We assume the flags are previously input in s_addr, but are the wrong polarity

  1. Set a "1" in all false elements and a "0" for all true elements e = [1 0 0 1 1 0 1 0 1 0 1] This is done when filling the s_scanflags structure
  2. Seeded-enumerate (seeded-sum-scan) those elements f = [0 1 1 1 2 3 3 4 4 5 5] This is the scatter address for falses This is done in-place in s_scanflags, and register variables flag0 and flag1 hold the flags
  3. Add e+f; last element of each segment is # of falses in that segment plus the index of the head element of that segment e+f = [x x x x x x x x x 6]
  4. Back-copy that last element in each segment to all elements b = [6 6 6 6 6 6 6 6 6 6 6] Steps 3 and 4 are done with a shared totalFalses variable
  5. Calculate the scatter address for trues, thid-f+b t = [6 6 7 8 8 8 9 9 10 10 11]
  6. Select between t and f depending on input t/f in= [f t t f f t f t f t f] t = [6 6 7 8 8 8 9 9 10 10 11] f = [0 1 1 1 2 3 3 4 4 5 5] a = [0 6 7 1 2 8 3 9 4 10 5] Note a is the scatter address. Steps 5 and 6 are done together with an if statement; 't' is calculated and set in s_addr if flag0/1 is set
  7. Scatter to the output (s_data)

Parameters:
[in,out] s_data Input and output s_data go here
[in] s_addr Flags are initially placed here; they are replaced with the computed addresses used for final scatter
See also:
radixSort_kernel

template<class T>
__device__ void mergeW ( T *  d_out,
T *  s_buffer_wa,
T *  s_buffer_wb,
unsigned int  remaining 
) [inline]

Merges 2 sorted sequences in s_buffer_wa and s_buffer_wb into d_out.

Pairwise-compares s_buffer_wa and s_buffer_wb to create a bitonic sequence, then uses bitonic sort to merge them.

Typical usage would be s_buffer_wa and s_buffer_wb in shared memory and d_out in main GPU memory.

Assumes number of threads is a power of 2.

Parameters:
[out] d_out Output s_in (for this pass) in GPU main memory
[in] s_buffer_wa Input buffer 1 (sorted small to large)
[in] s_buffer_wb Input buffer 2 (sorted large to small)
[in] remaining How many items are left to merge? Don't write too many.
See also:
mergeChunks


Generated on Sun Apr 20 19:51:33 2008 for CUDPP by  doxygen 1.5.5