Compact Functions | |
| void | calculatCompactLaunchParams (const unsigned int numElements, unsigned int &numThreads, unsigned int &numBlocks, unsigned int &numEltsPerBlock) |
| Calculate launch parameters for compactArray(). | |
| template<class T> | |
| void | compactArray (T *d_out, size_t *d_numValidElements, const T *d_in, const unsigned int *d_isValid, size_t numElements, const CUDPPCompactPlan *plan) |
| Compact the non-zero elements of an array. | |
| void | allocCompactStorage (CUDPPCompactPlan *plan) |
| Allocate intermediate arrays used by cudppCompact(). | |
| void | freeCompactStorage (CUDPPCompactPlan *plan) |
| Deallocate intermediate storage used by cudppCompact(). | |
| void | cudppCompactDispatch (void *d_out, size_t *d_numValidElements, const void *d_in, const unsigned int *d_isValid, size_t numElements, const CUDPPCompactPlan *plan) |
| Dispatch compactArray for the specified datatype. | |
Scan Functions | |
| template<class T, bool isBackward, bool isExclusive, CUDPPOperator op> | |
| void | scanArrayRecursive (T *d_out, const T *d_in, T **d_blockSums, size_t numElements, size_t numRows, const size_t *rowPitches, int level) |
| Perform recursive scan on arbitrary size arrays. | |
| void | allocScanStorage (CUDPPScanPlan *plan) |
| Allocate intermediate arrays used by scan. | |
| void | freeScanStorage (CUDPPScanPlan *plan) |
| Deallocate intermediate block sums arrays in a CUDPPScanPlan object. | |
| void | cudppScanDispatch (void *d_out, const void *d_in, size_t numElements, size_t numRows, const CUDPPScanPlan *plan) |
| Dispatch function to perform a scan (prefix sum) on an array with the specified configuration. | |
Segmented Scan Functions | |
| template<class T, CUDPPOperator op, bool isBackward, bool isExclusive> | |
| void | segmentedScanArrayRecursive (T *d_out, const T *d_idata, const unsigned int *d_iflags, T **d_blockSums, unsigned int **d_blockFlags, unsigned int **d_blockIndices, int numElements, int level) |
| Perform recursive scan on arbitrary size arrays. | |
| void | allocSegmentedScanStorage (CUDPPSegmentedScanPlan *plan) |
| Allocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class. | |
| void | freeSegmentedScanStorage (CUDPPSegmentedScanPlan *plan) |
| Deallocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class. | |
| void | cudppSegmentedScanDispatch (void *d_out, const void *d_idata, const unsigned int *d_iflags, int numElements, const CUDPPSegmentedScanPlan *plan) |
| Dispatch function to perform a scan (prefix sum) on an array with the specified configuration. | |
Sort Functions | |
| template<class T> | |
| void | populateSortPlan (CUDPPSortPlan *plan) |
| From the programmer-specified sort configuration, generate the internal CUDPPSortConfigInternal structure for performing the sort. | |
| void | allocSortStorage (CUDPPSortPlan *plan) |
| Allocate intermediate storage for sort. | |
| void | freeSortStorage (CUDPPSortPlan *plan) |
| Deallocate intermediate storage for sort. | |
| template<class T> | |
| __host__ void | sortChunks (T *d_out, const T *d_in, const CUDPPSortPlan *plan) |
| Sorts input in chunks. | |
| template<class T> | |
| __host__ void | mergeChunks (T *d_data, const CUDPPSortPlan *plan) |
| Merges sorted chunks together into fully sorted output. | |
| template<class T> | |
| void | sortRadixGlobal (T *d_out, const T *d_in, size_t numElements, const CUDPPSortPlan *plan) |
| Performs global radix sort, with no merge. | |
| template<class T> | |
| void | sortApp (T *d_out, const T *d_in, size_t numElements, const CUDPPSortPlan *plan) |
| Main sort routine. | |
| void | cudppSortDispatch (void *d_out, const void *d_in, size_t numElements, const CUDPPSortPlan *plan) |
| Dispatch sortApp() for the specified datatype. | |
Sparse Matrix-Vector Multiply Functions | |
| template<class T> | |
| void | sparseMatrixVectorMultiply (T *d_y, const T *d_x, const CUDPPSparseMatrixVectorMultiplyPlan *plan) |
| Perform matrix-vector multiply for sparse matrices and vectors of arbitrary size. | |
| void | allocSparseMatrixVectorMultiplyStorage (CUDPPSparseMatrixVectorMultiplyPlan *plan, const void *A, const unsigned int *rowindx, const unsigned int *indx) |
| Allocate intermediate product, flags and rowFindx (index of the last element of each row) array . | |
| void | freeSparseMatrixVectorMultiplyStorage (CUDPPSparseMatrixVectorMultiplyPlan *plan) |
| Deallocate intermediate product, flags and rowFindx (index of the last element of each row) array . | |
| void | cudppSparseMatrixVectorMultiplyDispatch (void *d_y, const void *d_x, const CUDPPSparseMatrixVectorMultiplyPlan *plan) |
| Dispatch function to perform a sparse matrix-vector multiply with the specified configuration. | |
| void calculatCompactLaunchParams | ( | const unsigned int | numElements, | |
| unsigned int & | numThreads, | |||
| unsigned int & | numBlocks, | |||
| unsigned int & | numEltsPerBlock | |||
| ) |
Calculate launch parameters for compactArray().
Calculates the block size and number of blocks from the total number of elements and the maximum threads per block. Called by compactArray().
The calculation is pretty straightforward - the number of blocks is calculated by dividing the number of input elements by the product of the number of threads in each CTA and the number of elements each thread will process. numThreads and numEltsPerBlock are also simple to calculate. Please note that in cases where numElements is not an exact multiple of SCAN_ELTS_PER_THREAD * CTA_SIZE we would have threads which do nothing or have a thread which will process less than SCAN_ELTS_PER_THREAD elements.
| [in] | numElements | Number of elements to sort |
| [out] | numThreads | Number of threads in each block |
| [out] | numBlocks | Number of blocks |
| [out] | numEltsPerBlock | Number of elements processed per block |
| void compactArray | ( | T * | d_out, | |
| size_t * | d_numValidElements, | |||
| const T * | d_in, | |||
| const unsigned int * | d_isValid, | |||
| size_t | numElements, | |||
| const CUDPPCompactPlan * | plan | |||
| ) | [inline] |
Compact the non-zero elements of an array.
Given an input array d_in, compactArray() outputs a compacted version which does not have null (zero) elements. Also ouputs the number of non-zero elements in the compacted array. Called by cudppCompactDispatch().
The algorithm is straightforward, involving two steps (most of the complexity is hidden in scan, invoked with cudppScanDispatch() ).
| [out] | d_out | Array of compacted non-null elements |
| [out] | d_numValidElements | Pointer to unsigned int to store number of non-null elements |
| [in] | d_in | Input array |
| [out] | d_isValid | Array of flags, 1 for each non-null element, 0 for each null element. Same length as d_in |
| [in] | numElements | Number of elements in input array |
| [in] | plan | Pointer to the plan object used for this compact |
| void allocCompactStorage | ( | CUDPPCompactPlan * | plan | ) |
Allocate intermediate arrays used by cudppCompact().
In addition to the internal CUDPPScanPlan contained in CUDPPCompactPlan, CUDPPCompact also needs a temporary device array of output indices, which is allocated by this function.
| plan | Pointer to CUDPPCompactPlan object within which intermediate storage is allocated. |
| void freeCompactStorage | ( | CUDPPCompactPlan * | plan | ) |
Deallocate intermediate storage used by cudppCompact().
Deallocates the output indices array allocated by allocCompactStorage().
| plan | Pointer to CUDPPCompactPlan object initialized by allocCompactStorage(). |
| void cudppCompactDispatch | ( | void * | d_out, | |
| size_t * | d_numValidElements, | |||
| const void * | d_in, | |||
| const unsigned int * | d_isValid, | |||
| size_t | numElements, | |||
| const CUDPPCompactPlan * | plan | |||
| ) |
Dispatch compactArray for the specified datatype.
A thin wrapper on top of compactArray which calls compactArray() for the data type specified in config. This is the app-level interface to compact used by cudppCompact().
| [out] | d_out | Compacted array of non-zero elements |
| [out] | d_numValidElements | Pointer to an unsigned int to store the number of non-zero elements |
| [in] | d_in | Input array |
| [in] | d_isValid | Array of boolean valid flags with same length as d_in |
| [in] | numElements | Number of elements to compact |
| [in] | plan | Pointer to plan object for this compact |
| void scanArrayRecursive | ( | T * | d_out, | |
| const T * | d_in, | |||
| T ** | d_blockSums, | |||
| size_t | numElements, | |||
| size_t | numRows, | |||
| const size_t * | rowPitches, | |||
| int | level | |||
| ) | [inline] |
Perform recursive scan on arbitrary size arrays.
This is the CPU-side workhorse function of the scan engine. This function invokes the CUDA kernels which perform the scan on individual blocks.
Scans of large arrays must be split (possibly recursively) into a hierarchy of block scans, where each block is scanned by a single CUDA thread block. At each recursive level of the scanArrayRecursive first invokes a kernel to scan all blocks of that level, and if the level has more than one block, it calls itself recursively. On returning from each recursive level, the total sum of each block from the level below is added to all elements of the corresponding block in this level. See "Parallel Prefix Sum (Scan) in CUDA" for more information (see References ).
Template parameter T is the datatype; isBackward specifies backward or forward scan; isExclusive specifies exclusive or inclusive scan, and op specifies the binary associative operator to be used.
| [out] | d_out | The output array for the scan results |
| [in] | d_in | The input array to be scanned |
| [out] | d_blockSums | Array of arrays of per-block sums (one array per recursive level, allocated by allocScanStorage()) |
| [in] | numElements | The number of elements in the array to scan |
| [in] | numRows | The number of rows in the array to scan |
| [in] | rowPitches | Array of row pitches (one array per recursive level, allocated by allocScanStorage()) |
| [in] | level | The current recursive level of the scan |
| void allocScanStorage | ( | CUDPPScanPlan * | plan | ) |
Allocate intermediate arrays used by scan.
Scans of large arrays must be split (possibly recursively) into a hierarchy of block scans, where each block is scanned by a single CUDA thread block. At each recursive level of the scan, we need an array in which to store the total sums of all blocks in that level. This function computes the amount of storage needed and allocates it.
| plan | Pointer to CUDPPScanPlan object containing options and number of elements, which is used to compute storage requirements, and within which intermediate storage is allocated. |
| void freeScanStorage | ( | CUDPPScanPlan * | plan | ) |
Deallocate intermediate block sums arrays in a CUDPPScanPlan object.
These arrays must have been allocated by allocScanStorage(), which is called by the constructor of cudppScanPlan().
| plan | Pointer to CUDPPScanPlan object initialized by allocScanStorage(). |
| void cudppScanDispatch | ( | void * | d_out, | |
| const void * | d_in, | |||
| size_t | numElements, | |||
| size_t | numRows, | |||
| const CUDPPScanPlan * | plan | |||
| ) |
Dispatch function to perform a scan (prefix sum) on an array with the specified configuration.
This is the dispatch routine which calls scanArrayRecursive() with appropriate template parameters and arguments to achieve the scan as specified in plan.
| [out] | d_out | The output array of scan results |
| [in] | d_in | The input array |
| [in] | numElements | The number of elements to scan |
| [in] | numRows | The number of rows to scan in parallel |
| [in] | plan | Pointer to CUDPPScanPlan object containing scan options and intermediate storage |
| void segmentedScanArrayRecursive | ( | T * | d_out, | |
| const T * | d_idata, | |||
| const unsigned int * | d_iflags, | |||
| T ** | d_blockSums, | |||
| unsigned int ** | d_blockFlags, | |||
| unsigned int ** | d_blockIndices, | |||
| int | numElements, | |||
| int | level | |||
| ) | [inline] |
Perform recursive scan on arbitrary size arrays.
This is the CPU-side workhorse function of the segmented scan engine. This function invokes the CUDA kernels which perform the segmented scan on individual blocks.
Scans of large arrays must be split (possibly recursively) into a hierarchy of block scans, where each block is scanned by a single CUDA thread block. At each recursive level of the segmentedScanArrayRecursive first invokes a kernel to scan all blocks of that level, and if the level has more than one block, it calls itself recursively. On returning from each recursive level, the total sum of each block from the level below is added to all elements of the first segment of the corresponding block in this level.
Template parameter T is the data type of the input data. Template parameter op is the binary operator of the segmented scan. Template parameter isBackward specifies whether the direction is backward (not implemented). It is forward if it is false. Template parameter isExclusive specifies whether the segmented scan is exclusive (true) or inclusive (false).
| [out] | d_out | The output array for the segmented scan results |
| [in] | d_idata | The input array to be scanned |
| [in] | d_iflags | The input flags vector which specifies the segments. The first element of a segment is marked by a 1 in the corresponding position in d_iflags vector. All other elements of d_iflags is 0. |
| [out] | d_blockSums | Array of arrays of per-block sums (one array per recursive level, allocated by allocScanStorage()) |
| [out] | d_blockFlags | Array of arrays of per-block OR-reductions of flags (one array per recursive level, allocated by allocScanStorage()) |
| [out] | d_blockIndices | Array of arrays of per-block min-reductions of indices (one array per recursive level, allocated by allocSegmentedScanStorage()). An index for a particular position i in a block is calculated as - if d_iflags[i] is set then it is the 1-based index of that position (i.e if d_iflags[10] is set then index is 11) otherwise the index is INT_MAX (the identity element of a min operator) |
| [in] | numElements | The number of elements in the array to scan |
| [in] | level | The current recursive level of the scan |
| void allocSegmentedScanStorage | ( | CUDPPSegmentedScanPlan * | plan | ) |
Allocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class.
Segmented scans of large arrays must be split (possibly recursively) into a hierarchy of block segmented scans, where each block is scanned by a single CUDA thread block. At each recursive level of the scan, we need an array in which to store the total sums of all blocks in that level. Also at this level we have two more arrays - one which contains the OR-reductions of flags of all blocks at that level and the second which contains the min-reductions of indices of all blocks at that levels This function computes the amount of storage needed and allocates it.
| [in] | plan | Pointer to CUDPPSegmentedScanPlan object containing segmented scan options and number of elements, which is used to compute storage requirements. |
| void freeSegmentedScanStorage | ( | CUDPPSegmentedScanPlan * | plan | ) |
Deallocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class.
These arrays must have been allocated by allocSegmentedScanStorage(), which is called by the constructor of CUDPPSegmentedScanPlan.
| [in] | plan | CUDPPSegmentedScanPlan class initialized by its constructor. |
| void cudppSegmentedScanDispatch | ( | void * | d_out, | |
| const void * | d_idata, | |||
| const unsigned int * | d_iflags, | |||
| int | numElements, | |||
| const CUDPPSegmentedScanPlan * | plan | |||
| ) |
Dispatch function to perform a scan (prefix sum) on an array with the specified configuration.
This is the dispatch routine which calls segmentedScanArrayRecursive() with appropriate template parameters and arguments to achieve the scan as specified in plan.
| [in] | numElements | The number of elements to scan |
| [in] | plan | Segmented Scan configuration (plan), initialized by CUDPPSegmentedScanPlan constructor |
| [in] | d_idata | The input array |
| [in] | d_iflags | The input flags array |
| [out] | d_out | The output array of segmented scan results |
| void populateSortPlan | ( | CUDPPSortPlan * | plan | ) | [inline] |
From the programmer-specified sort configuration, generate the internal CUDPPSortConfigInternal structure for performing the sort.
Calculates the block size and number of blocks from the total number of elements and the maximum threads in each block. Called by allocSortStorage().
| plan | Pointer to CUDPPSortPlan object |
| void allocSortStorage | ( | CUDPPSortPlan * | plan | ) |
Allocate intermediate storage for sort.
| plan | Pointer to CUDPPSortPlan object with sort configuration used to compute necessary storage and within which space will be allocated. |
| void freeSortStorage | ( | CUDPPSortPlan * | plan | ) |
Deallocate intermediate storage for sort.
| plan | Pointer to CUDPPSortPlan object which contains intermediate storage to be freed. |
| __host__ void sortChunks | ( | T * | d_out, | |
| const T * | d_in, | |||
| const CUDPPSortPlan * | plan | |||
| ) | [inline] |
Sorts input in chunks.
The chunk size automatically determined by populateSortConfig(). Each chunk is individually sorted, but chunks themselves are not sorted. Currently used only in radix sort.
Example: if the chunk size is 2 and the input is [ 5 3 6 2 7 1 ], the output is [ 3 5 2 6 1 7 ].
Internally, the input is divided into chunks and each chunk is sorted within a thread block where it can take advantage of shared memory. It makes one sort kernel call and expects radixsort_kernel() to properly handle chunks that are not full.
| [out] | d_out | Output data. |
| [in] | d_in | Input data. |
| [in] | plan | Configuration information for sort, created with cudppPlan(). |
| __host__ void mergeChunks | ( | T * | d_data, | |
| const CUDPPSortPlan * | plan | |||
| ) | [inline] |
Merges sorted chunks together into fully sorted output.
The chunk size is automatically determined by populateSortConfig(). Input must be individually sorted chunks.
Internally, we do log_2(total_size/initial_chunk_size) passes, pingponging between d_data and g_temp on each pass. Each pass generates half the number of chunks from the previous pass, with each chunk twice the size. If the number of passes is odd, we copy the result in g_temp back to d_data.
| [in,out] | d_data | Input and output data in global device memory (output is in the same array as input) |
| [in] | plan | Configuration information for sort, created with cudppPlan(). |
| void sortRadixGlobal | ( | T * | d_out, | |
| const T * | d_in, | |||
| size_t | numElements, | |||
| const CUDPPSortPlan * | plan | |||
| ) | [inline] |
Performs global radix sort, with no merge.
Currently only sorts unsigned ints. "Global radix sort" is a radix sort across the entire input, with one pass per bit, using a global split operation for each bit.
| [out] | d_out | Output data in global device memory |
| [in] | d_in | Input data in global device memory |
| [in] | numElements | Number of elements to be sorted |
| [in] | plan | Configuration information for sort, created with cudppPlan(). |
| void sortApp | ( | T * | d_out, | |
| const T * | d_in, | |||
| size_t | numElements, | |||
| const CUDPPSortPlan * | plan | |||
| ) | [inline] |
Main sort routine.
Templated on datatype.
| [out] | d_out | Output data in global device memory |
| [in] | d_in | Input data in global device memory. |
| [in] | numElements | Number of elements to be sorted. |
| [in] | plan | Configuration information for sort, created with cudppPlan(). |
| void cudppSortDispatch | ( | void * | d_out, | |
| const void * | d_in, | |||
| size_t | numElements, | |||
| const CUDPPSortPlan * | plan | |||
| ) |
Dispatch sortApp() for the specified datatype.
Expects data to be in d_in on device. Outputs to d_out. Dispatches to templated sortApp() depending on datatype in sc.
| [out] | d_out | Output data in global device memory. |
| [in] | d_in | Input data in global device memory. |
| [in] | numElements | Number of elements in d_in to sort. |
| [in] | plan | Configuration information for sort, created with cudppPlan(). |
| void sparseMatrixVectorMultiply | ( | T * | d_y, | |
| const T * | d_x, | |||
| const CUDPPSparseMatrixVectorMultiplyPlan * | plan | |||
| ) | [inline] |
Perform matrix-vector multiply for sparse matrices and vectors of arbitrary size.
This function performs the sparse matrix-vector multiply by executing four steps.
1. The sparseMatrixVectorFetchAndMultiply() kernel does an element-wise multiplication of a each element e in CUDPPSparseMatrixVectorMultiplyPlan::m_d_A with the corresponding (i.e. in the same row as the column index of e in CUDPPSparseMatrixVectorMultiplyPlan::m_d_A) element in d_x and stores the product in CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod. It also sets all elements of CUDPPSparseMatrixVectorMultiplyPlan::m_d_flags to 0.
2. The sparseMatrixVectorSetFlags() kernel iterates over each element in CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowIndex and sets the corresponding position (indicated by CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowIndex) in CUDPPSparseMatrixVectorMultiplyPlan::m_d_flags to 1.
3. Perform a segmented scan on CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod with CUDPPSparseMatrixVectorMultiplyPlan::m_d_flags as the flag vector. The output is stored in CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod.
4. The yGather() kernel goes over each element in CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowFinalIndex and picks the corresponding element (indicated by CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowFinalIndex) element from CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod and stores it in d_y.
| [out] | d_y | The output array for the sparse matrix-vector multiply (y vector) |
| [in] | d_x | The input x vector |
| [in] | plan | Pointer to the CUDPPSparseMatrixVectorMultiplyPlan object which stores the configuration and pointers to temporary buffers needed by this routine |
| void allocSparseMatrixVectorMultiplyStorage | ( | CUDPPSparseMatrixVectorMultiplyPlan * | plan, | |
| const void * | A, | |||
| const unsigned int * | rowindx, | |||
| const unsigned int * | indx | |||
| ) |
Allocate intermediate product, flags and rowFindx (index of the last element of each row) array .
| [in] | plan | Pointer to CUDPPSparseMatrixVectorMultiplyPlan class containing sparse matrix-vector multiply options, number of non-zero elements and number of rows which is used to compute storage requirements |
| [in] | A | The matrix A |
| [in] | rowindx | The indices of elements in A which are the first element of their row |
| [in] | indx | The column number for each element in A |
| void freeSparseMatrixVectorMultiplyStorage | ( | CUDPPSparseMatrixVectorMultiplyPlan * | plan | ) |
Deallocate intermediate product, flags and rowFindx (index of the last element of each row) array .
These arrays must have been allocated by allocSparseMatrixVectorMultiplyStorage(), which is called by the constructor of CUDPPSparseMatrixVectorMultiplyPlan.
| [in] | plan | Pointer to CUDPPSparseMatrixVectorMultiplyPlan plan initialized by its constructor. |
| void cudppSparseMatrixVectorMultiplyDispatch | ( | void * | d_y, | |
| const void * | d_x, | |||
| const CUDPPSparseMatrixVectorMultiplyPlan * | plan | |||
| ) |
Dispatch function to perform a sparse matrix-vector multiply with the specified configuration.
This is the dispatch routine which calls sparseMatrixVectorMultiply() with appropriate template parameters and arguments
| [out] | d_y | The output vector for y = A*x |
| [in] | d_x | The x vector for y = A*x |
| [in] | plan | The sparse matrix plan and data |
1.5.5