CUDPP 2.0
CUDA Data-Parallel Primitives Library
|
Algorithm Interface | |
CUDPP_DLL CUDPPResult | cudppScan (const CUDPPHandle planHandle, void *d_out, const void *d_in, size_t numElements) |
Performs a scan operation of numElements on its input in GPU memory (d_in) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle. | |
CUDPP_DLL CUDPPResult | cudppSegmentedScan (const CUDPPHandle planHandle, void *d_out, const void *d_idata, const unsigned int *d_iflags, size_t numElements) |
Performs a segmented scan operation of numElements on its input in GPU memory (d_idata) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle. | |
CUDPP_DLL CUDPPResult | cudppMultiScan (const CUDPPHandle planHandle, void *d_out, const void *d_in, size_t numElements, size_t numRows) |
Performs numRows parallel scan operations of numElements each on its input (d_in) and places the output in d_out, with the scan parameters set by config. Exactly like cudppScan except that it runs on multiple rows in parallel. | |
CUDPP_DLL CUDPPResult | cudppCompact (const CUDPPHandle planHandle, void *d_out, size_t *d_numValidElements, const void *d_in, const unsigned int *d_isValid, size_t numElements) |
Given an array d_in and an array of 1/0 flags in deviceValid, returns a compacted array in d_out of corresponding only the "valid" values from d_in. | |
CUDPP_DLL CUDPPResult | cudppReduce (const CUDPPHandle planHandle, void *d_out, const void *d_in, size_t numElements) |
Reduces an array to a single element using a binary associative operator. | |
CUDPP_DLL CUDPPResult | cudppSort (const CUDPPHandle planHandle, void *d_keys, void *d_values, size_t numElements) |
Sorts key-value pairs or keys only. | |
CUDPP_DLL CUDPPResult | cudppSparseMatrixVectorMultiply (const CUDPPHandle sparseMatrixHandle, void *d_y, const void *d_x) |
Perform matrix-vector multiply y = A*x for arbitrary sparse matrix A and vector x. | |
CUDPP_DLL CUDPPResult | cudppRand (const CUDPPHandle planHandle, void *d_out, size_t numElements) |
Rand puts numElements random 32-bit elements into d_out. | |
CUDPP_DLL CUDPPResult | cudppRandSeed (const CUDPPHandle planHandle, unsigned int seed) |
Sets the seed used for rand. | |
CUDPP_DLL CUDPPResult | cudppTridiagonal (CUDPPHandle planHandle, void *d_a, void *d_b, void *d_c, void *d_d, void *d_x, int systemSize, int numSystems) |
Solves tridiagonal linear systems. | |
Library Management Interface | |
CUDPP_DLL CUDPPResult | cudppCreate (CUDPPHandle *theCudpp) |
Creates an instance of the CUDPP library, and returns a handle. | |
CUDPP_DLL CUDPPResult | cudppDestroy (CUDPPHandle theCudpp) |
Destroys an instance of the CUDPP library given its handle. | |
Plan Interface | |
CUDPP_DLL CUDPPResult | cudppPlan (const CUDPPHandle cudppHandle, CUDPPHandle *planHandle, CUDPPConfiguration config, size_t numElements, size_t numRows, size_t rowPitch) |
Create a CUDPP plan. | |
CUDPP_DLL CUDPPResult | cudppDestroyPlan (CUDPPHandle planHandle) |
Destroy a CUDPP Plan. | |
CUDPP_DLL CUDPPResult | cudppSparseMatrix (const CUDPPHandle cudppHandle, CUDPPHandle *sparseMatrixHandle, CUDPPConfiguration config, size_t numNonZeroElements, size_t numRows, const void *A, const unsigned int *h_rowIndices, const unsigned int *h_indices) |
Create a CUDPP Sparse Matrix Object. | |
CUDPP_DLL CUDPPResult | cudppDestroySparseMatrix (CUDPPHandle sparseMatrixHandle) |
Destroy a CUDPP Sparse Matrix Object. | |
Hash Table Interface | |
const unsigned int | CUDPP_HASH_KEY_NOT_FOUND = CudaHT::CuckooHashing::kNotFound |
CUDPP_DLL CUDPPResult | cudppHashTable (CUDPPHandle cudppHandle, CUDPPHandle *plan, const CUDPPHashTableConfig *config) |
Creates a CUDPP hash table in GPU memory given an input hash table configuration; returns the plan for that hash table. | |
CUDPP_DLL CUDPPResult | cudppHashInsert (CUDPPHandle plan, const void *d_keys, const void *d_vals, size_t num) |
Inserts keys and values into a CUDPP hash table. | |
CUDPP_DLL CUDPPResult | cudppHashRetrieve (CUDPPHandle plan, const void *d_keys, void *d_vals, size_t num) |
Retrieves values, given keys, from a CUDPP hash table. | |
CUDPP_DLL CUDPPResult | cudppDestroyHashTable (CUDPPHandle cudppHandle, CUDPPHandle plan) |
Destroys a hash table given its handle. | |
CUDPP_DLL CUDPPResult | cudppMultivalueHashGetValuesSize (CUDPPHandle plan, unsigned int *size) |
Retrieves the size of the values array in a multivalue hash table. | |
CUDPP_DLL CUDPPResult | cudppMultivalueHashGetAllValues (CUDPPHandle plan, unsigned int **d_vals) |
Retrieves a pointer to the values array in a multivalue hash table. |
The CUDA public interface comprises the functions, structs, and enums defined in cudpp.h. Public interface functions call functions in the Application-Level interface. The public interface functions include Plan Interface functions and Algorithm Interface functions. Plan Interface functions are used for creating CUDPP Plan objects that contain configuration details, intermediate storage space, and in the case of cudppSparseMatrix(), data. The Algorithm Interface is the set of functions that do the real work of CUDPP, such as cudppScan() and cudppSparseMatrixVectorMultiply().
CUDPP_DLL CUDPPResult cudppScan | ( | const CUDPPHandle | planHandle, |
void * | d_out, | ||
const void * | d_in, | ||
size_t | numElements | ||
) |
Performs a scan operation of numElements on its input in GPU memory (d_in) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle.
The input to a scan operation is an input array, a binary associative operator (like + or max), and an identity element for that operator (+'s identity is 0). The output of scan is the same size as its input. Informally, the output at each element is the result of operator applied to each input that comes before it. For instance, the output of sum-scan at each element is the sum of all the input elements before that input.
More formally, for associative operator ⊕ , outi = in0 ⊕ in1 ⊕ ... ⊕ ini-1.
CUDPP supports "exclusive" and "inclusive" scans. For the ADD operator, an exclusive scan computes the sum of all input elements before the current element, while an inclusive scan computes the sum of all input elements up to and including the current element.
Before calling scan, create an internal plan using cudppPlan().
After you are finished with the scan plan, clean up with cudppDestroyPlan().
[in] | planHandle | Handle to plan for this scan |
[out] | d_out | output of scan, in GPU memory |
[in] | d_in | input to scan, in GPU memory |
[in] | numElements | number of elements to scan |
CUDPP_DLL CUDPPResult cudppSegmentedScan | ( | const CUDPPHandle | planHandle, |
void * | d_out, | ||
const void * | d_idata, | ||
const unsigned int * | d_iflags, | ||
size_t | numElements | ||
) |
Performs a segmented scan operation of numElements on its input in GPU memory (d_idata) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle.
The input to a segmented scan operation is an input array of data, an input array of flags which demarcate segments, a binary associative operator (like + or max), and an identity element for that operator (+'s identity is 0). The array of flags is the same length as the input with 1 marking the the first element of a segment and 0 otherwise. The output of segmented scan is the same size as its input. Informally, the output at each element is the result of operator applied to each input that comes before it in that segment. For instance, the output of segmented sum-scan at each element is the sum of all the input elements before that input in that segment.
More formally, for associative operator ⊕ , outi = ink ⊕ ink+1 ⊕ ... ⊕ ini-1. k is the index of the first element of the segment in which i lies
We support both "exclusive" and "inclusive" variants. For a segmented sum-scan, the exclusive variant computes the sum of all input elements before the current element in that segment, while the inclusive variant computes the sum of all input elements up to and including the current element, in that segment.
Before calling segmented scan, create an internal plan using cudppPlan().
After you are finished with the scan plan, clean up with cudppDestroyPlan().
[in] | planHandle | Handle to plan for this scan |
[out] | d_out | output of segmented scan, in GPU memory |
[in] | d_idata | input data to segmented scan, in GPU memory |
[in] | d_iflags | input flags to segmented scan, in GPU memory |
[in] | numElements | number of elements to perform segmented scan on |
CUDPP_DLL CUDPPResult cudppMultiScan | ( | const CUDPPHandle | planHandle, |
void * | d_out, | ||
const void * | d_in, | ||
size_t | numElements, | ||
size_t | numRows | ||
) |
Performs numRows parallel scan operations of numElements each on its input (d_in) and places the output in d_out, with the scan parameters set by config. Exactly like cudppScan except that it runs on multiple rows in parallel.
Note that to achieve good performance with cudppMultiScan one should allocate the device arrays passed to it so that all rows are aligned to the correct boundaries for the architecture the app is running on. The easy way to do this is to use cudaMallocPitch() to allocate a 2D array on the device. Use the rowPitch parameter to cudppPlan() to specify this pitch. The easiest way is to pass the device pitch returned by cudaMallocPitch to cudppPlan() via rowPitch.
[in] | planHandle | handle to CUDPPScanPlan |
[out] | d_out | output of scan, in GPU memory |
[in] | d_in | input to scan, in GPU memory |
[in] | numElements | number of elements (per row) to scan |
[in] | numRows | number of rows to scan in parallel |
CUDPP_DLL CUDPPResult cudppCompact | ( | const CUDPPHandle | planHandle, |
void * | d_out, | ||
size_t * | d_numValidElements, | ||
const void * | d_in, | ||
const unsigned int * | d_isValid, | ||
size_t | numElements | ||
) |
Given an array d_in and an array of 1/0 flags in deviceValid, returns a compacted array in d_out of corresponding only the "valid" values from d_in.
Takes as input an array of elements in GPU memory (d_in) and an equal-sized unsigned int array in GPU memory (deviceValid) that indicate which of those input elements are valid. The output is a packed array, in GPU memory, of only those elements marked as valid.
Internally, uses cudppScan.
Example:
d_in = [ a b c d e f ] deviceValid = [ 1 0 1 1 0 1 ] d_out = [ a c d f ]
[in] | planHandle | handle to CUDPPCompactPlan |
[out] | d_out | compacted output |
[out] | d_numValidElements | set during cudppCompact; is set with the number of elements valid flags in the d_isValid input array |
[in] | d_in | input to compact |
[in] | d_isValid | which elements in d_in are valid |
[in] | numElements | number of elements in d_in |
CUDPP_DLL CUDPPResult cudppReduce | ( | const CUDPPHandle | planHandle, |
void * | d_out, | ||
const void * | d_in, | ||
size_t | numElements | ||
) |
Reduces an array to a single element using a binary associative operator.
For example, if the operator is CUDPP_ADD, then:
d_in = [ 3 2 0 1 -4 5 0 -1 ] d_out = [ 6 ]
If the operator is CUDPP_MIN, then:
d_in = [ 3 2 0 1 -4 5 0 -1 ] d_out = [ -4 ]
Limits: numElements must be at least 1, and is currently limited only by the addressable memory in CUDA (and the output accuracy is limited by numerical precision).
[in] | planHandle | handle to CUDPPReducePlan |
[out] | d_out | Output of reduce (a single element) in GPU memory. Must be a pointer to an array of at least a single element. |
[in] | d_in | Input array to reduce in GPU memory. Must be a pointer to an array of at least numElements elements. |
[in] | numElements | the number of elements to reduce. |
CUDPP_DLL CUDPPResult cudppSort | ( | const CUDPPHandle | planHandle, |
void * | d_keys, | ||
void * | d_values, | ||
size_t | numElements | ||
) |
Sorts key-value pairs or keys only.
Takes as input an array of keys in GPU memory (d_keys) and an optional array of corresponding values, and outputs sorted arrays of keys and (optionally) values in place. Key-value and key-only sort is selected through the configuration of the plan, using the options CUDPP_OPTION_KEYS_ONLY and CUDPP_OPTION_KEY_VALUE_PAIRS.
Supported key types are CUDPP_FLOAT and CUDPP_UINT. Values can be any 32-bit type (internally, values are treated only as a payload and cast to unsigned int).
[in] | planHandle | handle to CUDPPSortPlan |
[out] | d_keys | keys by which key-value pairs will be sorted |
[in] | d_values | values to be sorted |
[in] | numElements | number of elements in d_keys and d_values |
CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply | ( | const CUDPPHandle | sparseMatrixHandle, |
void * | d_y, | ||
const void * | d_x | ||
) |
Perform matrix-vector multiply y = A*x for arbitrary sparse matrix A and vector x.
Given a matrix object handle (which has been initialized using cudppSparseMatrix()), This function multiplies the input vector d_x by the matrix referred to by sparseMatrixHandle, returning the result in d_y.
sparseMatrixHandle | Handle to a sparse matrix object created with cudppSparseMatrix() |
d_y | The output vector, y |
d_x | The input vector, x |
CUDPP_DLL CUDPPResult cudppRand | ( | const CUDPPHandle | planHandle, |
void * | d_out, | ||
size_t | numElements | ||
) |
Rand puts numElements random 32-bit elements into d_out.
Outputs numElements random values to d_out. d_out must be of type unsigned int, allocated in device memory.
The algorithm used for the random number generation is stored in planHandle. Depending on the specification of the pseudo random number generator(PRNG), the generator may have one or more seeds. To set the seed, use cudppRandSeed().
[in] | planHandle | Handle to plan for rand |
[in] | numElements | number of elements in d_out. |
[out] | d_out | output of rand, in GPU memory. Should be an array of unsigned integers. |
CUDPP_DLL CUDPPResult cudppRandSeed | ( | const CUDPPHandle | planHandle, |
unsigned int | seed | ||
) |
Sets the seed used for rand.
The seed is crucial to any random number generator as it allows a sequence of random numbers to be replicated. Since there may be multiple different rand algorithms in CUDPP, cudppRandSeed uses planHandle to determine which seed to set. Each rand algorithm has its own unique set of seeds depending on what the algorithm needs.
[in] | planHandle | the handle to the plan which specifies which rand seed to set |
[in] | seed | the value which the internal cudpp seed will be set to |
CUDPP_DLL CUDPPResult cudppTridiagonal | ( | CUDPPHandle | planHandle, |
void * | d_a, | ||
void * | d_b, | ||
void * | d_c, | ||
void * | d_d, | ||
void * | d_x, | ||
int | systemSize, | ||
int | numSystems | ||
) |
Solves tridiagonal linear systems.
The solver uses a hybrid CR-PCR algorithm described in our papers "Fast Fast Tridiagonal Solvers on the GPU" and "A Hybrid Method for Solving Tridiagonal Systems on the GPU". (See the References bibliography). Please refer to the papers for a complete description of the basic CR (Cyclic Reduction) and PCR (Parallel Cyclic Reduction) algorithms and their hybrid variants.
[out] | d_x | Solution vector |
[in] | planHandle | Handle to plan for tridiagonal solver |
[in] | d_a | Lower diagonal |
[in] | d_b | Main diagonal |
[in] | d_c | Upper diagonal |
[in] | d_d | Right hand side |
[in] | systemSize | The size of the linear system |
[in] | numSystems | The number of systems to be solved |
CUDPP_DLL CUDPPResult cudppCreate | ( | CUDPPHandle * | theCudpp | ) |
Creates an instance of the CUDPP library, and returns a handle.
cudppCreate() must be called before any other CUDPP function. In a multi-GPU application that uses multiple CUDA context, cudppCreate() must be called once for each CUDA context. Each call returns a different handle, because each CUDA context (and the host thread that owns it) must use a separate instance of the CUDPP library.
[in,out] | theCudpp | a pointer to the CUDPPHandle for the created CUDPP instance. |
CUDPP_DLL CUDPPResult cudppDestroy | ( | CUDPPHandle | theCudpp | ) |
Destroys an instance of the CUDPP library given its handle.
cudppDestroy() should be called once for each handle created using cudppCreate(), to ensure proper resource cleanup of all library instances.
[in] | theCudpp | the handle to the CUDPP instance to destroy. |
CUDPP_DLL CUDPPResult cudppPlan | ( | const CUDPPHandle | cudppHandle, |
CUDPPHandle * | planHandle, | ||
CUDPPConfiguration | config, | ||
size_t | numElements, | ||
size_t | numRows, | ||
size_t | rowPitch | ||
) |
Create a CUDPP plan.
A plan is a data structure containing state and intermediate storage space that CUDPP uses to execute algorithms on data. A plan is created by passing to cudppPlan() a CUDPPConfiguration that specifies the algorithm, operator, datatype, and options. The size of the data must also be passed to cudppPlan(), in the numElements, numRows, and rowPitch arguments. These sizes are used to allocate internal storage space at the time the plan is created. The CUDPP planner may use the sizes, options, and information about the present hardware to choose optimal settings.
Note that numElements is the maximum size of the array to be processed with this plan. That means that a plan may be re-used to process (for example, to sort or scan) smaller arrays.
[out] | planHandle | A pointer to an opaque handle to the internal plan |
[in] | cudppHandle | A handle to an instance of the CUDPP library used for resource management |
[in] | config | The configuration struct specifying algorithm and options |
[in] | numElements | The maximum number of elements to be processed |
[in] | numRows | The number of rows (for 2D operations) to be processed |
[in] | rowPitch | The pitch of the rows of input data, in elements |
CUDPP_DLL CUDPPResult cudppDestroyPlan | ( | CUDPPHandle | planHandle | ) |
Destroy a CUDPP Plan.
Deletes the plan referred to by planHandle and all associated internal storage.
[in] | planHandle | The CUDPPHandle to the plan to be destroyed |
CUDPP_DLL CUDPPResult cudppSparseMatrix | ( | const CUDPPHandle | cudppHandle, |
CUDPPHandle * | sparseMatrixHandle, | ||
CUDPPConfiguration | config, | ||
size_t | numNonZeroElements, | ||
size_t | numRows, | ||
const void * | A, | ||
const unsigned int * | h_rowIndices, | ||
const unsigned int * | h_indices | ||
) |
Create a CUDPP Sparse Matrix Object.
The sparse matrix plan is a data structure containing state and intermediate storage space that CUDPP uses to perform sparse matrix dense vector multiply. This plan is created by passing to CUDPPSparseMatrixVectorMultiplyPlan() a CUDPPConfiguration that specifies the algorithm (sprarse matrix-dense vector multiply) and datatype, along with the sparse matrix itself in CSR format. The number of non-zero elements in the sparse matrix must also be passed as numNonZeroElements. This is used to allocate internal storage space at the time the sparse matrix plan is created.
[out] | sparseMatrixHandle | A pointer to an opaque handle to the sparse matrix object |
[in] | cudppHandle | A handle to an instance of the CUDPP library used for resource management |
[in] | config | The configuration struct specifying algorithm and options |
[in] | numNonZeroElements | The number of non zero elements in the sparse matrix |
[in] | numRows | This is the number of rows in y, x and A for y = A * x |
[in] | A | The matrix data |
[in] | h_rowIndices | An array containing the index of the start of each row in A |
[in] | h_indices | An array containing the index of each nonzero element in A |
CUDPP_DLL CUDPPResult cudppDestroySparseMatrix | ( | CUDPPHandle | sparseMatrixHandle | ) |
Destroy a CUDPP Sparse Matrix Object.
Deletes the sparse matrix data and plan referred to by sparseMatrixHandle and all associated internal storage.
[in] | sparseMatrixHandle | The CUDPPHandle to the matrix object to be destroyed |
CUDPP_DLL CUDPPResult cudppHashTable | ( | CUDPPHandle | cudppHandle, |
CUDPPHandle * | plan, | ||
const CUDPPHashTableConfig * | config | ||
) |
Creates a CUDPP hash table in GPU memory given an input hash table configuration; returns the plan for that hash table.
Requires a CUDPPHandle for the CUDPP instance (to ensure thread safety); call cudppCreate() to get this handle.
The hash table implementation requires hardware capability 2.0 or higher (64-bit atomic operations).
Hash table types and input parameters are discussed in CUDPPHashTableType and CUDPPHashTableConfig.
After you are finished with the hash table, clean up with cudppDestroyHashTable().
See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.
[in] | cudppHandle | Handle to CUDPP instance |
[out] | plan | Handle to hash table instance |
[in] | config | Configuration for hash table to be created |
CUDPP_DLL CUDPPResult cudppHashInsert | ( | CUDPPHandle | plan, |
const void * | d_keys, | ||
const void * | d_vals, | ||
size_t | num | ||
) |
Inserts keys and values into a CUDPP hash table.
Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to create the hash table and get this handle.
d_keys and d_values should be in GPU memory. These should be pointers to arrays of unsigned ints.
Calls HashTable::Build internally.
See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.
[in] | plan | Handle to hash table instance |
[in] | d_keys | GPU pointer to keys to be inserted |
[in] | d_vals | GPU pointer to values to be inserted |
[in] | num | Number of keys/values to be inserted |
CUDPP_DLL CUDPPResult cudppHashRetrieve | ( | CUDPPHandle | plan, |
const void * | d_keys, | ||
void * | d_vals, | ||
size_t | num | ||
) |
Retrieves values, given keys, from a CUDPP hash table.
Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to create the hash table and get this handle.
d_keys and d_values should be in GPU memory. These should be pointers to arrays of unsigned ints.
Calls HashTable::Retrieve internally.
See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.
[in] | plan | Handle to hash table instance |
[in] | d_keys | GPU pointer to keys to be retrieved |
[out] | d_vals | GPU pointer to values to be retrieved |
[in] | num | Number of keys/values to be retrieved |
CUDPP_DLL CUDPPResult cudppDestroyHashTable | ( | CUDPPHandle | cudppHandle, |
CUDPPHandle | plan | ||
) |
Destroys a hash table given its handle.
Requires a CUDPPHandle for the CUDPP instance (to ensure thread safety); call cudppCreate() to get this handle.
Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to get this handle.
See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.
[in] | cudppHandle | Handle to CUDPP instance |
[in] | plan | Handle to hash table instance |
CUDPP_DLL CUDPPResult cudppMultivalueHashGetValuesSize | ( | CUDPPHandle | plan, |
unsigned int * | size | ||
) |
Retrieves the size of the values array in a multivalue hash table.
Only relevant for multivalue hash tables.
Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to get this handle.
See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.
[in] | plan | Handle to hash table instance |
[out] | size | Pointer to size of multivalue hash table |
CUDPP_DLL CUDPPResult cudppMultivalueHashGetAllValues | ( | CUDPPHandle | plan, |
unsigned int ** | d_vals | ||
) |
Retrieves a pointer to the values array in a multivalue hash table.
Only relevant for multivalue hash tables.
Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to get this handle.
See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.
[in] | plan | Handle to hash table instance |
[out] | d_vals | Pointer to pointer of values (in GPU memory) |