Algorithm Interface
CUDPP_DLL CUDPPResult	cudppScan (const CUDPPHandle planHandle, void d_out, const void d_in, size_t numElements)
	Performs a scan operation of numElements on its input in GPU memory (d_in) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle. More...

CUDPP_DLL CUDPPResult	cudppSegmentedScan (const CUDPPHandle planHandle, void d_out, const void d_idata, const unsigned int *d_iflags, size_t numElements)
	Performs a segmented scan operation of numElements on its input in GPU memory (d_idata) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle. More...

CUDPP_DLL CUDPPResult	cudppMultiScan (const CUDPPHandle planHandle, void d_out, const void d_in, size_t numElements, size_t numRows)
	Performs numRows parallel scan operations of numElements each on its input (d_in) and places the output in d_out, with the scan parameters set by config. Exactly like cudppScan except that it runs on multiple rows in parallel. More...

CUDPP_DLL CUDPPResult	cudppCompact (const CUDPPHandle planHandle, void d_out, size_t d_numValidElements, const void d_in, const unsigned int d_isValid, size_t numElements)
	Given an array d_in and an array of 1/0 flags in deviceValid, returns a compacted array in d_out of corresponding only the "valid" values from d_in. More...

CUDPP_DLL CUDPPResult	cudppReduce (const CUDPPHandle planHandle, void d_out, const void d_in, size_t numElements)
	Reduces an array to a single element using a binary associative operator. More...

CUDPP_DLL CUDPPResult	cudppRadixSort (const CUDPPHandle planHandle, void d_keys, void d_values, size_t numElements)
	Sorts key-value pairs or keys only. More...

CUDPP_DLL CUDPPResult	cudppMergeSort (const CUDPPHandle planHandle, void d_keys, void d_values, size_t numElements)
	Sorts key-value pairs or keys only. More...

CUDPP_DLL CUDPPResult	cudppStringSortAligned (const CUDPPHandle planHandle, unsigned int d_keys, unsigned int d_values, unsigned int *stringVals, size_t numElements, size_t stringArrayLength)
	Sorts strings. Keys are the first four characters of the string, and values are the addresses where the strings reside in memory (stringVals) More...

CUDPP_DLL CUDPPResult	cudppStringSort (const CUDPPHandle planHandle, unsigned char d_stringVals, unsigned int d_address, unsigned char termC, size_t numElements, size_t stringArrayLength)
	Sorts strings. Keys are the first four characters of the string, and values are the addresses where the strings reside in memory (stringVals) More...

CUDPP_DLL CUDPPResult	cudppSparseMatrixVectorMultiply (const CUDPPHandle sparseMatrixHandle, void d_y, const void d_x)
	Perform matrix-vector multiply y = A*x for arbitrary sparse matrix A and vector x. More...

CUDPP_DLL CUDPPResult	cudppRand (const CUDPPHandle planHandle, void *d_out, size_t numElements)
	Rand puts numElements random 32-bit elements into d_out. More...

CUDPP_DLL CUDPPResult	cudppRandSeed (const CUDPPHandle planHandle, unsigned int seed)
	Sets the seed used for rand. More...

CUDPP_DLL CUDPPResult	cudppTridiagonal (CUDPPHandle planHandle, void d_a, void d_b, void d_c, void d_d, void *d_x, int systemSize, int numSystems)
	Solves tridiagonal linear systems. More...

CUDPP_DLL CUDPPResult	cudppCompress (CUDPPHandle planHandle, unsigned char d_uncompressed, int d_bwtIndex, unsigned int d_histSize, unsigned int d_hist, unsigned int d_encodeOffset, unsigned int d_compressedSize, unsigned int *d_compressed, size_t numElements)
	Compresses data stream. More...

CUDPP_DLL CUDPPResult	cudppBurrowsWheelerTransform (CUDPPHandle planHandle, unsigned char d_in, unsigned char d_out, int *d_index, size_t numElements)
	Performs the Burrows-Wheeler Transform. More...

CUDPP_DLL CUDPPResult	cudppMoveToFrontTransform (CUDPPHandle planHandle, unsigned char d_in, unsigned char d_out, size_t numElements)
	Performs the Move-to-Front Transform. More...

CUDPP_DLL CUDPPResult	cudppListRank (CUDPPHandle planHandle, void d_ranked_values, void d_unranked_values, void *d_next_indices, size_t head, size_t numElements)
	Performs list ranking of linked list node values. More...

CUDPP_DLL CUDPPResult	cudppSuffixArray (CUDPPHandle planHandle, unsigned char d_in, unsigned int d_out, size_t numElements)
	Performs the Suffix Array. More...

CUDPP_DLL CUDPPResult	cudppMultiSplit (const CUDPPHandle planHandle, unsigned int d_keys, unsigned int d_values, size_t numElements, size_t numBuckets)
	Splits an array of keys and an optional array of values into a set of buckets. More...

CUDPP_DLL CUDPPResult	cudppMultiSplitCustomBucketMapper (const CUDPPHandle planHandle, unsigned int d_keys, unsigned int d_values, size_t numElements, size_t numBuckets, BucketMappingFunc bucketMappingFunc)
	Splits an array of keys and an optional array of values into a set of buckets using a custom function to map elements to buckets. More...

Library Management Interface
CUDPP_DLL CUDPPResult	cudppCreate (CUDPPHandle *theCudpp)
	Creates an instance of the CUDPP library, and returns a handle. More...

CUDPP_DLL CUDPPResult	cudppDestroy (CUDPPHandle theCudpp)
	Destroys an instance of the CUDPP library given its handle. More...

Plan Interface
CUDPP_DLL CUDPPResult	cudppPlan (const CUDPPHandle cudppHandle, CUDPPHandle *planHandle, CUDPPConfiguration config, size_t numElements, size_t numRows, size_t rowPitch)
	Create a CUDPP plan. More...

CUDPP_DLL CUDPPResult	cudppDestroyPlan (CUDPPHandle planHandle)
	Destroy a CUDPP Plan. More...

CUDPP_DLL CUDPPResult	cudppSparseMatrix (const CUDPPHandle cudppHandle, CUDPPHandle sparseMatrixHandle, CUDPPConfiguration config, size_t numNonZeroElements, size_t numRows, const void A, const unsigned int h_rowIndices, const unsigned int h_indices)
	Create a CUDPP Sparse Matrix Object. More...

CUDPP_DLL CUDPPResult	cudppDestroySparseMatrix (CUDPPHandle sparseMatrixHandle)
	Destroy a CUDPP Sparse Matrix Object. More...

Hash Table Interface
const unsigned int	CUDPP_HASH_KEY_NOT_FOUND = CudaHT::CuckooHashing::kNotFound

CUDPP_DLL CUDPPResult	cudppHashTable (CUDPPHandle cudppHandle, CUDPPHandle plan, const CUDPPHashTableConfig config)
	Creates a CUDPP hash table in GPU memory given an input hash table configuration; returns the plan for that hash table. More...

CUDPP_DLL CUDPPResult	cudppHashInsert (CUDPPHandle plan, const void d_keys, const void d_vals, size_t num)
	Inserts keys and values into a CUDPP hash table. More...

CUDPP_DLL CUDPPResult	cudppHashRetrieve (CUDPPHandle plan, const void d_keys, void d_vals, size_t num)
	Retrieves values, given keys, from a CUDPP hash table. More...

CUDPP_DLL CUDPPResult	cudppDestroyHashTable (CUDPPHandle cudppHandle, CUDPPHandle plan)
	Destroys a hash table given its handle. More...

CUDPP_DLL CUDPPResult	cudppMultivalueHashGetValuesSize (CUDPPHandle plan, unsigned int *size)
	Retrieves the size of the values array in a multivalue hash table. More...

CUDPP_DLL CUDPPResult	cudppMultivalueHashGetAllValues (CUDPPHandle plan, unsigned int **d_vals)
	Retrieves a pointer to the values array in a multivalue hash table. More...

Detailed Description

The CUDA public interface comprises the functions, structs, and enums defined in cudpp.h. Public interface functions call functions in the Application-Level interface. The public interface functions include Plan Interface functions and Algorithm Interface functions. Plan Interface functions are used for creating CUDPP Plan objects that contain configuration details, intermediate storage space, and in the case of cudppSparseMatrix(), data. The Algorithm Interface is the set of functions that do the real work of CUDPP, such as cudppScan() and cudppSparseMatrixVectorMultiply().

Function Documentation

CUDPP_DLL CUDPPResult cudppScan	(	const CUDPPHandle	planHandle,
		void *	d_out,
		const void *	d_in,
		size_t	numElements
	)

Performs a scan operation of numElements on its input in GPU memory (d_in) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle.

The input to a scan operation is an input array, a binary associative operator (like + or max), and an identity element for that operator (+'s identity is 0). The output of scan is the same size as its input. Informally, the output at each element is the result of operator applied to each input that comes before it. For instance, the output of sum-scan at each element is the sum of all the input elements before that input.

More formally, for associative operator ⊕, out_i = in₀ ⊕ in₁ ⊕ ... ⊕ in_i-1.

CUDPP supports "exclusive" and "inclusive" scans. For the ADD operator, an exclusive scan computes the sum of all input elements before the current element, while an inclusive scan computes the sum of all input elements up to and including the current element.

Before calling scan, create an internal plan using cudppPlan().

After you are finished with the scan plan, clean up with cudppDestroyPlan().

Parameters

[in]	planHandle	Handle to plan for this scan
[out]	d_out	output of scan, in GPU memory
[in]	d_in	input to scan, in GPU memory
[in]	numElements	number of elements to scan

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, cudppDestroyPlan

CUDPP_DLL CUDPPResult cudppSegmentedScan	(	const CUDPPHandle	planHandle,
		void *	d_out,
		const void *	d_idata,
		const unsigned int *	d_iflags,
		size_t	numElements
	)

Performs a segmented scan operation of numElements on its input in GPU memory (d_idata) and places the output in GPU memory (d_out), with the scan parameters specified in the plan pointed to by planHandle.

The input to a segmented scan operation is an input array of data, an input array of flags which demarcate segments, a binary associative operator (like + or max), and an identity element for that operator (+'s identity is 0). The array of flags is the same length as the input with 1 marking the the first element of a segment and 0 otherwise. The output of segmented scan is the same size as its input. Informally, the output at each element is the result of operator applied to each input that comes before it in that segment. For instance, the output of segmented sum-scan at each element is the sum of all the input elements before that input in that segment.

More formally, for associative operator ⊕, out_i = in_k ⊕ in_k+1 ⊕ ... ⊕ in_i-1. k is the index of the first element of the segment in which i lies.

We support both "exclusive" and "inclusive" variants. For a segmented sum-scan, the exclusive variant computes the sum of all input elements before the current element in that segment, while the inclusive variant computes the sum of all input elements up to and including the current element, in that segment.

Before calling segmented scan, create an internal plan using cudppPlan().

After you are finished with the scan plan, clean up with cudppDestroyPlan().

Parameters

[in]	planHandle	Handle to plan for this scan
[out]	d_out	output of segmented scan, in GPU memory
[in]	d_idata	input data to segmented scan, in GPU memory
[in]	d_iflags	input flags to segmented scan, in GPU memory
[in]	numElements	number of elements to perform segmented scan on

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, cudppDestroyPlan

CUDPP_DLL CUDPPResult cudppMultiScan	(	const CUDPPHandle	planHandle,
		void *	d_out,
		const void *	d_in,
		size_t	numElements,
		size_t	numRows
	)

Performs numRows parallel scan operations of numElements each on its input (d_in) and places the output in d_out, with the scan parameters set by config. Exactly like cudppScan except that it runs on multiple rows in parallel.

Note that to achieve good performance with cudppMultiScan one should allocate the device arrays passed to it so that all rows are aligned to the correct boundaries for the architecture the app is running on. The easy way to do this is to use cudaMallocPitch() to allocate a 2D array on the device. Use the rowPitch parameter to cudppPlan() to specify this pitch. The easiest way is to pass the device pitch returned by cudaMallocPitch to cudppPlan() via rowPitch.

Parameters

[in]	planHandle	handle to CUDPPScanPlan
[out]	d_out	output of scan, in GPU memory
[in]	d_in	input to scan, in GPU memory
[in]	numElements	number of elements (per row) to scan
[in]	numRows	number of rows to scan in parallel

Returns: CUDPPResult indicating success or error condition

See also: cudppScan, cudppPlan

CUDPP_DLL CUDPPResult cudppCompact	(	const CUDPPHandle	planHandle,
		void *	d_out,
		size_t *	d_numValidElements,
		const void *	d_in,
		const unsigned int *	d_isValid,
		size_t	numElements
	)

Given an array d_in and an array of 1/0 flags in deviceValid, returns a compacted array in d_out of corresponding only the "valid" values from d_in.

Takes as input an array of elements in GPU memory (d_in) and an equal-sized unsigned int array in GPU memory (deviceValid) that indicate which of those input elements are valid. The output is a packed array, in GPU memory, of only those elements marked as valid.

Internally, uses cudppScan.

Example:

 d_in    = [ a b c d e f ]
 deviceValid = [ 1 0 1 1 0 1 ]
 d_out   = [ a c d f ]

Todo:: [MJH] We need to evaluate whether cudppCompact should be a core member of the public interface. It's not clear to me that what the user always wants is a final compacted array. Often one just wants the array of indices to which each input element should go in the output. The split() routine used in radix sort might make more sense to expose.

Parameters

[in]	planHandle	handle to CUDPPCompactPlan
[out]	d_out	compacted output
[out]	d_numValidElements	set during cudppCompact; is set with the number of elements valid flags in the d_isValid input array
[in]	d_in	input to compact
[in]	d_isValid	which elements in d_in are valid
[in]	numElements	number of elements in d_in

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppReduce	(	const CUDPPHandle	planHandle,
		void *	d_out,
		const void *	d_in,
		size_t	numElements
	)

Reduces an array to a single element using a binary associative operator.

For example, if the operator is CUDPP_ADD, then:

1 d_in = [ 3 2 0 1 -4 5 0 -1 ]

2 d_out = [ 6 ]

If the operator is CUDPP_MIN, then:

1 d_in = [ 3 2 0 1 -4 5 0 -1 ]

2 d_out = [ -4 ]

Limits: numElements must be at least 1, and is currently limited only by the addressable memory in CUDA (and the output accuracy is limited by numerical precision).

Parameters

[in]	planHandle	handle to CUDPPReducePlan
[out]	d_out	Output of reduce (a single element) in GPU memory. Must be a pointer to an array of at least a single element.
[in]	d_in	Input array to reduce in GPU memory. Must be a pointer to an array of at least numElements elements.
[in]	numElements	the number of elements to reduce.

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan

CUDPP_DLL CUDPPResult cudppRadixSort	(	const CUDPPHandle	planHandle,
		void *	d_keys,
		void *	d_values,
		size_t	numElements
	)

Sorts key-value pairs or keys only.

Takes as input an array of keys in GPU memory (d_keys) and an optional array of corresponding values, and outputs sorted arrays of keys and (optionally) values in place. Radix sort or Merge sort is selected through the configuration (.algorithm) Key-value and key-only sort is selected through the configuration of the plan, using the options CUDPP_OPTION_KEYS_ONLY and CUDPP_OPTION_KEY_VALUE_PAIRS.

Supported key types are CUDPP_FLOAT and CUDPP_UINT. Values can be any 32-bit type (internally, values are treated only as a payload and cast to unsigned int).

Todo:: Determine if we need to provide an "out of place" sort interface.

Parameters

[in]	planHandle	handle to CUDPPSortPlan
[out]	d_keys	keys by which key-value pairs will be sorted
[in]	d_values	values to be sorted
[in]	numElements	number of elements in d_keys and d_values

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppMergeSort	(	const CUDPPHandle	planHandle,
		void *	d_keys,
		void *	d_values,
		size_t	numElements
	)

Sorts key-value pairs or keys only.

Takes as input an array of keys in GPU memory (d_keys) and an optional array of corresponding values, and outputs sorted arrays of keys and (optionally) values in place. Radix sort or Merge sort is selected through the configuration (.algorithm) Key-value and key-only sort is selected through the configuration of the plan, using the options CUDPP_OPTION_KEYS_ONLY and CUDPP_OPTION_KEY_VALUE_PAIRS.

Supported key types are CUDPP_FLOAT and CUDPP_UINT. Values can be any 32-bit type (internally, values are treated only as a payload and cast to unsigned int).

Todo:: Determine if we need to provide an "out of place" sort interface.

Parameters

[in]	planHandle	handle to CUDPPSortPlan
[out]	d_keys	keys by which key-value pairs will be sorted
[in]	d_values	values to be sorted
[in]	numElements	number of elements in d_keys and d_values

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppStringSortAligned	(	const CUDPPHandle	planHandle,
		unsigned int *	d_keys,
		unsigned int *	d_values,
		unsigned int *	stringVals,
		size_t	numElements,
		size_t	stringArrayLength
	)

Sorts strings. Keys are the first four characters of the string, and values are the addresses where the strings reside in memory (stringVals)

Takes as input an array of strings (broken up as first four chars (key), addresses (values), and the strings themselves (stringVals) aligned by 4 character and packed into a uint)

Todo:: Determine if we need to provide an "out of place" sort interface.

Parameters

[in]	planHandle	handle to CUDPPSortPlan
[in,out]	d_keys	keys (first four chars of string to be sorted)
[in,out]	d_values	addresses where the strings reside
[in]	stringVals	Packed String input, series of characters each terminated by a null
[in]	numElements	number of elements in d_keys and d_values
[in]	stringArrayLength	Length in uint of the size of stromgVals

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppStringSort	(	const CUDPPHandle	planHandle,
		unsigned char *	d_stringVals,
		unsigned int *	d_address,
		unsigned char	termC,
		size_t	numElements,
		size_t	stringArrayLength
	)

Sorts strings. Keys are the first four characters of the string, and values are the addresses where the strings reside in memory (stringVals)

Takes as input an array of strings arranged as a char* array with NULL terminating characters. This function will reformat this info into keys (first four chars) values(pointers to string array addresses) and aligned string value array.

Parameters

[in]	planHandle	handle to CUDPPSortPlan
[in]	d_stringVals	Original string input, no need for alignment or offsets.
[in]	d_address	Pointers (in order) to each strings starting location in the stringVals array
[in]	termC	Termination character used to separate strings
[in]	numElements	number of strings
[in]	stringArrayLength	Length in uint of the size of all strings

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply	(	const CUDPPHandle	sparseMatrixHandle,
		void *	d_y,
		const void *	d_x
	)

Perform matrix-vector multiply y = A*x for arbitrary sparse matrix A and vector x.

Given a matrix object handle (which has been initialized using cudppSparseMatrix()), This function multiplies the input vector d_x by the matrix referred to by sparseMatrixHandle, returning the result in d_y.

Parameters

sparseMatrixHandle	Handle to a sparse matrix object created with cudppSparseMatrix()
d_y	The output vector, y
d_x	The input vector, x

Returns: CUDPPResult indicating success or error condition

See also: cudppSparseMatrix, cudppDestroySparseMatrix

CUDPP_DLL CUDPPResult cudppRand	(	const CUDPPHandle	planHandle,
		void *	d_out,
		size_t	numElements
	)

Rand puts numElements random 32-bit elements into d_out.

Outputs numElements random values to d_out. d_out must be of type unsigned int, allocated in device memory.

The algorithm used for the random number generation is stored in planHandle. Depending on the specification of the pseudo random number generator(PRNG), the generator may have one or more seeds. To set the seed, use cudppRandSeed().

Todo:: Currently only MD5 PRNG is supported. We may provide more rand routines in the future.

Parameters

[in]	planHandle	Handle to plan for rand
[in]	numElements	number of elements in d_out.
[out]	d_out	output of rand, in GPU memory. Should be an array of unsigned integers.

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppRandSeed	(	const CUDPPHandle	planHandle,
		unsigned int	seed
	)

Sets the seed used for rand.

The seed is crucial to any random number generator as it allows a sequence of random numbers to be replicated. Since there may be multiple different rand algorithms in CUDPP, cudppRandSeed uses planHandle to determine which seed to set. Each rand algorithm has its own unique set of seeds depending on what the algorithm needs.

Parameters

[in]	planHandle	the handle to the plan which specifies which rand seed to set
[in]	seed	the value which the internal cudpp seed will be set to

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppTridiagonal	(	CUDPPHandle	planHandle,
		void *	d_a,
		void *	d_b,
		void *	d_c,
		void *	d_d,
		void *	d_x,
		int	systemSize,
		int	numSystems
	)

Solves tridiagonal linear systems.

The solver uses a hybrid CR-PCR algorithm described in our papers "Fast Fast Tridiagonal Solvers on the GPU" and "A Hybrid Method for Solving Tridiagonal Systems on the GPU". (See the References bibliography). Please refer to the papers for a complete description of the basic CR (Cyclic Reduction) and PCR (Parallel Cyclic Reduction) algorithms and their hybrid variants.

Both float and double data types are supported.
Both power-of-two and non-power-of-two system sizes are supported.
The maximum system size could be limited by the maximum number of threads of a CUDA block, the number of registers per multiprocessor, and the amount of shared memory available. For example, on the GTX 280 GPU, the maximum system size is 512 for the float datatype, and 256 for the double datatype, which is limited by the size of shared memory in this case.
The maximum number of systems is 65535, that is the maximum number of one-dimensional blocks that could be launched in a kernel call. Users could launch the kernel multiple times to solve more systems if required.

Parameters

[out]	d_x	Solution vector
[in]	planHandle	Handle to plan for tridiagonal solver
[in]	d_a	Lower diagonal
[in]	d_b	Main diagonal
[in]	d_c	Upper diagonal
[in]	d_d	Right hand side
[in]	systemSize	The size of the linear system
[in]	numSystems	The number of systems to be solved

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppCompress	(	CUDPPHandle	planHandle,
		unsigned char *	d_uncompressed,
		int *	d_bwtIndex,
		unsigned int *	d_histSize,
		unsigned int *	d_hist,
		unsigned int *	d_encodeOffset,
		unsigned int *	d_compressedSize,
		unsigned int *	d_compressed,
		size_t	numElements
	)

Compresses data stream.

Performs compression using a three stage pipeline consisting of the Burrows-Wheeler transform, the move-to-front transform, and Huffman encoding. The compression algorithms are described in our paper "Parallel Lossless Data Compression on the GPU". (See the References bibliography).

Only unsigned char type is supported.
Currently, the input stream (d_uncompressed) must be a buffer of 1,048,576 (uchar) elements (~1MB).
The BWT Index (d_bwtIndex) is an integer number (int). This is used during the reverse-BWT stage.
The Histogram size pointer (d_histSize) can be ignored and can be passed a null pointer.
The Histrogram (d_hist) is a 256-entry (unsigned int) buffer. The histogram is used to construct the Huffman tree during decoding.
The Encoded offset table (d_encodeOffset) is a 256-entry (unsigned int) buffer. Since the input stream is compressed in blocks of 4096 characters, the offset table gives the starting offset of where each block starts in the compressed data (d_compressedSize). The very first uint at each starting offset gives the size (in words) of that corresponding compressed block. This allows us to decompress each 4096 character-block in parallel.
The size of compressed data (d_compressedSize) is a uint and gives the final size (in words) of the compressed data.
The compress data stream (d_compressed) is a uint buffer. The user should allocate enough memory for worst-case (no compression occurs).
numElements is a uint and must be set to 1048576.

Parameters

[out]	d_bwtIndex	BWT Index (int)
[out]	d_histSize	Histogram size (ignored, null ptr)
[out]	d_hist	Histogram (256-entry, uint)
[out]	d_encodeOffset	Encoded offset table (256-entry, uint)
[out]	d_compressedSize	Size of compressed data (uint)
[out]	d_compressed	Compressed data
[in]	planHandle	Handle to plan for compressor
[in]	d_uncompressed	Uncompressed data
[in]	numElements	Number of elements to compress

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppBurrowsWheelerTransform	(	CUDPPHandle	planHandle,
		unsigned char *	d_in,
		unsigned char *	d_out,
		int *	d_index,
		size_t	numElements
	)

Performs the Burrows-Wheeler Transform.

Performs a parallel Burrows-Wheeler transform on 1,048,576 elements. The BWT leverages a string-sort algorithm based on merge-sort.

Currently, the BWT can only be performed on 1,048,576 (uchar) elements.
The transformed string is written to d_x.
The BWT index (used during the reverse-BWT) is recorded as an int in d_index.

Parameters

[in]	planHandle	Handle to plan for BWT
[out]	d_in	BWT Index
[out]	d_out	Output data
[in]	d_index	Input data
[in]	numElements	Number of elements

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppMoveToFrontTransform	(	CUDPPHandle	planHandle,
		unsigned char *	d_in,
		unsigned char *	d_out,
		size_t	numElements
	)

Performs the Move-to-Front Transform.

Performs a parallel move-to-front transform on 1,048,576 elements. The MTF uses a scan-based algorithm to parallelize the computation. The MTF uses a scan-based algorithm described in our paper "Parallel Lossless Data Compression on the GPU". (See the References bibliography).

Currently, the MTF can only be performed on 1,048,576 (uchar) elements.
The transformed string is written to d_mtfOut.

Parameters

[in]	planHandle	Handle to plan for MTF
[out]	d_out	Output data
[in]	d_in	Input data
[in]	numElements	Number of elements

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppListRank	(	CUDPPHandle	planHandle,
		void *	d_ranked_values,
		void *	d_unranked_values,
		void *	d_next_indices,
		size_t	head,
		size_t	numElements
	)

Performs list ranking of linked list node values.

Performs parallel list ranking on values of a linked-list using a pointer-jumping algorithm.

Takes as input an array of values in GPU memory (d_unranked_values) and an equal-sized int array in GPU memory (d_next_indices) that represents the next indices of the linked list. The index of the head node (head) is given as an unsigned int. The output (d_ranked_values) is an equal-sized array, in GPU memory, that has the values ranked in-order.

Example:

 d_a     = [  f a c d b e  ]
 d_b     = [ -1 4 3 5 2 0  ]
 head    = 1
 d_x     = [ a b c d e f ]

Parameters

[in]	planHandle	Handle to plan for list ranking
[out]	d_ranked_values	Output ranked values
[in]	d_unranked_values	Input unranked values
[in]	d_next_indices	Input next indices
[in]	head	Input head node index
[in]	numElements	number of nodes

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppSuffixArray	(	CUDPPHandle	planHandle,
		unsigned char *	d_in,
		unsigned int *	d_out,
		size_t	numElements
	)

Performs the Suffix Array.

Performs a parallel suffix array using linear-time recursive skew algorithm. The SA leverages a suffix-sort algorithm based on divide and conquer.

The SA is GPU memory bounded, it needs about seven times size of input data.
Only unsigned char type is supported.
The input char array is transformed into an unsigned int array storing the key values followed by three 0s for the convinience of building triplets.
The output data is an unsigned int array storing the positions of the lexicographically sorted suffixes not including the last {0,0,0} triplet.

Parameters

[in]	planHandle	Handle to plan for CUDPPSuffixArrayPlan
[out]	d_in	Input data
[out]	d_out	Output data
[in]	numElements	Number of elements

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppMultiSplit	(	const CUDPPHandle	planHandle,
		unsigned int *	d_keys,
		unsigned int *	d_values,
		size_t	numElements,
		size_t	numBuckets
	)

Splits an array of keys and an optional array of values into a set of buckets.

Takes as input an array of keys in GPU memory (d_keys) and an optional array of corresponding values, and outputs an arrays of keys and (optionally) values in place, where the keys and values have been split into ordered buckets. Key-value or key-only multisplit is selected through the configuration of the plan, using the options CUDPP_OPTION_KEYS_ONLY or CUDPP_OPTION_KEY_VALUE_PAIRS. The function used to map a key to a bucket is selected through the configuration option 'bucket_mapper'. The current options are:

ORDERED_CYCLIC_BUCKET_MAPPER (default): bucket = (key % numElements) / ((numElements + numBuckets - 1) / numBuckets);

MSB_BUCKET_MAPPER: bucket = (key >> (32 - ceil(log2(numBuckets)))) % numBuckets;

Currently, the only supported key and value type is CUDPP_UINT.

Parameters

[in]	planHandle	Handle to plan for CUDPPMultiSplitPlan
[in,out]	d_keys	keys by which key-value pairs will be split
[in,out]	d_values	values to be split
[in]	numElements	number of elements in d_keys and d_values
[in]	numBuckets	Number of buckets

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppMultiSplitCustomBucketMapper	(	const CUDPPHandle	planHandle,
		unsigned int *	d_keys,
		unsigned int *	d_values,
		size_t	numElements,
		size_t	numBuckets,
		BucketMappingFunc	bucketMappingFunc
	)

Splits an array of keys and an optional array of values into a set of buckets using a custom function to map elements to buckets.

Takes as input an array of keys in GPU memory (d_keys) and an optional array of corresponding values, and outputs an arrays of keys and (optionally) values in place, where the keys and values have been split into ordered buckets. Key-value or key-only multisplit is selected through the configuration of the plan, using the options CUDPP_OPTION_KEYS_ONLY or CUDPP_OPTION_KEY_VALUE_PAIRS. To use this function, the configuration option 'bucket_mapper' must be set to CUSTOM_BUCKET_MAPPER. This option lets the library know to use the custom function pointer, specified in the last argument, when assigning an element to a bucket. The user specified bucket mapper must be a function pointer to a device function that takes one unsigned int argument (the element) and returns an unsigned int (the bucket).

Currently, the only supported key and value type is CUDPP_UINT.

Parameters

[in]	planHandle	Handle to plan for BWT
[in,out]	d_keys	Input data
[in,out]	d_values	Output data
[in]	numElements	Number of elements
[in]	numBuckets	Number of buckets
[in]	bucketMappingFunc	function that maps an element to a bucket

Returns: CUDPPResult indicating success or error condition

See also: cudppPlan, CUDPPConfiguration, CUDPPAlgorithm

CUDPP_DLL CUDPPResult cudppCreate ( CUDPPHandle * theCudpp )

Creates an instance of the CUDPP library, and returns a handle.

cudppCreate() must be called before any other CUDPP function. In a multi-GPU application that uses multiple CUDA context, cudppCreate() must be called once for each CUDA context. Each call returns a different handle, because each CUDA context (and the host thread that owns it) must use a separate instance of the CUDPP library.

Parameters

[in,out] theCudpp a pointer to the CUDPPHandle for the created CUDPP instance.

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppDestroy ( CUDPPHandle theCudpp )

Destroys an instance of the CUDPP library given its handle.

cudppDestroy() should be called once for each handle created using cudppCreate(), to ensure proper resource cleanup of all library instances.

Parameters

[in] theCudpp the handle to the CUDPP instance to destroy.

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppPlan	(	const CUDPPHandle	cudppHandle,
		CUDPPHandle *	planHandle,
		CUDPPConfiguration	config,
		size_t	numElements,
		size_t	numRows,
		size_t	rowPitch
	)

Create a CUDPP plan.

A plan is a data structure containing state and intermediate storage space that CUDPP uses to execute algorithms on data. A plan is created by passing to cudppPlan() a CUDPPConfiguration that specifies the algorithm, operator, datatype, and options. The size of the data must also be passed to cudppPlan(), in the numElements, numRows, and rowPitch arguments. These sizes are used to allocate internal storage space at the time the plan is created. The CUDPP planner may use the sizes, options, and information about the present hardware to choose optimal settings.

Note that numElements is the maximum size of the array to be processed with this plan. That means that a plan may be re-used to process (for example, to sort or scan) smaller arrays.

Parameters

[out]	planHandle	A pointer to an opaque handle to the internal plan
[in]	cudppHandle	A handle to an instance of the CUDPP library used for resource management
[in]	config	The configuration struct specifying algorithm and options
[in]	numElements	The maximum number of elements to be processed
[in]	numRows	The number of rows (for 2D operations) to be processed
[in]	rowPitch	The pitch of the rows of input data, in elements

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppDestroyPlan ( CUDPPHandle planHandle )

Destroy a CUDPP Plan.

Deletes the plan referred to by planHandle and all associated internal storage.

Parameters

[in] planHandle The CUDPPHandle to the plan to be destroyed

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppSparseMatrix	(	const CUDPPHandle	cudppHandle,
		CUDPPHandle *	sparseMatrixHandle,
		CUDPPConfiguration	config,
		size_t	numNonZeroElements,
		size_t	numRows,
		const void *	A,
		const unsigned int *	h_rowIndices,
		const unsigned int *	h_indices
	)

Create a CUDPP Sparse Matrix Object.

The sparse matrix plan is a data structure containing state and intermediate storage space that CUDPP uses to perform sparse matrix dense vector multiply. This plan is created by passing to CUDPPSparseMatrixVectorMultiplyPlan() a CUDPPConfiguration that specifies the algorithm (sprarse matrix-dense vector multiply) and datatype, along with the sparse matrix itself in CSR format. The number of non-zero elements in the sparse matrix must also be passed as numNonZeroElements. This is used to allocate internal storage space at the time the sparse matrix plan is created.

Parameters

[out]	sparseMatrixHandle	A pointer to an opaque handle to the sparse matrix object
[in]	cudppHandle	A handle to an instance of the CUDPP library used for resource management
[in]	config	The configuration struct specifying algorithm and options
[in]	numNonZeroElements	The number of non zero elements in the sparse matrix
[in]	numRows	This is the number of rows in y, x and A for y = A * x
[in]	A	The matrix data
[in]	h_rowIndices	An array containing the index of the start of each row in A
[in]	h_indices	An array containing the index of each nonzero element in A

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppDestroySparseMatrix ( CUDPPHandle sparseMatrixHandle )

Destroy a CUDPP Sparse Matrix Object.

Deletes the sparse matrix data and plan referred to by sparseMatrixHandle and all associated internal storage.

Parameters

[in] sparseMatrixHandle The CUDPPHandle to the matrix object to be destroyed

Returns: CUDPPResult indicating success or error condition

CUDPP_DLL CUDPPResult cudppHashTable	(	CUDPPHandle	cudppHandle,
		CUDPPHandle *	plan,
		const CUDPPHashTableConfig *	config
	)

Creates a CUDPP hash table in GPU memory given an input hash table configuration; returns the plan for that hash table.

Requires a CUDPPHandle for the CUDPP instance (to ensure thread safety); call cudppCreate() to get this handle.

The hash table implementation requires hardware capability 2.0 or higher (64-bit atomic operations).

Hash table types and input parameters are discussed in CUDPPHashTableType and CUDPPHashTableConfig.

After you are finished with the hash table, clean up with cudppDestroyHashTable().

See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

Parameters

[in]	cudppHandle	Handle to CUDPP instance
[out]	plan	Handle to hash table instance
[in]	config	Configuration for hash table to be created

Returns: CUDPPResult indicating if creation was successful

See also: cudppCreate, cudppDestroyHashTable, CUDPPHashTableType, CUDPPHashTableConfig, Overview of CUDPP hash tables

CUDPP_DLL CUDPPResult cudppHashInsert	(	CUDPPHandle	plan,
		const void *	d_keys,
		const void *	d_vals,
		size_t	num
	)

Inserts keys and values into a CUDPP hash table.

Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to create the hash table and get this handle.

d_keys and d_values should be in GPU memory. These should be pointers to arrays of unsigned ints.

Calls HashTable::Build internally.

See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

Parameters

[in]	plan	Handle to hash table instance
[in]	d_keys	GPU pointer to keys to be inserted
[in]	d_vals	GPU pointer to values to be inserted
[in]	num	Number of keys/values to be inserted

Returns: CUDPPResult indicating if insertion was successful

See also: cudppHashTable, cudppHashRetrieve, HashTable::Build, CompactingHashTable::Build, MultivalueHashTable::Build, Overview of CUDPP hash tables

CUDPP_DLL CUDPPResult cudppHashRetrieve	(	CUDPPHandle	plan,
		const void *	d_keys,
		void *	d_vals,
		size_t	num
	)

Retrieves values, given keys, from a CUDPP hash table.

Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to create the hash table and get this handle.

d_keys and d_values should be in GPU memory. These should be pointers to arrays of unsigned ints.

Calls HashTable::Retrieve internally.

See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

Parameters

[in]	plan	Handle to hash table instance
[in]	d_keys	GPU pointer to keys to be retrieved
[out]	d_vals	GPU pointer to values to be retrieved
[in]	num	Number of keys/values to be retrieved

Returns: CUDPPResult indicating if retrieval was successful

See also: cudppHashTable, cudppHashBuild, HashTable::Retrieve, CompactingHashTable::Retrieve, MultivalueHashTable::Retrieve, Overview of CUDPP hash tables

CUDPP_DLL CUDPPResult cudppDestroyHashTable	(	CUDPPHandle	cudppHandle,
		CUDPPHandle	plan
	)

Destroys a hash table given its handle.

Requires a CUDPPHandle for the CUDPP instance (to ensure thread safety); call cudppCreate() to get this handle.

Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to get this handle.

See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

Parameters

[in]	cudppHandle	Handle to CUDPP instance
[in]	plan	Handle to hash table instance

Returns: CUDPPResult indicating if destruction was successful

See also: cudppHashTable, Overview of CUDPP hash tables

CUDPP_DLL CUDPPResult cudppMultivalueHashGetValuesSize	(	CUDPPHandle	plan,
		unsigned int *	size
	)

Retrieves the size of the values array in a multivalue hash table.

Only relevant for multivalue hash tables.

Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to get this handle.

See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

Parameters

[in]	plan	Handle to hash table instance
[out]	size	Pointer to size of multivalue hash table

Returns: CUDPPResult indicating if operation was successful

See also: cudppHashTable, cudppMultivalueHashGetAllValues, Overview of CUDPP hash tables

CUDPP_DLL CUDPPResult cudppMultivalueHashGetAllValues	(	CUDPPHandle	plan,
		unsigned int **	d_vals
	)

Retrieves a pointer to the values array in a multivalue hash table.

Only relevant for multivalue hash tables.

Requires a CUDPPHandle for the hash table instance; call cudppHashTable() to get this handle.

See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

Parameters

[in]	plan	Handle to hash table instance
[out]	d_vals	Pointer to pointer of values (in GPU memory)

Returns: CUDPPResult indicating if operation was successful

See also: cudppHashTable, cudppMultivalueHashGetValuesSize, Overview of CUDPP hash tables

Algorithm Interface

Library Management Interface

Plan Interface

Hash Table Interface

Detailed Description

Function Documentation