CUDPP 1.1.1
|
This code sample demonstrates a basic usage of CUDPP for computing the parallel prefix sum of a floating point array on the GPU.
The simpleCUDPP sample is the "hello" world example for CUDPP. Its aim is to show you how to initialize, run, and shut down CUDPP functions.
The main function in simpleCUDPP.cu is runTest()
.
simpleCUDPP uses libCUTIL, a CUDA utility library used in NVIDIA CUDA SDK samples to initialize CUDA and to check for errors. runTest
starts by initializing the CUDA device and then declaring the number of elements and the array size for the arrays we plan to scan. It then allocates the host-side (CPU-side) input array, h_idata
, and initializes the data it contains with some random values between 0 and 15.
void runTest( int argc, char** argv) { CUT_DEVICE_INIT(argc, argv); unsigned int numElements = 32768; unsigned int memSize = sizeof( float) * numElements; // allocate host memory float* h_idata = (float*) malloc( memSize); // initalize the memory for (unsigned int i = 0; i < numElements; ++i) { h_idata[i] = (float) (rand() & 0xf); }
After the input data is created on the host, we allocate a device (GPU) array d_idata
and copy the input data from the host using cudaMemcpy()
. We also allocate a device array for the output results, d_odata
.
// allocate device memory float* d_idata; CUDA_SAFE_CALL( cudaMalloc( (void**) &d_idata, memSize)); // copy host memory to device CUDA_SAFE_CALL( cudaMemcpy( d_idata, h_idata, memSize, cudaMemcpyHostToDevice) ); // allocate device memory for result float* d_odata; CUDA_SAFE_CALL( cudaMalloc( (void**) &d_odata, memSize));
Next comes the real CUDPP stuff. First we have to configure CUDPP to scan our array. Configuration of algorithms in CUDPP relies on the concept of the plan. A plan is a data structure that maintains intermediate storage for the algorithm, as well as information that CUDPP may use to optimize execution of the present hardware. When invoked using cudppPlan(), the CUDPP planner takes the configuration details passed to it and generates an internal plan object. It returns a CUDPPHandle -- an opaque pointer type that is used to refer to the plan object -- that must be passed to other CUDPP functions in order to execute algorithms.
In this case we are going to do a forward exclusive float
sum-scan of numElements
elements. We tell the planner this by filling out a CUDPPConfiguration struct with the algorithm (CUDPP_SCAN), datatype (CUDPP_FLOAT), operation (CUDPP_ADD), and options (CUDPP_OPTION_FORWARD, CUDPP_OPTION_EXCLUSIVE). We then pass this config to cudppPlan along with the maximum number of elements we want to scan, numElements. Finally, we pass 1 and 0 for the numRows and rowPitch parameters, since we only want to scan a one-dimensional array. See the documentation for cudppPlan() for more details on these parameters.
CUDPPConfiguration config; config.op = CUDPP_ADD; config.datatype = CUDPP_FLOAT; config.algorithm = CUDPP_SCAN; config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE; CUDPPHandle scanplan = 0; CUDPPResult result = cudppPlan(&scanplan, config, numElements, 1, 0); if (CUDPP_SUCCESS != result) { printf("Error creating CUDPPPlan\n"); exit(-1); }
We now have a handle to our plan object in scanplan. Next, after making sure that cudppPlan() did not return an error, we put CUDPP to work by invoking cudppScan(), to which we pass our plan handle, the output and input device arrays, and the number of elements to scan.
// Run the scan cudppScan(scanplan, d_odata, d_idata, numElements);
Next, we read the results of the scan from d_odata back to the host, compute a reference solution on the CPU (computeSumScanGold()
), and compare the results for correctness.
// allocate mem for the result on host side float* h_odata = (float*) malloc( memSize); // copy result from device to host CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, memSize, cudaMemcpyDeviceToHost) ); // compute reference solution float* reference = (float*) malloc( memSize); computeSumScanGold( reference, h_idata, numElements, config); // check result CUTBoolean res = cutComparef( reference, h_odata, numElements); printf( "Test %s\n", (1 == res) ? "PASSED" : "FAILED");
Finally, we tell CUDPP to clean up the memory used for our plan object, using cudppDestroyPlan(). Then we free the host and device arrays using free()
and cudaFree
, respectively. After that we're finished.
result = cudppDestroyPlan(scanplan); if (CUDPP_SUCCESS != result) { printf("Error destroying CUDPPPlan\n"); exit(-1); }
Using CUDPP for parallel prefix sums is easy!