CUDPP: A Simple CUDPP Example

This code sample demonstrates a basic usage of CUDPP for computing the parallel prefix sum of a floating point array on the GPU.

Sample Code Walkthrough

The simpleCUDPP sample is the "hello" world example for CUDPP. Its aim is to show you how to initialize, run, and shut down CUDPP functions.

The main function in simpleCUDPP.cu is runTest().

runTest()

simpleCUDPP uses libCUTIL, a CUDA utility library used in NVIDIA CUDA SDK samples to initialize CUDA and to check for errors. runTest starts by initializing the CUDA device and then declaring the number of elements and the array size for the arrays we plan to scan. It then allocates the host-side (CPU-side) input array, h_idata, and initializes the data it contains with some random values between 0 and 15.

void
runTest( int argc, char** argv) 
{
    CUT_DEVICE_INIT(argc, argv);

    unsigned int numElements = 32768;
    unsigned int memSize = sizeof( float) * numElements;

    // allocate host memory
    float* h_idata = (float*) malloc( memSize);
    // initalize the memory
    for (unsigned int i = 0; i < numElements; ++i) 
    {
        h_idata[i] = (float) (rand() & 0xf);
    }

After the input data is created on the host, we allocate a device (GPU) array d_idata and copy the input data from the host using cudaMemcpy(). We also allocate a device array for the output results, d_odata.

    // allocate device memory
    float* d_idata;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_idata, memSize));
    // copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_idata, h_idata, memSize,
                                cudaMemcpyHostToDevice) );

    // allocate device memory for result
    float* d_odata;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_odata, memSize));

CUDPP Plans

Next comes the real CUDPP stuff. First we have to configure CUDPP to scan our array. Configuration of algorithms in CUDPP relies on the concept of the plan. A plan is a data structure that maintains intermediate storage for the algorithm, as well as information that CUDPP may use to optimize execution of the present hardware. When invoked using cudppPlan(), the CUDPP planner takes the configuration details passed to it and generates an internal plan object. It returns a CUDPPHandle -- an opaque pointer type that is used to refer to the plan object -- that must be passed to other CUDPP functions in order to execute algorithms.

In this case we are going to do a forward exclusive float sum-scan of numElements elements. We tell the planner this by filling out a CUDPPConfiguration struct with the algorithm (CUDPP_SCAN), datatype (CUDPP_FLOAT), operation (CUDPP_ADD), and options (CUDPP_OPTION_FORWARD, CUDPP_OPTION_EXCLUSIVE). We then pass this config to cudppPlan along with the maximum number of elements we want to scan, numElements. Finally, we pass 1 and 0 for the numRows and rowPitch parameters, since we only want to scan a one-dimensional array. See the documentation for cudppPlan() for more details on these parameters.

    CUDPPConfiguration config;
    config.op = CUDPP_ADD;
    config.datatype = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;
    
    CUDPPHandle scanplan = 0;
    CUDPPResult result = cudppPlan(&scanplan, config, numElements, 1, 0);  

    if (CUDPP_SUCCESS != result)
    {
        printf("Error creating CUDPPPlan\n");
        exit(-1);
    }

We now have a handle to our plan object in scanplan. Next, after making sure that cudppPlan() did not return an error, we put CUDPP to work by invoking cudppScan(), to which we pass our plan handle, the output and input device arrays, and the number of elements to scan.

    // Run the scan
    cudppScan(scanplan, d_odata, d_idata, numElements);

Next, we read the results of the scan from d_odata back to the host, compute a reference solution on the CPU (computeSumScanGold()), and compare the results for correctness.

    // allocate mem for the result on host side
    float* h_odata = (float*) malloc( memSize);
    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, memSize,
                                cudaMemcpyDeviceToHost) );
    // compute reference solution
    float* reference = (float*) malloc( memSize);
    computeSumScanGold( reference, h_idata, numElements, config);

    // check result
    CUTBoolean res = cutComparef( reference, h_odata, numElements);
    printf( "Test %s\n", (1 == res) ? "PASSED" : "FAILED");

Finally, we tell CUDPP to clean up the memory used for our plan object, using cudppDestroyPlan(). Then we free the host and device arrays using free() and cudaFree, respectively. After that we're finished.

    result = cudppDestroyPlan(scanplan);
    if (CUDPP_SUCCESS != result)
    {
        printf("Error destroying CUDPPPlan\n");
        exit(-1);
    }

Using CUDPP for parallel prefix sums is easy!