CUDPP: A Simple CUDPP Example

This code sample demonstrates a basic usage of CUDPP for computing the parallel prefix sum of a floating point array on the GPU.

Sample Code Walkthrough

The simpleCUDPP sample is the "hello" world example for CUDPP. Its aim is to show you how to initialize, run, and shut down CUDPP functions.

The main function in simpleCUDPP.cu is runTest().

runTest()

runTest starts by initializing the CUDA device and then declaring the number of elements and the array size for the arrays we plan to scan. It then allocates the host-side (CPU-side) input array, h_idata, and initializes the data it contains with some random values between 0 and 15.

void
runTest( int argc, char** argv) 
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        fprintf(stderr, "error: no devices supporting CUDA.\n");
        exit(EXIT_FAILURE);
    }
    int dev = 0;
    if (argc > 1) {
        std::string arg = argv[1];
        size_t pos = arg.find("=");
        if (arg.find("device") && pos != std::string::npos) {
            dev = atoi(arg.c_str() + (pos + 1));
        }
    }
    if (dev < 0) dev = 0;
    if (dev > deviceCount-1) dev = deviceCount - 1;
    cudaSetDevice(dev);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
    {
        printf("Using device %d:\n", dev);
        printf("%s; global mem: %dB; compute v%d.%d; clock: %d kHz\n",
               prop.name, (int)prop.totalGlobalMem, (int)prop.major, 
               (int)prop.minor, (int)prop.clockRate);
    }

After the input data is created on the host, we allocate a device (GPU) array d_idata and copy the input data from the host using cudaMemcpy(). We also allocate a device array for the output results, d_odata.

    unsigned int numElements = 32768;
    unsigned int memSize = sizeof( float) * numElements;

    // allocate host memory
    float* h_idata = (float*) malloc( memSize);
    // initalize the memory
    for (unsigned int i = 0; i < numElements; ++i) 
    {
        h_idata[i] = (float) (rand() & 0xf);
    }

    // allocate device memory
    float* d_idata;
    cudaError_t result = cudaMalloc( (void**) &d_idata, memSize);
    if (result != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(result));
        exit(-1);
    }
    
    // copy host memory to device
    result = cudaMemcpy( d_idata, h_idata, memSize, cudaMemcpyHostToDevice);
    if (result != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(result));
        exit(-1);
    }
     
    // allocate device memory for result
    float* d_odata;
    result = cudaMalloc( (void**) &d_odata, memSize);
    if (result != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(result));
        exit(-1);
    }

CUDPP Initialization

Before we use the CUDPP library, we must initialize it. We do this by calling cudppCreate(), which returns a handle to the CUDPP library object. This handle must be used when creating CUDPP plans (described next), so it must be held by the calling thread until it is ready to shut down CUDPP.

    // Initialize the CUDPP Library
    CUDPPHandle theCudpp;
    cudppCreate(&theCudpp);

CUDPP Plans

Next comes the real CUDPP stuff. First we have to configure CUDPP to scan our array. Configuration of algorithms in CUDPP relies on the concept of the plan. A plan is a data structure that maintains intermediate storage for the algorithm, as well as information that CUDPP may use to optimize execution of the present hardware. When invoked using cudppPlan(), the CUDPP planner takes the configuration details passed to it and generates an internal plan object. It returns a CUDPPHandle -- an opaque pointer type that is used to refer to the plan object -- that must be passed to other CUDPP functions in order to execute algorithms.

In this case we are going to do a forward exclusive float sum-scan of numElements elements. We tell the planner this by filling out a CUDPPConfiguration struct with the algorithm (CUDPP_SCAN), datatype (CUDPP_FLOAT), operation (CUDPP_ADD), and options (CUDPP_OPTION_FORWARD, CUDPP_OPTION_EXCLUSIVE). We then pass this config to cudppPlan along with the maximum number of elements we want to scan, numElements. Finally, we pass 1 and 0 for the numRows and rowPitch parameters, since we only want to scan a one-dimensional array. See the documentation for cudppPlan() for more details on these parameters.

    CUDPPConfiguration config;
    config.op = CUDPP_ADD;
    config.datatype = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;
    
    CUDPPHandle scanplan = 0;
    CUDPPResult res = cudppPlan(theCudpp, &scanplan, config, numElements, 1, 0);  

    if (CUDPP_SUCCESS != res)
    {
        printf("Error creating CUDPPPlan\n");
        exit(-1);
    }

We now have a handle to our plan object in scanplan. Next, after making sure that cudppPlan() did not return an error, we put CUDPP to work by invoking cudppScan(), to which we pass our plan handle, the output and input device arrays, and the number of elements to scan.

    // Run the scan
    res = cudppScan(scanplan, d_odata, d_idata, numElements);
    if (CUDPP_SUCCESS != res)
    {
        printf("Error in cudppScan()\n");
        exit(-1);
    }

Next, we read the results of the scan from d_odata back to the host, compute a reference solution on the CPU (computeSumScanGold()), and compare the results for correctness.

    // allocate mem for the result on host side
    float* h_odata = (float*) malloc( memSize);
    // copy result from device to host
    result = cudaMemcpy( h_odata, d_odata, memSize, cudaMemcpyDeviceToHost);
    if (result != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(result));
        exit(-1);
    }

Finally, we tell CUDPP to clean up the memory used for our plan object, using cudppDestroyPlan(). Then we shut down the CUDPP library using cudppDestroy(), passing it the handle we received from cudppCreate(). Finally we free the host and device arrays using free() and cudaFree, respectively.

    
    // compute reference solution
    float* reference = (float*) malloc( memSize);
    computeSumScanGold( reference, h_idata, numElements, config);

    // check result
    bool passed = true;
    for (unsigned int i = 0; i < numElements; i++)
        if (reference[i] != h_odata[i]) passed = false;
        
    printf( "Test %s\n", passed ? "PASSED" : "FAILED");

    res = cudppDestroyPlan(scanplan);
    if (CUDPP_SUCCESS != res)
    {
        printf("Error destroying CUDPPPlan\n");
        exit(-1);
    }

    // shut down the CUDPP library
    cudppDestroy(theCudpp);
    
    free( h_idata);
    free( h_odata);
    free( reference);
    cudaFree(d_idata);
    cudaFree(d_odata);
}

Using CUDPP for parallel prefix sums is easy!