CUDPP
2.1
CUDA Data-Parallel Primitives Library
CUDPP application-level reduction routines.
#include <stdio.h>
#include "cuda_util.h"
#include "cudpp_plan.h"
#include "cudpp_util.h"
#include "kernel/reduce_kernel.cuh"
Functions

Reduce Functions
template<class T, class Oper>
void reduceBlocks (T *d_odata, const T *d_idata, size_t numElements, const CUDPPReducePlan *plan)
    Per-block reduction function.
template<class Oper, class T>
void reduceArray (T *d_odata, const T *d_idata, size_t numElements, const CUDPPReducePlan *plan)
    Array reduction function.
void allocReduceStorage (CUDPPReducePlan *plan)
    Allocate intermediate arrays used by reductions.
void freeReduceStorage (CUDPPReducePlan *plan)
    Deallocate intermediate block sums arrays in a CUDPPReducePlan object.
void cudppReduceDispatch (void *d_odata, const void *d_idata, size_t numElements, const CUDPPReducePlan *plan)
    Dispatch function to perform a parallel reduction on an array with the specified configuration.
template<class T, class Oper>
void reduceBlocks (T *d_odata, const T *d_idata, size_t numElements, const CUDPPReducePlan *plan)
Per-block reduction function.
This function dispatches the appropriate reduction kernel given the size of the blocks.
[out] | d_odata | The output data pointer. Each block writes a single output element. |
[in] | d_idata | The input data pointer. |
[in] | numElements | The number of elements to be reduced. |
[in] | plan | A pointer to the plan structure for the reduction. |
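The "one output element per block" contract can be illustrated on the CPU. The following is a minimal sketch, not the CUDPP kernel: `reduceBlocksSketch`, `AddOp`, and the `blockSize` parameter are hypothetical names introduced here, and a sequential loop stands in for what a CUDA thread block would do cooperatively.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical binary operator, standing in for the Oper template parameter.
struct AddOp {
    template <class T>
    T operator()(const T &a, const T &b) const { return a + b; }
};

// CPU stand-in for a per-block reduction: the input is cut into chunks of
// blockSize elements (an assumed tile size), and each chunk is folded with
// op into exactly one output element.
template <class T, class Oper>
std::vector<T> reduceBlocksSketch(const std::vector<T> &in,
                                  std::size_t blockSize, Oper op)
{
    assert(blockSize > 0);
    std::vector<T> out;
    for (std::size_t start = 0; start < in.size(); start += blockSize) {
        // The last chunk may be shorter than blockSize.
        std::size_t end = start + blockSize < in.size() ? start + blockSize
                                                        : in.size();
        T acc = in[start];
        for (std::size_t i = start + 1; i < end; ++i)
            acc = op(acc, in[i]);
        out.push_back(acc);  // one output element per block
    }
    return out;
}
```

With seven inputs and an assumed block size of 3, the sketch yields three partial results, the last one covering a single leftover element.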
template<class Oper, class T>
void reduceArray (T *d_odata, const T *d_idata, size_t numElements, const CUDPPReducePlan *plan)
Array reduction function.
Performs multi-level reduction on large arrays using reduceBlocks().
[out] | d_odata | The output data pointer. This is a pointer to a single element. |
[in] | d_idata | The input data pointer. |
[in] | numElements | The number of elements to be reduced. |
[in] | plan | A pointer to the plan structure for the reduction. |
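A multi-level reduction can be pictured as repeated per-block passes until one element remains. The sketch below is a CPU analogue under stated assumptions, not the CUDPP implementation: `reducePassSketch`, `reduceArraySketch`, `SumOp`, and `blockSize` are hypothetical, and the loop simply repeats passes without modelling how the library sizes its levels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical operator standing in for the Oper template parameter.
struct SumOp {
    template <class T>
    T operator()(const T &a, const T &b) const { return a + b; }
};

// One pass: fold each chunk of blockSize elements into one partial result.
template <class T, class Oper>
std::vector<T> reducePassSketch(const std::vector<T> &in,
                                std::size_t blockSize, Oper op)
{
    std::vector<T> out;
    for (std::size_t start = 0; start < in.size(); start += blockSize) {
        std::size_t end = start + blockSize < in.size() ? start + blockSize
                                                        : in.size();
        T acc = in[start];
        for (std::size_t i = start + 1; i < end; ++i)
            acc = op(acc, in[i]);
        out.push_back(acc);
    }
    return out;
}

// Multi-level reduction: keep reducing the partial results until a single
// element is left. blockSize must exceed 1, otherwise no pass would shrink
// the array.
template <class T, class Oper>
T reduceArraySketch(std::vector<T> data, std::size_t blockSize, Oper op)
{
    assert(!data.empty());
    assert(blockSize > 1);
    while (data.size() > 1)
        data = reducePassSketch(data, blockSize, op);
    return data[0];
}
```

For ten inputs and an assumed block size of 4, the first pass leaves three partial results and the second pass combines them into the final value.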
void allocReduceStorage (CUDPPReducePlan *plan)
Allocate intermediate arrays used by reductions.
Reductions of large arrays must be split into multiple blocks, where each block is reduced by a single CUDA thread block. Each block writes its partial sum to global memory where it is reduced to a single element in a second pass.
[in,out] | plan | Pointer to CUDPPReducePlan object containing options and number of elements, which is used to compute storage requirements, and within which intermediate storage is allocated. |
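The storage requirement follows from the number of blocks, since each block contributes one partial result. The arithmetic below is a rough sketch only: `partialResultCount`, `intermediateBytes`, and `elementsPerBlock` are hypothetical, and the rule that a single-block reduction needs no intermediate array is an assumption made here rather than something stated by the library.

```cpp
#include <cassert>
#include <cstddef>

// Number of partial results produced by the first pass: one per block,
// i.e. ceil(numElements / elementsPerBlock).
std::size_t partialResultCount(std::size_t numElements,
                               std::size_t elementsPerBlock)
{
    assert(elementsPerBlock > 0);
    return (numElements + elementsPerBlock - 1) / elementsPerBlock;
}

// Bytes of intermediate storage for those partial results.
// Assumption: when everything fits in one block, the result can be written
// directly to the output and no intermediate array is needed.
std::size_t intermediateBytes(std::size_t numElements,
                              std::size_t elementsPerBlock,
                              std::size_t elementSize)
{
    std::size_t blocks = partialResultCount(numElements, elementsPerBlock);
    return blocks > 1 ? blocks * elementSize : 0;
}
```

For example, 1000 four-byte elements with an assumed 256 elements per block give four partial results, or 16 bytes of intermediate storage.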
void freeReduceStorage (CUDPPReducePlan *plan)
Deallocate intermediate block sums arrays in a CUDPPReducePlan object.
These arrays must have been allocated by allocReduceStorage(), which is called by the constructor of CUDPPReducePlan.
[in,out] | plan | Pointer to CUDPPReducePlan object initialized by allocReduceStorage(). |
void cudppReduceDispatch (void *d_odata, const void *d_idata, size_t numElements, const CUDPPReducePlan *plan)
Dispatch function to perform a parallel reduction on an array with the specified configuration.
This dispatch routine calls reduceArray() with the appropriate template parameters and arguments to perform the reduction specified in plan.
[out] | d_odata | The output data pointer. This is a pointer to the single reduced element. |
[in] | d_idata | The input array. |
[in] | numElements | The number of elements to reduce. |
[in] | plan | Pointer to CUDPPReducePlan object containing reduce options and intermediate storage |
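The step from untyped `void *` arguments to a templated call is commonly written as nested switches over the configured element type and operator. The sketch below shows that pattern on the CPU; every name in it (`SketchPlan`, `SketchType`, `SketchOp`, `reduceTyped`, `reduceDispatchSketch`) is hypothetical and is not the CUDPP plan layout or its option constants.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration, standing in for the options held by a plan.
enum SketchType { SKETCH_INT, SKETCH_FLOAT };
enum SketchOp { SKETCH_ADD, SKETCH_MAX };
struct SketchPlan { SketchType type; SketchOp op; };

struct SketchAdd {
    template <class T> T operator()(T a, T b) const { return a + b; }
};
struct SketchMax {
    template <class T> T operator()(T a, T b) const { return a > b ? a : b; }
};

// Typed worker: a sequential fold standing in for the templated reduction.
template <class Oper, class T>
void reduceTyped(T *out, const T *in, std::size_t n)
{
    assert(n > 0);
    Oper op;
    T acc = in[0];
    for (std::size_t i = 1; i < n; ++i)
        acc = op(acc, in[i]);
    *out = acc;  // single output element
}

// Second-level dispatch: the element type is known, select the operator.
template <class T>
void dispatchOp(void *out, const void *in, std::size_t n, SketchOp op)
{
    T *o = static_cast<T *>(out);
    const T *i = static_cast<const T *>(in);
    switch (op) {
    case SKETCH_ADD: reduceTyped<SketchAdd>(o, i, n); break;
    case SKETCH_MAX: reduceTyped<SketchMax>(o, i, n); break;
    }
}

// First-level dispatch: select the element type from the plan.
void reduceDispatchSketch(void *out, const void *in, std::size_t n,
                          const SketchPlan *plan)
{
    switch (plan->type) {
    case SKETCH_INT:   dispatchOp<int>(out, in, n, plan->op);   break;
    case SKETCH_FLOAT: dispatchOp<float>(out, in, n, plan->op); break;
    }
}
```

Keeping the type and operator choices in the plan lets callers pass plain `void *` buffers while the templated code still sees concrete types.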