CUDPP 1.1.1
CUDPP Application-Level API

Compact Functions

void calculatCompactLaunchParams (const unsigned int numElements, unsigned int &numThreads, unsigned int &numBlocks, unsigned int &numEltsPerBlock)
 Calculate launch parameters for compactArray().
template<class T >
void compactArray (T *d_out, size_t *d_numValidElements, const T *d_in, const unsigned int *d_isValid, size_t numElements, const CUDPPCompactPlan *plan)
 Compact the non-zero elements of an array.
void allocCompactStorage (CUDPPCompactPlan *plan)
 Allocate intermediate arrays used by cudppCompact().
void freeCompactStorage (CUDPPCompactPlan *plan)
 Deallocate intermediate storage used by cudppCompact().
void cudppCompactDispatch (void *d_out, size_t *d_numValidElements, const void *d_in, const unsigned int *d_isValid, size_t numElements, const CUDPPCompactPlan *plan)
 Dispatch compactArray for the specified datatype.

RadixSort Functions

typedef unsigned int uint
template<uint nbits, uint startbit, bool flip, bool unflip>
void radixSortStep (uint *keys, uint *values, const CUDPPRadixSortPlan *plan, uint numElements)
 Perform one step of the radix sort. Sorts by nbits key bits per step, starting at startbit.
template<bool flip>
void radixSortSingleBlock (uint *keys, uint *values, uint numElements)
 Single-block optimization for sorts of fewer than 4 * CTA_SIZE elements.
void radixSort (uint *keys, uint *values, const CUDPPRadixSortPlan *plan, size_t numElements, bool flipBits, int keyBits)
 Main radix sort function.
void radixSortFloatKeys (float *keys, uint *values, const CUDPPRadixSortPlan *plan, size_t numElements, bool negativeKeys, int keyBits)
 Wrapper to call main radix sort function. For float configuration.
template<uint nbits, uint startbit, bool flip, bool unflip>
void radixSortStepKeysOnly (uint *keys, const CUDPPRadixSortPlan *plan, uint numElements)
 Perform one step of the radix sort. Sorts by nbits key bits per step, starting at startbit.
template<bool flip>
void radixSortSingleBlockKeysOnly (uint *keys, uint numElements)
 Optimization for sorts of fewer than 4 * CTA_SIZE elements (keys only).
void radixSortKeysOnly (uint *keys, const CUDPPRadixSortPlan *plan, bool flipBits, size_t numElements, int keyBits)
 Main radix sort function. For keys only configuration.
void radixSortFloatKeysOnly (float *keys, const CUDPPRadixSortPlan *plan, bool negativeKeys, size_t numElements, int keyBits)
 Wrapper to call main radix sort function. For floats and keys only.
void initDeviceParameters (CUDPPRadixSortPlan *plan)
void allocRadixSortStorage (CUDPPRadixSortPlan *plan)
 From the programmer-specified sort configuration, creates internal memory for performing the sort.
void freeRadixSortStorage (CUDPPRadixSortPlan *plan)
 Deallocates intermediate memory from allocRadixSortStorage.
void cudppRadixSortDispatch (void *keys, void *values, size_t numElements, int keyBits, const CUDPPRadixSortPlan *plan)
 Dispatch function to perform a sort on an array with a specified configuration.

Scan Functions

template<class T , bool isBackward, bool isExclusive, CUDPPOperator op>
void scanArrayRecursive (T *d_out, const T *d_in, T **d_blockSums, size_t numElements, size_t numRows, const size_t *rowPitches, int level)
 Perform recursive scan on arbitrary size arrays.
void allocScanStorage (CUDPPScanPlan *plan)
 Allocate intermediate arrays used by scan.
void freeScanStorage (CUDPPScanPlan *plan)
 Deallocate intermediate block sums arrays in a CUDPPScanPlan object.
void cudppScanDispatch (void *d_out, const void *d_in, size_t numElements, size_t numRows, const CUDPPScanPlan *plan)
 Dispatch function to perform a scan (prefix sum) on an array with the specified configuration.

Segmented Scan Functions

template<class T , CUDPPOperator op, bool isBackward, bool isExclusive, bool doShiftFlagsLeft>
void segmentedScanArrayRecursive (T *d_out, const T *d_idata, const unsigned int *d_iflags, T **d_blockSums, unsigned int **d_blockFlags, unsigned int **d_blockIndices, int numElements, int level)
 Perform recursive scan on arbitrary size arrays.
void allocSegmentedScanStorage (CUDPPSegmentedScanPlan *plan)
 Allocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class.
void freeSegmentedScanStorage (CUDPPSegmentedScanPlan *plan)
 Deallocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class.
void cudppSegmentedScanDispatch (void *d_out, const void *d_idata, const unsigned int *d_iflags, int numElements, const CUDPPSegmentedScanPlan *plan)
 Dispatch function to perform a scan (prefix sum) on an array with the specified configuration.

Sparse Matrix-Vector Multiply Functions

template<class T >
void sparseMatrixVectorMultiply (T *d_y, const T *d_x, const CUDPPSparseMatrixVectorMultiplyPlan *plan)
 Perform matrix-vector multiply for sparse matrices and vectors of arbitrary size.
void allocSparseMatrixVectorMultiplyStorage (CUDPPSparseMatrixVectorMultiplyPlan *plan, const void *A, const unsigned int *rowindx, const unsigned int *indx)
 Allocate the intermediate product, flags, and rowFindx (index of the last element of each row) arrays.
void freeSparseMatrixVectorMultiplyStorage (CUDPPSparseMatrixVectorMultiplyPlan *plan)
 Deallocate the intermediate product, flags, and rowFindx (index of the last element of each row) arrays.
void cudppSparseMatrixVectorMultiplyDispatch (void *d_y, const void *d_x, const CUDPPSparseMatrixVectorMultiplyPlan *plan)
 Dispatch function to perform a sparse matrix-vector multiply with the specified configuration.

Detailed Description

The CUDPP Application-Level API contains functions that run on the host CPU and invoke GPU routines in the CUDPP Kernel-Level API. Application-Level API functions are used by CUDPP Public Interface functions to implement CUDPP's core functionality.


Function Documentation

void calculatCompactLaunchParams ( const unsigned int  numElements,
unsigned int &  numThreads,
unsigned int &  numBlocks,
unsigned int &  numEltsPerBlock 
)

Calculate launch parameters for compactArray().

Calculates the block size and number of blocks from the total number of elements and the maximum threads per block. Called by compactArray().

The number of blocks is calculated by dividing the number of input elements by the product of the number of threads in each CTA and the number of elements each thread processes. numThreads and numEltsPerBlock follow directly from that. When numElements is not an exact multiple of SCAN_ELTS_PER_THREAD * CTA_SIZE, some threads do nothing, or one thread processes fewer than SCAN_ELTS_PER_THREAD elements.

Parameters:
[in] numElements: Number of elements to compact
[out] numThreads: Number of threads in each block
[out] numBlocks: Number of blocks
[out] numEltsPerBlock: Number of elements processed per block
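
The calculation described above can be sketched on the host as follows. This is an illustration only, not the CUDPP source: the constants kCtaSize and kEltsPerThread are hypothetical stand-ins for CTA_SIZE and SCAN_ELTS_PER_THREAD, the function name is invented, and rounding the block count up (with a minimum of one block) is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical values; the real CTA_SIZE and SCAN_ELTS_PER_THREAD are
// compile-time constants defined inside CUDPP.
constexpr unsigned int kCtaSize = 128;
constexpr unsigned int kEltsPerThread = 8;

// Ceiling division for unsigned integers, written to avoid overflow.
constexpr unsigned int ceilDiv(unsigned int a, unsigned int b)
{
    return a / b + (a % b != 0 ? 1u : 0u);
}

// Sketch of the launch-parameter calculation (assumed, not the CUDPP code).
void sketchCompactLaunchParams(unsigned int numElements,
                               unsigned int &numThreads,
                               unsigned int &numBlocks,
                               unsigned int &numEltsPerBlock)
{
    // One full block covers kCtaSize * kEltsPerThread elements.
    // Assumption: round up, and always launch at least one block.
    numBlocks = std::max(1u, ceilDiv(numElements, kCtaSize * kEltsPerThread));

    // A single partial block only needs enough threads to cover the input.
    numThreads = (numBlocks > 1) ? kCtaSize
                                 : ceilDiv(numElements, kEltsPerThread);
    numEltsPerBlock = numThreads * kEltsPerThread;

    // Sanity check: a single block must cover the whole input.
    assert(numBlocks > 1 || numEltsPerBlock >= numElements);
}
```

With these assumed constants, 1000 elements fit in one block of 125 threads, while 2049 elements need three blocks of 128 threads, the last of which is almost entirely idle.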
template<class T >
void compactArray ( T *  d_out,
size_t *  d_numValidElements,
const T *  d_in,
const unsigned int *  d_isValid,
size_t  numElements,
const CUDPPCompactPlan *  plan 
)

Compact the non-zero elements of an array.

Given an input array d_in, compactArray() outputs a compacted version that omits null (zero) elements. It also outputs the number of non-zero elements in the compacted array. Called by cudppCompactDispatch().

The algorithm has two steps (most of the complexity is hidden in scan, invoked with cudppScanDispatch()).

  1. scanArray() performs a prefix sum on d_isValid to compute output indices.
  2. compactData() takes d_in and an intermediate array of output indices as input and writes the values with valid flags in d_isValid into d_out using the output indices.
Parameters:
[out] d_out: Array of compacted non-null elements
[out] d_numValidElements: Pointer to storage for the number of non-null elements
[in] d_in: Input array
[in] d_isValid: Array of flags, 1 for each non-null element, 0 for each null element. Same length as d_in
[in] numElements: Number of elements in input array
[in] plan: Pointer to the plan object used for this compact
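
The two steps can be illustrated with a sequential host-side reference. This is a minimal sketch, not the GPU implementation: the name compactReference is hypothetical, and the exclusive prefix sum that CUDPP performs with its scan engine is replaced here by a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for scan-based compaction (illustrative only).
// Returns the number of valid elements and writes them, in order, to out.
template <class T>
std::size_t compactReference(std::vector<T> &out,
                             const std::vector<T> &in,
                             const std::vector<unsigned int> &isValid)
{
    assert(in.size() == isValid.size());
    const std::size_t n = in.size();

    // Step 1: an exclusive prefix sum of the flags gives each valid
    // element its position in the output.
    std::vector<std::size_t> indices(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        indices[i] = running;
        running += (isValid[i] != 0) ? 1 : 0;
    }

    // Step 2: scatter every flagged element to its computed index.
    out.assign(running, T());
    for (std::size_t i = 0; i < n; ++i) {
        if (isValid[i] != 0) {
            out[indices[i]] = in[i];
        }
    }
    return running;
}
```

Because the output index of each element depends only on the scan result, step 2 has no dependencies between elements, which is what makes the approach suitable for a GPU.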
void allocCompactStorage ( CUDPPCompactPlan *  plan)

Allocate intermediate arrays used by cudppCompact().

In addition to the internal CUDPPScanPlan contained in CUDPPCompactPlan, CUDPPCompact also needs a temporary device array of output indices, which is allocated by this function.

Parameters:
plan: Pointer to CUDPPCompactPlan object within which intermediate storage is allocated.
void freeCompactStorage ( CUDPPCompactPlan *  plan)

Deallocate intermediate storage used by cudppCompact().

Deallocates the output indices array allocated by allocCompactStorage().

Parameters:
plan: Pointer to CUDPPCompactPlan object initialized by allocCompactStorage().
void cudppCompactDispatch ( void *  d_out,
size_t *  d_numValidElements,
const void *  d_in,
const unsigned int *  d_isValid,
size_t  numElements,
const CUDPPCompactPlan *  plan 
)

Dispatch compactArray for the specified datatype.

A thin wrapper around compactArray() that calls it for the data type specified in config. This is the app-level interface to compact used by cudppCompact().

Parameters:
[out] d_out: Compacted array of non-zero elements
[out] d_numValidElements: Pointer to storage for the number of non-zero elements
[in] d_in: Input array
[in] d_isValid: Array of boolean valid flags with same length as d_in
[in] numElements: Number of elements to compact
[in] plan: Pointer to plan object for this compact
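
The dispatch pattern used here (and by the other ...Dispatch functions on this page) maps a runtime data type tag to a template instantiation. The sketch below shows the general idea only. SketchDatatype, typedCompactSketch, and compactDispatchSketch are hypothetical names, and the type tag is passed directly rather than read from a plan object.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the data type tag stored in the plan's config.
enum class SketchDatatype { Int, UInt, Float };

// Hypothetical typed worker standing in for compactArray<T>(): copies the
// flagged elements to the front of out and returns how many were copied.
template <class T>
std::size_t typedCompactSketch(T *out, const T *in,
                               const unsigned int *isValid,
                               std::size_t numElements)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < numElements; ++i) {
        if (isValid[i] != 0) {
            out[count++] = in[i];
        }
    }
    return count;
}

// Untyped entry point: selects the template instantiation at run time.
std::size_t compactDispatchSketch(void *out, const void *in,
                                  const unsigned int *isValid,
                                  std::size_t numElements,
                                  SketchDatatype type)
{
    switch (type) {
    case SketchDatatype::Int:
        return typedCompactSketch(static_cast<int *>(out),
                                  static_cast<const int *>(in),
                                  isValid, numElements);
    case SketchDatatype::UInt:
        return typedCompactSketch(static_cast<unsigned int *>(out),
                                  static_cast<const unsigned int *>(in),
                                  isValid, numElements);
    case SketchDatatype::Float:
        return typedCompactSketch(static_cast<float *>(out),
                                  static_cast<const float *>(in),
                                  isValid, numElements);
    }
    assert(false && "unsupported datatype");
    return 0;
}
```

Keeping the void * signature at the dispatch layer lets the public C interface stay free of templates while the typed code is still generated per data type.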
template<uint nbits, uint startbit, bool flip, bool unflip>
void radixSortStep ( uint *  keys,
uint *  values,
const CUDPPRadixSortPlan *  plan,
uint  numElements 
)

Perform one step of the radix sort. Sorts by nbits key bits per step, starting at startbit.

Uses cudppScanDispatch() for the prefix sum of radix counters.

Parameters:
[in,out] keys: Keys to be sorted.
[in,out] values: Associated values to be sorted (through keys).
[in] plan: Configuration information for RadixSort.
[in] numElements: Number of elements in the sort.
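
One step of a least-significant-digit radix sort can be sketched sequentially as shown below. This is a generic textbook formulation under stated assumptions, not the CUDPP kernel sequence: the name radixSortStepReference is hypothetical, the flip and unflip template parameters are omitted, and std::uint32_t stands in for the uint typedef.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential reference for one radix sort step: a stable sort of the
// key/value pairs by the nbits-wide digit that starts at bit startbit.
template <std::uint32_t nbits, std::uint32_t startbit>
void radixSortStepReference(std::vector<std::uint32_t> &keys,
                            std::vector<std::uint32_t> &values)
{
    static_assert(nbits > 0 && nbits < 32 && startbit + nbits <= 32,
                  "digit must fit inside a 32-bit key");
    assert(keys.size() == values.size());

    const std::uint32_t numBuckets = 1u << nbits;
    const std::uint32_t mask = numBuckets - 1u;

    // Radix counters: how many keys fall into each digit bucket.
    std::vector<std::size_t> offsets(numBuckets, 0);
    for (std::uint32_t k : keys) {
        ++offsets[(k >> startbit) & mask];
    }

    // Exclusive prefix sum turns the counts into starting offsets.
    std::size_t running = 0;
    for (std::size_t &o : offsets) {
        const std::size_t count = o;
        o = running;
        running += count;
    }

    // Stable scatter of keys and their values to the new positions.
    std::vector<std::uint32_t> outKeys(keys.size());
    std::vector<std::uint32_t> outValues(values.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        const std::size_t dst = offsets[(keys[i] >> startbit) & mask]++;
        outKeys[dst] = keys[i];
        outValues[dst] = values[i];
    }
    keys.swap(outKeys);
    values.swap(outValues);
}
```

Applying the step for startbit = 0, nbits, 2 * nbits, and so on sorts the full key, because each step is stable and preserves the order established by the lower digits.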
template<bool flip>
void radixSortSingleBlock ( uint *  keys,
uint *  values,
uint  numElements 
)

Single-block optimization for sorts of fewer than 4 * CTA_SIZE elements.

Parameters:
[in,out] keys: Keys to be sorted.
[in,out] values: Associated values to be sorted (through keys).
numElements: Number of elements in the sort.
void radixSort ( uint *  keys,
uint *  values,
const CUDPPRadixSortPlan *  plan,
size_t  numElements,
bool  flipBits,
int  keyBits 
)

Main radix sort function.

Main radix sort function. Sorts in place in the keys and values arrays, but uses the other device arrays as temporary storage. All pointer parameters are device pointers. Uses cudppScan() for the prefix sum of radix counters.

Parameters:
[in,out] keys: Keys to be sorted.
[in,out] values: Associated values to be sorted (through keys).
[in] plan: Configuration information for RadixSort.
[in] numElements: Number of elements in the sort.
[in] flipBits: Set to true if the key datatype is float (which may hold negative numbers), to enable the special float sorting operations.
[in] keyBits: Number of interesting bits in the key
void radixSortFloatKeys ( float *  keys,
uint *  values,
const CUDPPRadixSortPlan *  plan,
size_t  numElements,
bool  negativeKeys,
int  keyBits 
)

Wrapper to call main radix sort function. For float configuration.

Calls the main radix sort function, radixSort(), with the parameters set for float keys.

Parameters:
[in,out] keys: Keys to be sorted.
[in,out] values: Associated values to be sorted (through keys).
[in] plan: Configuration information for RadixSort.
[in] numElements: Number of elements in the sort.
[in] negativeKeys: Set to true if the keys include negative numbers.
[in] keyBits: Number of interesting bits in the key
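
This page does not spell out what the float-specific handling does. A common technique for sorting IEEE 754 floats with an unsigned integer radix sort, assumed here purely for illustration, is to transform each key's bit pattern so that unsigned order matches float order, sort, and then invert the transform. The names floatFlip, floatUnflip, and sortFloatKeysReference are hypothetical, and std::sort stands in for the unsigned radix sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");

// Map a float to an unsigned key whose integer order matches float order.
inline std::uint32_t floatFlip(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    // Negative values: invert all bits. Non-negative values: set the sign bit.
    const std::uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
    return bits ^ mask;
}

// Inverse of floatFlip.
inline float floatUnflip(std::uint32_t key)
{
    const std::uint32_t mask = (key & 0x80000000u) ? 0x80000000u : 0xFFFFFFFFu;
    const std::uint32_t bits = key ^ mask;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Wrapper in the spirit of radixSortFloatKeys(): flip, sort as unsigned, unflip.
void sortFloatKeysReference(std::vector<float> &keys)
{
    std::vector<std::uint32_t> bits(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        bits[i] = floatFlip(keys[i]);
    }
    std::sort(bits.begin(), bits.end());  // stands in for the unsigned radix sort
    for (std::size_t i = 0; i < keys.size(); ++i) {
        keys[i] = floatUnflip(bits[i]);
    }
    // Sanity check: unsigned order of flipped keys matches float order.
    assert(std::is_sorted(keys.begin(), keys.end()));
}
```

If all keys are known to be non-negative, their raw bit patterns already sort correctly as unsigned integers, which is presumably why the wrapper takes a negativeKeys argument.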
template<uint nbits, uint startbit, bool flip, bool unflip>
void radixSortStepKeysOnly ( uint *  keys,
const CUDPPRadixSortPlan *  plan,
uint  numElements 
)

Perform one step of the radix sort. Sorts by nbits key bits per step, starting at startbit.

Parameters:
[in,out] keys: Keys to be sorted.
[in] plan: Configuration information for RadixSort.
[in] numElements: Number of elements in the sort.
template<bool flip>
void radixSortSingleBlockKeysOnly ( uint *  keys,
uint  numElements 
)

Optimization for sorts of fewer than 4 * CTA_SIZE elements (keys only).

Parameters:
[in,out] keys: Keys to be sorted.
numElements: Number of elements in the sort.
void radixSortKeysOnly ( uint *  keys,
const CUDPPRadixSortPlan *  plan,
bool  flipBits,
size_t  numElements,
int  keyBits 
)

Main radix sort function. For keys only configuration.

Sorts in place in the keys array, but uses the other device arrays as temporary storage. All pointer parameters are device pointers. Uses scan for the prefix sum of radix counters.

Parameters:
[in,out] keys: Keys to be sorted.
[in] plan: Configuration information for RadixSort.
[in] flipBits: Set to true if the key datatype is float (which may hold negative numbers), to enable the special float sorting operations.
[in] numElements: Number of elements in the sort.
[in] keyBits: Number of interesting bits in the key
void radixSortFloatKeysOnly ( float *  keys,
const CUDPPRadixSortPlan *  plan,
bool  negativeKeys,
size_t  numElements,
int  keyBits 
)

Wrapper to call main radix sort function. For floats and keys only.

Calls the radixSortKeysOnly function setting parameters for floats.

Parameters:
[in,out] keys: Keys to be sorted.
[in] plan: Configuration information for RadixSort.
[in] negativeKeys: Set to true if flipBits is to be true in radixSortKeysOnly().
[in] numElements: Number of elements in the sort.
[in] keyBits: Number of interesting bits in the key
void allocRadixSortStorage ( CUDPPRadixSortPlan *  plan)

From the programmer-specified sort configuration, creates internal memory for performing the sort.

Parameters:
[in] plan: Pointer to CUDPPRadixSortPlan object
void freeRadixSortStorage ( CUDPPRadixSortPlan *  plan)

Deallocates intermediate memory from allocRadixSortStorage.

Parameters:
[in] plan: Pointer to CUDPPRadixSortPlan object
void cudppRadixSortDispatch ( void *  keys,
void *  values,
size_t  numElements,
int  keyBits,
const CUDPPRadixSortPlan *  plan 
)

Dispatch function to perform a sort on an array with a specified configuration.

This is the dispatch routine which calls radixSort...() with appropriate template parameters and arguments as specified by the plan.

Parameters:
[in,out] keys: Keys to be sorted.
[in,out] values: Associated values to be sorted (through keys).
[in] numElements: Number of elements in the sort.
[in] keyBits: Number of interesting bits in the key
[in] plan: Configuration information for RadixSort.
template<class T , bool isBackward, bool isExclusive, CUDPPOperator op>
void scanArrayRecursive ( T *  d_out,
const T *  d_in,
T **  d_blockSums,
size_t  numElements,
size_t  numRows,
const size_t *  rowPitches,
int  level 
)

Perform recursive scan on arbitrary size arrays.

This is the CPU-side workhorse function of the scan engine. This function invokes the CUDA kernels which perform the scan on individual blocks.

Scans of large arrays must be split (possibly recursively) into a hierarchy of block scans, where each block is scanned by a single CUDA thread block. At each recursive level, scanArrayRecursive first invokes a kernel to scan all blocks of that level, and if the level has more than one block, it calls itself recursively. On returning from each recursive level, the total sum of each block from the level below is added to all elements of the corresponding block in this level. See "Parallel Prefix Sum (Scan) in CUDA" for more information (see References).

Template parameter T is the datatype; isBackward specifies backward or forward scan; isExclusive specifies exclusive or inclusive scan, and op specifies the binary associative operator to be used.

Parameters:
[out] d_out: The output array for the scan results
[in] d_in: The input array to be scanned
[out] d_blockSums: Array of arrays of per-block sums (one array per recursive level, allocated by allocScanStorage())
[in] numElements: The number of elements in the array to scan
[in] numRows: The number of rows in the array to scan
[in] rowPitches: Array of row pitches (one array per recursive level, allocated by allocScanStorage())
[in] level: The current recursive level of the scan
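
The recursive block structure described above can be sketched on the host as follows. This is a single-row, forward, exclusive add-scan written for illustration. The name scanRecursiveReference and the blockSize parameter are hypothetical, and each inner loop stands in for the work of one CUDA thread block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for a recursive, blocked exclusive add-scan, in place.
void scanRecursiveReference(std::vector<int> &data, std::size_t blockSize)
{
    assert(blockSize >= 2);  // guarantees each level is smaller than the last
    const std::size_t n = data.size();
    if (n == 0) {
        return;
    }
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;

    // Scan every block independently and record its total.
    std::vector<int> blockSums(numBlocks);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * blockSize);
        int running = 0;
        for (std::size_t i = b * blockSize; i < end; ++i) {
            const int v = data[i];
            data[i] = running;
            running += v;
        }
        blockSums[b] = running;
    }

    if (numBlocks > 1) {
        // Scan the block totals (recursively), then add each block's
        // offset to all of its elements.
        scanRecursiveReference(blockSums, blockSize);
        for (std::size_t b = 0; b < numBlocks; ++b) {
            const std::size_t end = std::min(n, (b + 1) * blockSize);
            for (std::size_t i = b * blockSize; i < end; ++i) {
                data[i] += blockSums[b];
            }
        }
    }
}
```

The recursion depth grows only logarithmically with the input size, since each level shrinks the problem by a factor of blockSize.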
void allocScanStorage ( CUDPPScanPlan *  plan)

Allocate intermediate arrays used by scan.

Scans of large arrays must be split (possibly recursively) into a hierarchy of block scans, where each block is scanned by a single CUDA thread block. At each recursive level of the scan, we need an array in which to store the total sums of all blocks in that level. This function computes the amount of storage needed and allocates it.

Parameters:
plan: Pointer to CUDPPScanPlan object containing options and number of elements, which is used to compute storage requirements, and within which intermediate storage is allocated.
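
The storage computation can be illustrated by counting how many block sums each recursive level needs. This is a sketch under assumptions: blockSumsPerLevel and eltsPerBlock are hypothetical names, a level is assumed to need a block sums array only when it has more than one block, and the actual device allocation is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of block sums needed at each recursive level,
// starting with the top level (illustrative only).
std::vector<std::size_t> blockSumsPerLevel(std::size_t numElements,
                                           std::size_t eltsPerBlock)
{
    assert(eltsPerBlock >= 2);
    std::vector<std::size_t> sizes;
    std::size_t n = numElements;
    while (n > eltsPerBlock) {  // more than one block at this level
        const std::size_t numBlocks = (n + eltsPerBlock - 1) / eltsPerBlock;
        sizes.push_back(numBlocks);
        n = numBlocks;  // the block sums become the next level's input
    }
    return sizes;
}
```

For example, with an assumed 1024 elements per block, 5,000,000 elements need 4883 block sums at the first level and 5 at the second, while an input of 1024 elements or fewer needs none.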
void freeScanStorage ( CUDPPScanPlan *  plan)

Deallocate intermediate block sums arrays in a CUDPPScanPlan object.

These arrays must have been allocated by allocScanStorage(), which is called by the constructor of CUDPPScanPlan.

Parameters:
plan: Pointer to CUDPPScanPlan object initialized by allocScanStorage().
void cudppScanDispatch ( void *  d_out,
const void *  d_in,
size_t  numElements,
size_t  numRows,
const CUDPPScanPlan *  plan 
)

Dispatch function to perform a scan (prefix sum) on an array with the specified configuration.

This is the dispatch routine which calls scanArrayRecursive() with appropriate template parameters and arguments to achieve the scan as specified in plan.

Parameters:
[out] d_out: The output array of scan results
[in] d_in: The input array
[in] numElements: The number of elements to scan
[in] numRows: The number of rows to scan in parallel
[in] plan: Pointer to CUDPPScanPlan object containing scan options and intermediate storage
template<class T , CUDPPOperator op, bool isBackward, bool isExclusive, bool doShiftFlagsLeft>
void segmentedScanArrayRecursive ( T *  d_out,
const T *  d_idata,
const unsigned int *  d_iflags,
T **  d_blockSums,
unsigned int **  d_blockFlags,
unsigned int **  d_blockIndices,
int  numElements,
int  level 
)

Perform recursive scan on arbitrary size arrays.

This is the CPU-side workhorse function of the segmented scan engine. This function invokes the CUDA kernels which perform the segmented scan on individual blocks.

Scans of large arrays must be split (possibly recursively) into a hierarchy of block scans, where each block is scanned by a single CUDA thread block. At each recursive level, segmentedScanArrayRecursive first invokes a kernel to scan all blocks of that level, and if the level has more than one block, it calls itself recursively. On returning from each recursive level, the total sum of each block from the level below is added to all elements of the first segment of the corresponding block in this level.

Template parameter T is the data type of the input data. Template parameter op is the binary operator of the segmented scan. Template parameter isBackward specifies whether the direction is backward (not implemented). It is forward if it is false. Template parameter isExclusive specifies whether the segmented scan is exclusive (true) or inclusive (false).

Parameters:
[out] d_out: The output array for the segmented scan results
[in] d_idata: The input array to be scanned
[in] d_iflags: The input flags vector which specifies the segments. The first element of a segment is marked by a 1 in the corresponding position in the d_iflags vector. All other elements of d_iflags are 0.
[out] d_blockSums: Array of arrays of per-block sums (one array per recursive level, allocated by allocSegmentedScanStorage())
[out] d_blockFlags: Array of arrays of per-block OR-reductions of flags (one array per recursive level, allocated by allocSegmentedScanStorage())
[out] d_blockIndices: Array of arrays of per-block min-reductions of indices (one array per recursive level, allocated by allocSegmentedScanStorage()). An index for a particular position i in a block is calculated as follows: if d_iflags[i] is set, it is the 1-based index of that position (i.e., if d_iflags[10] is set, the index is 11); otherwise the index is INT_MAX (the identity element of the min operator)
[in] numElements: The number of elements in the array to scan
[in] level: The current recursive level of the scan
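
The semantics of a forward segmented add-scan, as defined by the flags convention above, can be pinned down with a sequential reference. This sketch shows only what the result should be, not how the recursive GPU version computes it. The name segmentedScanReference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for a forward segmented add-scan (illustrative only).
// A non-zero flag marks the first element of a new segment.
template <bool isExclusive>
std::vector<int> segmentedScanReference(const std::vector<int> &in,
                                        const std::vector<unsigned int> &flags)
{
    assert(in.size() == flags.size());
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i] != 0) {
            running = 0;  // restart the sum at each segment head
        }
        if (isExclusive) {
            out[i] = running;  // sum of earlier elements in the segment
            running += in[i];
        } else {
            running += in[i];
            out[i] = running;  // sum including this element
        }
    }
    return out;
}
```

A reference like this is useful for validating GPU output on small inputs, where segment boundaries that straddle block boundaries are the usual source of errors.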
void allocSegmentedScanStorage ( CUDPPSegmentedScanPlan *  plan)

Allocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class.

Segmented scans of large arrays must be split (possibly recursively) into a hierarchy of block segmented scans, where each block is scanned by a single CUDA thread block. At each recursive level of the scan, we need an array in which to store the total sums of all blocks in that level. Each level also needs two more arrays: one containing the OR-reductions of the flags of all blocks at that level, and one containing the min-reductions of the indices of all blocks at that level. This function computes the amount of storage needed and allocates it.

Parameters:
[in] plan: Pointer to CUDPPSegmentedScanPlan object containing segmented scan options and number of elements, which is used to compute storage requirements.
void freeSegmentedScanStorage ( CUDPPSegmentedScanPlan *  plan)

Deallocate intermediate block sums, block flags and block indices arrays in a CUDPPSegmentedScanPlan class.

These arrays must have been allocated by allocSegmentedScanStorage(), which is called by the constructor of CUDPPSegmentedScanPlan.

Parameters:
[in] plan: CUDPPSegmentedScanPlan class initialized by its constructor.
void cudppSegmentedScanDispatch ( void *  d_out,
const void *  d_idata,
const unsigned int *  d_iflags,
int  numElements,
const CUDPPSegmentedScanPlan *  plan 
)

Dispatch function to perform a scan (prefix sum) on an array with the specified configuration.

This is the dispatch routine which calls segmentedScanArrayRecursive() with appropriate template parameters and arguments to achieve the scan as specified in plan.

Parameters:
[in] numElements: The number of elements to scan
[in] plan: Segmented Scan configuration (plan), initialized by CUDPPSegmentedScanPlan constructor
[in] d_idata: The input array
[in] d_iflags: The input flags array
[out] d_out: The output array of segmented scan results
template<class T >
void sparseMatrixVectorMultiply ( T *  d_y,
const T *  d_x,
const CUDPPSparseMatrixVectorMultiplyPlan *  plan 
)

Perform matrix-vector multiply for sparse matrices and vectors of arbitrary size.

This function performs the sparse matrix-vector multiply by executing four steps.

1. The sparseMatrixVectorFetchAndMultiply() kernel does an element-wise multiplication of each element e in CUDPPSparseMatrixVectorMultiplyPlan::m_d_A with the corresponding element of d_x (that is, the element of d_x whose row equals the column index of e) and stores the product in CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod. It also sets all elements of CUDPPSparseMatrixVectorMultiplyPlan::m_d_flags to 0.

2. The sparseMatrixVectorSetFlags() kernel iterates over each element in CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowIndex and sets the corresponding position (indicated by CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowIndex) in CUDPPSparseMatrixVectorMultiplyPlan::m_d_flags to 1.

3. Perform a segmented scan on CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod with CUDPPSparseMatrixVectorMultiplyPlan::m_d_flags as the flag vector. The output is stored in CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod.

4. The yGather() kernel goes over each element in CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowFinalIndex, picks the corresponding element (indicated by CUDPPSparseMatrixVectorMultiplyPlan::m_d_rowFinalIndex) from CUDPPSparseMatrixVectorMultiplyPlan::m_d_prod, and stores it in d_y.

Parameters:
[out] d_y: The output array for the sparse matrix-vector multiply (y vector)
[in] d_x: The input x vector
[in] plan: Pointer to the CUDPPSparseMatrixVectorMultiplyPlan object which stores the configuration and pointers to temporary buffers needed by this routine
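
The four steps above can be traced with a sequential host-side reference. This is an illustrative sketch under several assumptions: the name spmvReference is hypothetical, indices are 0-based, every row has at least one stored element, the segmented scan is taken to be a forward inclusive add-scan, and the index of each row's last element is derived from rowindx rather than read from a precomputed array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for y = A * x using the scan-based formulation.
// A holds the stored elements row by row, indx holds the column of each
// element, and rowindx holds the index in A of the first element of each row.
std::vector<float> spmvReference(const std::vector<float> &A,
                                 const std::vector<unsigned int> &indx,
                                 const std::vector<unsigned int> &rowindx,
                                 const std::vector<float> &x)
{
    assert(A.size() == indx.size());
    const std::size_t nnz = A.size();
    const std::size_t numRows = rowindx.size();

    // Step 1: element-wise products, with all flags cleared.
    std::vector<float> prod(nnz);
    std::vector<unsigned int> flags(nnz, 0);
    for (std::size_t i = 0; i < nnz; ++i) {
        prod[i] = A[i] * x[indx[i]];
    }

    // Step 2: flag the first element of every row.
    for (std::size_t r = 0; r < numRows; ++r) {
        flags[rowindx[r]] = 1;
    }

    // Step 3: inclusive segmented add-scan of prod, in place (assumed).
    for (std::size_t i = 1; i < nnz; ++i) {
        if (flags[i] == 0) {
            prod[i] += prod[i - 1];
        }
    }

    // Step 4: the last element of each row now holds that row's sum.
    std::vector<float> y(numRows);
    for (std::size_t r = 0; r < numRows; ++r) {
        const std::size_t next = (r + 1 < numRows) ? rowindx[r + 1] : nnz;
        y[r] = prod[next - 1];
    }
    return y;
}
```

For the 3 x 3 matrix with rows (1, 0, 2), (0, 3, 0), and (4, 0, 5), the inputs are A = {1, 2, 3, 4, 5}, indx = {0, 2, 1, 0, 2}, and rowindx = {0, 2, 3}.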
void allocSparseMatrixVectorMultiplyStorage ( CUDPPSparseMatrixVectorMultiplyPlan *  plan,
const void *  A,
const unsigned int *  rowindx,
const unsigned int *  indx 
)

Allocate the intermediate product, flags, and rowFindx (index of the last element of each row) arrays.

Parameters:
[in] plan: Pointer to CUDPPSparseMatrixVectorMultiplyPlan class containing sparse matrix-vector multiply options, number of non-zero elements, and number of rows, which is used to compute storage requirements
[in] A: The matrix A
[in] rowindx: The indices of elements in A which are the first element of their row
[in] indx: The column number for each element in A
void freeSparseMatrixVectorMultiplyStorage ( CUDPPSparseMatrixVectorMultiplyPlan *  plan)

Deallocate the intermediate product, flags, and rowFindx (index of the last element of each row) arrays.

These arrays must have been allocated by allocSparseMatrixVectorMultiplyStorage(), which is called by the constructor of CUDPPSparseMatrixVectorMultiplyPlan.

Parameters:
[in] plan: Pointer to CUDPPSparseMatrixVectorMultiplyPlan plan initialized by its constructor.
void cudppSparseMatrixVectorMultiplyDispatch ( void *  d_y,
const void *  d_x,
const CUDPPSparseMatrixVectorMultiplyPlan *  plan 
)

Dispatch function to perform a sparse matrix-vector multiply with the specified configuration.

This is the dispatch routine which calls sparseMatrixVectorMultiply() with appropriate template parameters and arguments.

Parameters:
[out] d_y: The output vector for y = A*x
[in] d_x: The x vector for y = A*x
[in] plan: The sparse matrix plan and data