Introduction

CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel-prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.

Overview Presentation

A brief set of slides that describe the features, design principles, applications and impact of CUDPP is available: CUDPP Presentation.

Home Page

Homepage for CUDPP: http://cudpp.github.io/

Announcements and discussion of CUDPP are hosted on the CUDPP Google Group.

Getting Started with CUDPP

You may want to start by browsing the CUDPP Public Interface. For information on building CUDPP, see Building CUDPP. See Overview of CUDPP hash tables for an overview of CUDPP's hash table support.

The "apps" subdirectory included with CUDPP has a few source code samples that use CUDPP:

simpleCUDPP, a simple example of using cudppScan()
satGL, an example of using cudppMultiScan() to generate a summed-area table (SAT) of a scene rendered in real time. The SAT is then used to simulate depth of field blur. This example is not currently working due to CUDA graphics interop changes. Volunteers to update it welcome!
cudpp_testrig, a comprehensive test application for all the functionality of CUDPP
cudpp_hash_testrig, a comprehensive test application for CUDPP's hash table data structures

We have also provided a code walkthrough of the simpleCUDPP example.

Getting Help and Reporting Problems

To get help using CUDPP, please use the CUDPP Google Group.

To report CUDPP bugs or request features, please file an issue directly using Github.

Release Notes

For specific release details see the Change Log.

Known Issues

For a complete list of issues, see the CUDPP issues list on Github.

There is a known issue that the compile time for CUDPP is very long and the compiled library file size is very large. On some systems with < 4GB of available memory (or virtual memory: e.g. 32-bit OS), the CUDA compiler can run out of memory and compilation can fail. We will be working on these issues for future releases. You can reduce compile time by only targetting GPU architectures that you plan to run on, using the CUDPP_GENCODE_* CMake options.

Algorithm Input Size Limitations

The following maximum size limitations currently apply. In some cases this is the theory—the algorithms may not have been tested to the maximum size. Also, for things like 32-bit integer scans, precision often limits the useful maximum size.

Algorithm	Maximum Supported Size
CUDPP_SCAN	67,107,840 elements
CUDPP_SEGMENTED_SCAN	67,107,840 elements
CUDPP_COMPACT	67,107,840 elements
CUDPP_COMPRESS	1,048,576 elements
CUDPP_LISTRANK	NO LIMIT
CUDPP_MTF	1,048,576 elements
CUDPP_BWT	1,048,576 elements
CUDPP_STRINGSORT	2,147,450,880 elements
CUDPP_MERGESORT	2,147,450,880 elements
CUDPP_REDUCE	NO LIMIT
CUDPP_RAND	33,554,432 elements
CUDPP_SPMVMULT	67,107,840 non-zero elements
CUDPP_HASH	See Hash Space Limitations
CUDPP_TRIDIAGONAL	65535 systems, 1024 equations per system (Compute capability 2.x), 512 equations per system (Compute capability < 2.0)

Operating System Support and Requirements

This release (2.1) has been thoroughly tested on the following OSes. For more information, visit our test results page.

Windows 7 (64-bit) (CUDA 5.0)
Ubuntu Linux (64-bit) (CUDA 5.0, CUDA 5.5)
Mac OS X 10.8.3 (Snow Leopard/Lion, 64-bit) (CUDA 5.0, CUDA 5.5)

We expect CUDPP to build and run correctly on other flavors of Linux and Windows, but only the above are actively tested at this time. Version 2.1 does not currently support 32-bit operating systems.

Requirements

CUDPP 2.1 has not been tested with any CUDA version < 5.0.

CUDA

CUDPP is implemented in CUDA C/C++. It requires the CUDA Toolkit. Please see the NVIDIA CUDA homepage to download CUDA as well as the CUDA Programming Guide and CUDA SDK, which includes many CUDA code examples.

Design Goals

Design goals for CUDPP include:

Performance. We aim to provide best-of-class performance for our primitives. We welcome suggestions and contributions that will improve CUDPP performance. We also want to provide primitives that can be easily benchmarked, and compared against other implementations on GPUs and other processors.
Modularity. We want our primitives to be easily included in other applications. To that end we have made the following design decisions:
- CUDPP is provided as a library that can link against other applications.
- CUDPP calls run on the GPU on GPU data. Thus they can be used as standalone calls on the GPU (on GPU data initialized by the calling application) and, more importantly, as GPU components in larger CPU/GPU applications.
CUDPP is implemented as 4 layers:
- The Public Interface is the external library interface, which is the intended entry point for most applications. The public interface calls into the Application-Level API.
- The Application-Level API comprises functions callable from CPU code. These functions execute code jointly on the CPU (host) and the GPU by calling into the Kernel-Level API below them.
- The Kernel-Level API comprises functionsthat run entirely on the GPU across an entire grid of thread blocks. These functions may call into the CTA-Level API below them.
- The CTA-Level API comprises functions that run entirely on the GPU within a single Cooperative Thread Array (CTA, aka a CUDA thread block). These are low-level functions that implement core data-parallel algorithms, typically by processing data within shared (CUDA __shared__) memory.

Programmers may use any of the lower three CUDPP layers in their own programs by building the source directly into their application. However, the typical usage of CUDPP is to link to the library and invoke functions in the CUDPP Public Interface, as in the simpleCUDPP, satGL, cudpp_testrig, and cudpp_hash_testrig application examples included in the CUDPP distribution.

Use Cases

We expect the normal use of CUDPP will be in one of two ways:

Linking the CUDPP library against another application.
Running the "test" applications, cudpp_testrig and cudpp_hash_testrig, that exercise CUDPP functionality.

References

The following publications describe work incorporated in CUDPP.

Mark Harris, Shubhabrata Sengupta, and John D. Owens. "Parallel Prefix Sum (Scan) with CUDA". In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=916
Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. "Scan Primitives for GPU Computing". In Graphics Hardware 2007, pages 97–106, August 2007. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=915
Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". In Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium, May 2009. http://mgarland.org/papers.html#gpusort
Stanley Tzeng, Li-Yi Wei. "Parallel White Noise Generation on a GPU via Cryptographic Hash". In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 79–87, February 2008. http://research.microsoft.com/apps/pubs/default.aspx?id=70502
Yao Zhang, Jonathan Cohen, and John D. Owens. Fast Tridiagonal Solvers on the GPU. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010), pages 127–136, January 2010. http://www.cs.ucdavis.edu/publications/print_pub?pub_id=978
Yao Zhang, Jonathan Cohen, Andrew A. Davidson, and John D. Owens. A Hybrid Method for Solving Tridiagonal Systems on the GPU. In Wen-mei W. Hwu, editor, GPU Computing Gems. Morgan Kaufmann. July 2011.
Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1041
Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1–154:9, December 2009. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=973
Dan A. Alcantara, Vasily Volkov, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Building an Efficient Hash Table on the GPU. In Wen-mei W. Hwu, editor, GPU Computing Gems, volume 2, chapter 1. Morgan Kaufmann, August 2011.
Ritesh A. Patel, Yao Zhang, Jason Mak, Andrew Davidson, John D. Owens. "Parallel Lossless Data Compression on the GPU". In Proceedings of Innovative Parallel Computing (InPar '12), May 2012. http://idav.ucdavis.edu/publications/print_pub?pub_id=1087
Andrew Davidson, David Tarjan, Michael Garland, and John D. Owens. Efficient Parallel Merge Sort for Fixed and Variable Length Keys. In Proceedings of Innovative Parallel Computing (InPar '12), May 2012. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1085

Many researchers are using CUDPP in their work, and there are many publications that have used it (references). If your work uses CUDPP, please let us know by sending us a reference (preferably in BibTeX format) to your work.

Citing CUDPP

If you make use of CUDPP primitives in your work and want to cite CUDPP (thanks!), we would prefer for you to cite the appropriate papers above, since they form the core of CUDPP. To be more specific, the GPU Gems paper (Harris et al.) describes (unsegmented) scan, multi-scan for summed-area tables, and stream compaction. The Sengupta et al. book chapter describes the current scan and segmented scan algorithms used in the library, and the Sengupta et al. Graphics Hardware paper describes an earlier implementation of segmented scan, quicksort, and sparse matrix-vector multiply. The IPDPS paper (Satish et al.) describes the radix sort used in CUDPP (prior to CUDPP 2.0. Later releases use Thrust::sort), and the I3D paper (Tzeng and Wei) describes the random number generation algorithm. The two Alcantara papers describe the hash algorithms. The two Zhang papers describe the tridiagonal solvers.

Credits

CUDPP Developers

Mark Harris, NVIDIA Corporation
John D. Owens, University of California, Davis
Shubho Sengupta, University of California, Davis
Stanley Tzeng, University of California, Davis
Yao Zhang, University of California, Davis
Andrew Davidson, University of California, Davis
Ritesh Patel, University of California, Davis

Other CUDPP Contributors

Anjul Patney, University of California, Davis [general help]
Edmund Yan, University of California, Davis [Release Manager]
Dan Alcantara, University of California, Davis [hash tables]
Nadatur Satish, University of California, Berkeley [(old)radix sort]

Acknowledgments

Thanks to Jim Ahrens, Timo Aila, Nathan Bell, Ian Buck, Guy Blelloch, Jeff Bolz, Michael Garland, Jeff Inman, Eric Lengyel, Samuli Laine, David Luebke, Pat McCormick, Duane Merrill, and Richard Vuduc for their contributions during the development of this library.

CUDPP Developers from UC Davis thank their funding agencies:

National Science Foundation (grants CCF-0541448, IIS-0964357, and particularly OCI-1032859)
Department of Energy Early Career Principal Investigator Award DE-FG02-04ER25609
SciDAC Institute for Ultrascale Visualization (http://www.iusv.org/)
Los Alamos National Laboratory
Generous hardware donations from NVIDIA

CUDPP Copyright and Software License

CUDPP is copyright The Regents of the University of California, Davis campus and NVIDIA Corporation. The library, examples, and all source code are released under the BSD license, designed to encourage reuse of this software in other projects, both commercial and non-commercial. For details, please see the license page.

Non source-code content (such as documentation, web pages, etc.) from CUDPP is distributed under a Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.

Note that prior to release 1.1 of CUDPP, the license used was a modified BSD license. With release 1.1, this license was replaced with the pure BSD license to facilitate the use of open source hosting of the code.

CUDPP also includes the Mersenne twister code of Makoto Matsumoto, also licensed under BSD.

CUDPP also calls functions in the Thrust template library, which is included with the CUDA Toolkit and licensed under the Apache 2.0 open source license.

CUDPP also includes a modified version of FindGLEW.cmake from nvidia-texture-tools, licensed under the MIT license.