cuPoisson: a library to solve Poisson's equation on a cluster of Nvidia CUDA GPUs

cuPoisson is a library to solve Poisson's equation on a cluster of Nvidia CUDA GPUs. It is written in CUDA C and all of its functionality is exposed through C functions. The implementation and some execution results are described in the following article: Jose L. Jodra, Ibai Gurrutxaga, Javier Muguerza and Ainhoa Yera, "Solving Poisson's equation using FFT in a GPU cluster", Journal of Parallel and Distributed Computing, vol. 102, pp. 28-36, 2017. Please cite this article in your publications if you use cuPoisson.

The latest version of the library is available at http://www.aldapa.eus/res/cuPoisson/.

This document shows how to build and use this library.

Index

Installation
Using the library
Copyright license
Contact information

Installation

To build this library you will need the Nvidia CUDA SDK and the CMake build system installed. You have to specify the build configuration through CMake; the most important configuration options can be reviewed and adjusted interactively with ccmake.

We recommend NOT building the library inside the source code tree, so you should create a new folder for the build. For example, you can type the following commands in a shell:

mkdir build
cd build
ccmake ..
make

These commands build the code and create two important files for you in the build folder.

The installation process is not implemented yet, so you must manage these files yourself.

Using the library

The library currently allows solving Poisson's equation only on 3-dimensional periodic grids. It is designed so that library users do not need to write any CUDA code directly. The serial module of the library implements the solver that runs on a single node and does not use MPI. Do not be misled by its name, since it is by no means serial: it can run on several GPUs of the same node using POSIX threads and, of course, the execution within each GPU is itself parallel.

Before using the library you must call the function cup_init. This function initializes the library and allows the user to specify which GPU devices will be used. If the MPI version is going to be used, a single GPU must be specified per MPI process. The function cup_finish frees the resources used by the library, and cup_is_init can be used to check whether the library is initialized.
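The following minimal sketch illustrates this initialization sequence. The header name (cupoisson.h), the argument lists and the return values shown here are assumptions made for illustration only; the real prototypes are those declared in the header file produced in the build folder.

/* Illustrative sketch: the header name and all prototypes are assumed. */
#include "cupoisson.h"

int main(void)
{
    int devices[2] = {0, 1};      /* use GPUs 0 and 1 of this node */

    cup_init(2, devices);         /* initialize the library on the chosen GPUs (assumed signature) */
    if (!cup_is_init())
        return 1;                 /* initialization failed */

    /* ... create grids and solvers, run the solver (see below) ... */

    cup_finish();                 /* free all resources held by the library */
    return 0;
}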

The cup_malloc function allocates pinned memory, so GPU/CPU transfers can be faster. Memory allocated this way must be freed with cup_free. The function cup_error_string can be used to translate the library's error codes into human-readable descriptions.
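As a sketch, assuming an argument order and integer error codes that are purely illustrative, a pinned host buffer could be handled like this:

#include <stdio.h>

/* Illustrative fragment: cup_malloc's argument order and error codes are assumed. */
static double *alloc_pinned_buffer(size_t npoints)
{
    double *buf = NULL;
    int err = cup_malloc((void **)&buf, npoints * sizeof(double));  /* pinned host memory */
    if (err != 0) {
        fprintf(stderr, "cup_malloc failed: %s\n", cup_error_string(err));
        return NULL;
    }
    return buf;
}

/* Later, the buffer must be released with cup_free(buf). */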

There are two functions to manage the data grids used by the library: cup_create_grid and cup_destroy_grid. The first function sets the properties of the grid (number of dimensions, number of points and size), but does not store the actual data.
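For example, a 128 x 128 x 128 grid of unit size could be described as in the sketch below; the handle type cup_grid_t and the argument order are assumptions, not the library's actual declarations.

/* Illustrative sketch: handle type and argument order are assumed. */
cup_grid_t grid;
int    npoints[3] = {128, 128, 128};   /* points per dimension */
double size[3]    = {1.0, 1.0, 1.0};   /* size per dimension */

cup_create_grid(&grid, 3, npoints, size);   /* describes the grid; it holds no data */
/* ... create a solver for this grid and run it ... */
cup_destroy_grid(grid);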

A serial (non-MPI) solver can be created by calling cup_create_solver and destroyed by calling cup_destroy_solver. Between these two calls cup_exec_solver can be called as many times as needed. Since cup_exec_solver does not block the calling CPU thread, the thread can check the solver status by calling cup_is_solver_ready or block until the solver has finished by calling cup_wait_solver.
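A sketch of this life cycle could look as follows; the handle type cup_solver_t, the argument order and the meaning of the return values are assumptions made for illustration only.

/* Illustrative sketch: prototypes are assumed. rho and phi are pinned host
 * buffers and grid is a previously created grid (see above). */
cup_solver_t solver;

cup_create_solver(&solver, grid);        /* one-time setup for this grid */

cup_exec_solver(solver, rho, phi);       /* returns immediately; the solve runs in the background */
while (!cup_is_solver_ready(solver)) {
    /* do other CPU work while the GPUs solve the equation */
}
cup_wait_solver(solver);                 /* or simply block here until the result is ready */

cup_destroy_solver(solver);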

The library user must provide CPU pointers to the input and output data in the call to cup_exec_solver, and the library copies the data to and from the GPU. If any of the pointers is NULL, the library assumes that the data is already on the GPU. This way, a solver's output can be the input of the next execution. The user can access the solver's output data on the GPU by calling cup_get_solver_data, which returns an array of GPU pointers (one per GPU). Managing these pointers requires the library user to make direct use of CUDA, so prior knowledge of the CUDA platform and its usage is required.
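For instance, chaining two executions without copying the intermediate result back to the CPU could look like the following sketch; the exact form of the array returned by cup_get_solver_data is an assumption.

cup_exec_solver(solver, rho, NULL);      /* NULL output: leave the result on the GPU(s) */
cup_wait_solver(solver);

cup_exec_solver(solver, NULL, phi);      /* NULL input: reuse the data already on the GPU(s) */
cup_wait_solver(solver);

/* Direct access to the device data (one pointer per GPU); using these
 * pointers requires writing plain CUDA code. */
void **dev_data = cup_get_solver_data(solver);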

An MPI solver can be created by calling cup_mpi_create_solver and destroyed by calling cup_mpi_destroy_solver. Between these two calls cup_mpi_exec_solver can be called as many times as needed. By default, calls to cup_mpi_exec_solver block the calling CPU thread. If non-blocking calls are needed, the environment variable CUP_MPI_NONBLOCKING can be set to a non-zero value. Note that this option requires an MPI implementation that provides the MPI_THREAD_SERIALIZED level of thread support. The non-blocked CPU thread can check the solver status by calling cup_mpi_is_solver_ready and, similarly, block until the solver has finished by calling cup_mpi_wait_solver.
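An outline of the whole MPI life cycle might look like the sketch below. The handle type, the communicator argument, the header name and all prototypes are assumptions, as is the use of MPI_Init_thread by the caller; treat this only as an orientation and check the library header and examples for the real interface.

#include <mpi.h>
#include "cupoisson.h"   /* assumed header name */

int main(int argc, char **argv)
{
    /* MPI_THREAD_SERIALIZED is only required when CUP_MPI_NONBLOCKING is enabled. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

    int my_gpu = 0;                      /* exactly one GPU per MPI process */
    cup_init(1, &my_gpu);                /* assumed signature */

    double *rho = NULL, *phi = NULL;     /* allocate with cup_malloc as shown above */
    cup_grid_t grid;                     /* create with cup_create_grid as shown above */

    cup_mpi_solver_t solver;
    cup_mpi_create_solver(&solver, grid, MPI_COMM_WORLD);   /* communicator argument is an assumption */

    cup_mpi_exec_solver(solver, rho, phi);   /* blocking by default */
    cup_mpi_wait_solver(solver);             /* needed only when CUP_MPI_NONBLOCKING is set */

    cup_mpi_destroy_solver(solver);
    cup_finish();
    MPI_Finalize();
    return 0;
}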

Data transfers occur as in the serial module. If the user wants to access the solver's output data on the GPU, an array containing a single GPU pointer can be obtained by calling cup_mpi_get_solver_data.

The equation-solving process is segmented to overlap CPU/GPU data transfers, MPI communication and kernel execution. Each segment is composed of an integer number of grid planes. The user can set the segment size by setting the CUP_MPI_SEGMENT_SIZE environment variable to the minimum segment size in bytes (the K, M and G suffixes are allowed for KiB, MiB and GiB, respectively). A zero value results in a non-segmented execution. The default segment size is 128K.
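For example, both environment variables could be set before launching the application. The application name and process count below are placeholders, and how environment variables are propagated to remote ranks depends on your MPI launcher.

export CUP_MPI_SEGMENT_SIZE=256M
export CUP_MPI_NONBLOCKING=1
mpirun -np 8 ./my_poisson_app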

Copyright license

The cuPoisson library is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. The cuPoisson library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Copyright 2016 University of the Basque Country, UPV/EHU.

Contact information

You can contact the authors at i.gurrutxaga@ehu.eus, j.muguerza@ehu.eus and joseluis.jodra@ehu.eus.