Testing GPU functionality

This section provides information and links that help with testing CoreNEURON’s GPU support. Other sections of the documentation that may be relevant are:

  • The Getting CoreNEURON section, which documents both building from source with CoreNEURON support and installing Python wheels.
  • The Running a simulation section, which explains the basics of porting a NEURON model to use CoreNEURON.
  • The Running GPU benchmarks section, which outlines how to use profiling tools such as Caliper, NVIDIA Nsight Systems, and NVIDIA Nsight Compute.

This section gives some basic information about how to check whether GPU execution is working. This can be useful if, for example, you need to test a change to how the GPU wheels are built, or to test GPU execution on a new system.

Accessing GPU resources

If your local system has an NVIDIA GPU installed, you can probably skip this section. You can check this with the nvidia-smi tool, which shows the GPUs attached to the system:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2200        Off  | 00000000:01:00.0 Off |                  N/A |
| 45%   33C    P8     4W /  75W |     71MiB /  5049MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
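For a scripted check, for example in a CI job, you can use nvidia-smi -L instead of reading the table by eye. It prints one line per GPU. The following bash sketch counts those lines. The count_gpus function name is hypothetical, and the sketch assumes the listing format "GPU <index>: <name> (UUID: ...)".

```shell
#!/usr/bin/env bash
# Count the GPUs visible on this machine by parsing `nvidia-smi -L`,
# which prints one line per device, e.g.
#   GPU 0: Quadro P2200 (UUID: GPU-...)

# Read an `nvidia-smi -L` listing on stdin and print the number of GPU lines.
count_gpus() {
  # grep -c prints 0 (and exits 1) when nothing matches; ignore that status.
  grep -c '^GPU [0-9]' || true
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # If the driver is missing, nvidia-smi prints an error instead of GPU lines,
  # so the count is 0.
  n=$(nvidia-smi -L 2>/dev/null | count_gpus)
  echo "nvidia-smi reports ${n} GPU(s)"
else
  echo "nvidia-smi not found: is the NVIDIA driver installed on this node?"
fi
```

A count of 0, or a missing nvidia-smi, usually means you are not on a GPU node.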

On a university cluster or supercomputer, you will typically need to pass some kind of extra constraint to the job scheduler. For example, on the BlueBrain5 system, which uses Slurm, you can allocate a GPU node using the volta constraint:

[login node] $ salloc -A <account> -C volta
salloc: Granted job allocation 294001
...
[compute node] $ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                  Off |
...
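The salloc example above gives you an interactive shell on a GPU node. You can also submit a non-interactive equivalent with sbatch. The script below is a sketch. The -A and -C volta options come from the example above. The node count, the time limit, and the describe_allocation helper are assumptions. Constraint names are site-specific, and some systems also expect a --gres=gpu:<n> or --gpus option.

```shell
#!/bin/bash
# Replace <account> with your project account.
#SBATCH -A <account>
# GPU node constraint used on BlueBrain5; constraint names are site-specific.
#SBATCH -C volta
# One node and a short time limit are enough for a quick check (assumed values).
#SBATCH -N 1
#SBATCH -t 00:10:00

# Print where we are running. Slurm sets SLURM_JOB_ID inside a job, so an
# empty value means the script was started outside an allocation.
describe_allocation() {
  if [ -z "${SLURM_JOB_ID:-}" ]; then
    echo "not inside a Slurm allocation"
    return 1
  fi
  echo "job ${SLURM_JOB_ID} on $(hostname)"
  nvidia-smi
}

describe_allocation || echo "submit this script with sbatch to get a GPU node"
```

Submit the script with sbatch followed by the script's file name. The nvidia-smi output then appears in the job's output file, which is slurm-<jobid>.out by default.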

Running NEURON tests

If you configured NEURON with CoreNEURON, CoreNEURON GPU support, and tests (-DNRN_ENABLE_TESTS=ON) all enabled, then simply running

$ ctest --output-on-failure

in your CMake build directory will execute a large number of tests, many of which include GPU execution. You can filter which tests are run by name using CTest's -R option, for example:

$ ctest --output-on-failure -R gpu
Test project /path/to/your/build
      Start  42: coreneuron_modtests::direct_py_gpu
 1/53 Test  #42: coreneuron_modtests::direct_py_gpu .............................   Passed    1.98 sec
      Start  43: coreneuron_modtests::direct_hoc_gpu
 2/53 Test  #43: coreneuron_modtests::direct_hoc_gpu ............................   Passed    1.03 sec
      Start  44: coreneuron_modtests::spikes_py_gpu
 ...
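The following sketch combines configuring, building, and running only the GPU tests. The CMake option names NRN_ENABLE_CORENEURON and CORENRN_ENABLE_GPU, and the NVIDIA HPC SDK compilers nvc and nvc++, are assumptions here. The Getting CoreNEURON section is the authoritative reference for the build options. Both function names are hypothetical.

```shell
# Source this file in bash to define the helpers below.

# Hypothetical helper: configure and build NEURON with CoreNEURON, GPU
# support and tests enabled. Assumes nvc/nvc++ (NVIDIA HPC SDK) are on PATH.
configure_and_build_gpu() {
  local src_dir=$1 build_dir=$2
  cmake -S "$src_dir" -B "$build_dir" \
    -DNRN_ENABLE_CORENEURON=ON \
    -DCORENRN_ENABLE_GPU=ON \
    -DNRN_ENABLE_TESTS=ON \
    -DCMAKE_C_COMPILER=nvc \
    -DCMAKE_CXX_COMPILER=nvc++ &&
    cmake --build "$build_dir" --parallel
}

# Hypothetical helper: run only the tests whose names match "gpu".
run_gpu_ctests() {
  local build_dir=$1
  # CMake writes CTestTestfile.cmake into the build tree when testing is enabled.
  if [ ! -f "$build_dir/CTestTestfile.cmake" ]; then
    echo "error: $build_dir does not look like a build with tests enabled" >&2
    return 1
  fi
  (cd "$build_dir" && ctest --output-on-failure -R gpu)
}
```

If you add -N (--show-only) to the ctest command, it lists the matching tests without running them. This is a quick way to check that a filter selects the tests you expect.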

Running tests manually

It is sometimes convenient to run basic tests outside the CTest infrastructure. A particularly useful test case is the ringtest included in the CoreNEURON repository. Its binary input data files for CoreNEURON are committed to the repository, so you can run the test without NEURON, Python, HOC, and friends. The mechanisms it needs are also compiled as part of the standard NEURON build. To run this test on the CPU, run the following from your build directory:

$ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring
...

where it is assumed that .. is the source directory. To enable GPU execution, add the --gpu option:

$ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
...
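Comparing the two runs by eye is error-prone, so you may want to script the check. The sketch below is a hypothetical bash helper, not part of NEURON. It assumes the build layout and relative paths used above, including the x86_64 directory, which depends on your CPU architecture. It treats the lines starting with "Number of spikes" as the statistics to compare.

```shell
# Source this file in bash, then run compare_ring_cpu_gpu from the build directory.

# Keep only the end-of-run spike statistics printed by special-core,
# e.g. "Number of spikes: 37".
spike_stats() {
  grep '^[[:space:]]*Number of spikes' || true
}

# Hypothetical helper: run the ringtest on CPU and on GPU and compare the
# spike statistics. Paths follow the commands above; .. is assumed to be
# the source directory, and x86_64 may differ on other architectures.
compare_ring_cpu_gpu() {
  local exe=./bin/x86_64/special-core
  local data=../external/coreneuron/tests/integration/ring
  local cpu gpu
  if [ ! -x "$exe" ]; then
    echo "error: $exe not found; run this from the build directory" >&2
    return 1
  fi
  cpu=$("$exe" -d "$data" 2>&1 | spike_stats)
  gpu=$("$exe" -d "$data" --gpu 2>&1 | spike_stats)
  if [ -z "$cpu" ]; then
    echo "error: no spike statistics found in the CPU output" >&2
    return 1
  fi
  if [ "$cpu" = "$gpu" ]; then
    echo "CPU and GPU runs agree:"
    printf '%s\n' "$cpu"
  else
    echo "CPU and GPU runs differ:"
    printf 'CPU:\n%s\nGPU:\n%s\n' "$cpu" "$gpu"
    return 1
  fi
}
```

The helper returns a nonzero status on any mismatch, so you can use it directly in a CI step.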

The statistics printed at the end of the simulation should be the same as in the CPU run. It can also be useful to enable some basic profiling, for example with nsys, the command-line tool of NVIDIA's Nsight Systems:

$ nsys nvprof ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
WARNING: special-core and any of its children processes will be profiled.

Collecting data...
Info : 4 GPUs shared by 1 ranks per node
...
Number of spikes: 37
Number of spikes with non negative gid-s: 37
Processing events...
...
CUDA API Statistics:

Time(%)  Total Time (ns)  Num Calls  Average (ns)   Minimum (ns)  Maximum (ns)  StdDev (ns)             Name
-------  ---------------  ---------  -------------  ------------  ------------  -----------  --------------------------
   42.7    2,127,723,623    136,038       15,640.7         3,630    10,224,640     59,860.5  cuLaunchKernel
...

CUDA Kernel Statistics:

Time(%)  Total Time (ns)  Instances  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)                                                  Name
-------  ---------------  ---------  ------------  ------------  ------------  -----------  ----------------------------------------------------------------------------------------------------
   32.3      346,133,763      8,000      43,266.7        42,175        50,080      1,435.3  nvkernel__ZN10coreneuron18solve_interleaved1Ei_F1L653_4
   12.7      136,155,806      8,002      17,015.2         3,615     1,099,738     90,544.0  nvkernel__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L375_7
   10.4      111,258,439      8,002      13,903.8         3,199     1,314,489     73,556.3  nvkernel__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_F1L274_4
   10.1      108,647,844      8,000      13,581.0         3,391     1,274,394     70,309.4  nvkernel__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L418_10
...

This output helps confirm that compute kernels are really being launched on the GPU. Substrings such as solve_interleaved1, solve_interleaved2, nrn_cur_ and nrn_state_ in the kernel names indicate that the computationally heavy parts of the simulation are indeed running on the GPU. This test dataset is extremely small, so the simulation time is not a meaningful performance measurement here.
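You can automate this check by searching the profiler output for these substrings. A minimal bash sketch follows, in which check_gpu_kernels is a hypothetical name. The exact kernel names depend on the compiler and on whether MOD2C or NMODL was used, so the sketch only looks for substrings, not full names.

```shell
# Source this file in bash to define check_gpu_kernels.

# Read profiler output on stdin and report which of the expected CoreNEURON
# kernel-name substrings appear. Returns nonzero if any are missing.
check_gpu_kernels() {
  local report pattern missing=0
  report=$(cat)
  # solve_interleaved matches both solve_interleaved1 and solve_interleaved2.
  for pattern in solve_interleaved nrn_cur_ nrn_state_; do
    if printf '%s\n' "$report" | grep -q "$pattern"; then
      echo "found: $pattern"
    else
      echo "missing: $pattern"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, you can pipe the profiling run from above into it:

$ nsys nvprof ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu 2>&1 | check_gpu_kernels

A nonzero exit status means that at least one expected substring was not found in the profiler output.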

Note

The kernel names, which start with nvkernel__ZN10coreneuron above, are implementation details of the OpenACC or OpenMP implementation being used. They can also depend on whether MOD2C or NMODL was used to translate the MOD files. For more sophisticated profiling, use a tool such as Caliper, which can access the well-defined, human-readable names that NEURON and CoreNEURON define for these kernels.