Testing GPU functionality
This section provides information and links that help with testing CoreNEURON’s GPU support. Other sections of the documentation that may be relevant are:
- The Getting CoreNEURON section, which documents both building from source with CoreNEURON support and installing Python wheels.
- The Running a simulation section, which explains the basics of porting a NEURON model to use CoreNEURON.
- The Running GPU benchmarks section, which outlines how to use profiling tools such as Caliper, NVIDIA Nsight Systems, and NVIDIA Nsight Compute.
This section aims to add some basic information about how to test whether GPU execution is working. This might be useful if, for example, you need to test a change to the GPU wheel building process, or to test GPU execution on a new system.
Accessing GPU resources
If your local system has an (NVIDIA) GPU installed then you can probably skip this section.
The nvidia-smi tool may be useful to check this; it will show the GPUs attached to a system:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2200 Off | 00000000:01:00.0 Off | N/A |
| 45% 33C P8 4W / 75W | 71MiB / 5049MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
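If you need to check for GPUs programmatically, for example in a test script, one option is to parse the CSV query output of nvidia-smi. A minimal Python sketch; the query flags are standard nvidia-smi options, while the helper names are hypothetical:

```python
import subprocess

def parse_gpu_names(csv_text):
    """Parse `nvidia-smi --query-gpu=name --format=csv,noheader` output.

    Returns a list of GPU names, one per attached device.
    """
    return [line.strip() for line in csv_text.splitlines() if line.strip()]

def detect_gpus():
    """Return the list of attached NVIDIA GPUs, or [] if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No driver/tool installed, or no GPU visible to the driver.
        return []
    return parse_gpu_names(out)
```

A test harness can then skip GPU tests cleanly when `detect_gpus()` returns an empty list instead of failing partway through.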
On a university cluster or supercomputer you will typically need to pass some kind of extra constraint to the job scheduler.
For example, on the BlueBrain5 system, which uses Slurm, you can allocate a GPU node using the volta constraint:
[login node] $ salloc -A <account> -C volta
salloc: Granted job allocation 294001
...
[compute node] $ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1A:00.0 Off | Off |
...
Running NEURON tests
If you have configured NEURON with CoreNEURON, CoreNEURON GPU support, and tests enabled (-DNRN_ENABLE_TESTS=ON), then simply running
$ ctest --output-on-failure
in your CMake build directory will execute a large number of tests, many of them including GPU execution.
You can filter which tests are run by name using the -R option to CTest, for example:
$ ctest --output-on-failure -R gpu
Test project /path/to/your/build
Start 42: coreneuron_modtests::direct_py_gpu
1/53 Test #42: coreneuron_modtests::direct_py_gpu ............................. Passed 1.98 sec
Start 43: coreneuron_modtests::direct_hoc_gpu
2/53 Test #43: coreneuron_modtests::direct_hoc_gpu ............................ Passed 1.03 sec
Start 44: coreneuron_modtests::spikes_py_gpu
...
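Relatedly, ctest -N -R gpu lists the matching tests without running them, which is useful for checking which GPU tests exist in a given build. If you want to consume that list from a script, a small parsing sketch (the sample mirrors the CTest listing format; the helper name is hypothetical):

```python
import re

# Lines in `ctest -N` output look like "  Test #42: some_test_name".
TEST_LINE = re.compile(r"^\s*Test\s+#(\d+):\s+(\S+)")

def list_test_names(ctest_n_output):
    """Extract (number, name) pairs from `ctest -N` output."""
    return [(int(m.group(1)), m.group(2))
            for line in ctest_n_output.splitlines()
            if (m := TEST_LINE.match(line))]

sample = """\
Test project /path/to/your/build
  Test #42: coreneuron_modtests::direct_py_gpu
  Test #43: coreneuron_modtests::direct_hoc_gpu
Total Tests: 2
"""
```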
Running tests manually
It is sometimes convenient to run basic tests outside the CTest infrastructure.
A particularly useful test case is the ringtest that is included in the CoreNEURON repository.
This is very convenient because binary input data files for CoreNEURON are committed to the repository (meaning that the test can be run without NEURON, Python, HOC, and friends) and the required mechanisms are compiled as part of the standard NEURON build.
To run this test on CPU you can, from your build directory, run:
$ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring
...
where it is assumed that .. is the source directory.
To enable GPU execution, add the --gpu option:
$ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
...
You should see that the statistics printed at the end of the simulation are the same as in the CPU-only run.
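One way to make this comparison more rigorous is to compare the spike output of a CPU run and a GPU run directly. A minimal Python sketch, assuming each run produced a spike file with one "time gid" pair per line (CoreNEURON's out.dat has this shape, but treat the filenames and format as assumptions):

```python
def read_spikes(path):
    """Read a spike file with one 'time gid' pair per line into a sorted list."""
    spikes = []
    with open(path) as f:
        for line in f:
            t, gid = line.split()
            spikes.append((float(t), int(gid)))
    return sorted(spikes)

def spikes_match(path_a, path_b, tol=1e-9):
    """Compare two spike files, allowing a small tolerance on spike times."""
    a, b = read_spikes(path_a), read_spikes(path_b)
    if len(a) != len(b):
        return False
    return all(abs(ta - tb) <= tol and ga == gb
               for (ta, ga), (tb, gb) in zip(a, b))
```

Sorting before comparing makes the check robust to runs that emit spikes in a different order.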
It can also be useful to enable some basic profiling, for example by using NVIDIA's Nsight Systems utility nsys:
$ nsys nvprof ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
WARNING: special-core and any of its children processes will be profiled.
Collecting data...
Info : 4 GPUs shared by 1 ranks per node
...
Number of spikes: 37
Number of spikes with non negative gid-s: 37
Processing events...
...
CUDA API Statistics:
Time(%) Total Time (ns) Num Calls Average (ns) Minimum (ns) Maximum (ns) StdDev (ns) Name
------- --------------- --------- ------------- ------------ ------------ ----------- --------------------------
42.7 2,127,723,623 136,038 15,640.7 3,630 10,224,640 59,860.5 cuLaunchKernel
...
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average (ns) Minimum (ns) Maximum (ns) StdDev (ns) Name
------- --------------- --------- ------------ ------------ ------------ ----------- ----------------------------------------------------------------------------------------------------
32.3 346,133,763 8,000 43,266.7 42,175 50,080 1,435.3 nvkernel__ZN10coreneuron18solve_interleaved1Ei_F1L653_4
12.7 136,155,806 8,002 17,015.2 3,615 1,099,738 90,544.0 nvkernel__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L375_7
10.4 111,258,439 8,002 13,903.8 3,199 1,314,489 73,556.3 nvkernel__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_F1L274_4
10.1 108,647,844 8,000 13,581.0 3,391 1,274,394 70,309.4 nvkernel__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L418_10
...
This can be helpful to confirm that compute kernels are really being launched on the GPU.
Substrings such as solve_interleaved1, solve_interleaved2, nrn_cur_ and nrn_state_ in these kernel names indicate that the computationally heavy parts of the simulation are indeed being executed on the GPU.
This test dataset is extremely small, so you should not pay much
attention to the simulation time in this case.
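This check can be automated by searching the profiler's text output for the expected kernel-name substrings. A small sketch, assuming you have captured the nsys report as text; the helper and its substring list are our own, based on the kernel names shown above:

```python
# Substrings we expect in CUDA kernel names when the heavy compute phases
# (matrix solve, current and state updates) run on the GPU.
EXPECTED_KERNELS = ("solve_interleaved", "nrn_cur_", "nrn_state_")

def missing_kernels(report_text, expected=EXPECTED_KERNELS):
    """Return the expected kernel-name substrings absent from a profiler report.

    An empty result suggests the compute kernels were launched on the GPU.
    """
    return [name for name in expected if name not in report_text]
```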
Note
The kernel names, which start with nvkernel__ZN10coreneuron above, are implementation details of the OpenACC or OpenMP implementation being used.
They can also depend on whether you use MOD2C or NMODL to translate MOD files.
If you want to do any more sophisticated profiling, you should use a profiling tool such as Caliper, which can access the well-defined, human-readable names for these kernels that NEURON and CoreNEURON define.
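That said, the names above do contain an Itanium-mangled C++ symbol between the nvkernel_ prefix and the trailing _F<n>L<line>_<m> decoration, so a quick manual option is to strip that decoration and pass the remainder to c++filt. A small sketch; the exact wrapper format is compiler-dependent, so treat the regular expression as an assumption based on the names shown above:

```python
import re

# Assumed decoration: "nvkernel_" prefix plus "_F<n>L<line>_<m>" suffix around
# the Itanium-mangled C++ symbol; this may vary between compiler versions.
WRAPPER = re.compile(r"^nvkernel_(?P<symbol>_Z\w+?)_F\d+L\d+_\d+$")

def inner_symbol(kernel_name):
    """Extract the mangled C++ symbol from an OpenACC/OpenMP kernel name.

    The result can be demangled with c++filt, e.g.
    `c++filt _ZN10coreneuron18solve_interleaved1Ei` prints
    coreneuron::solve_interleaved1(int). Returns None on no match.
    """
    m = WRAPPER.match(kernel_name)
    return m.group("symbol") if m else None
```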