topicbenchmarking [Neuronal GPU Computing Workshop 2012]

Description

The Tesla cards support memory error correction (ECC), which is on by default. ECC is known to come at a significant cost in performance. Given the surprisingly good performance of the NVIDIA GTX 580 in Andreas' benchmarks, we hypothesized that the Tesla card Andreas used had ECC turned on, and that it would be significantly faster with ECC turned off, probably faster than the GTX 580.

Results and Discussion

We used NeMo to benchmark the Tesla C2070 in Michael's lab with ECC turned on vs. ECC turned off, and compared it to a GTX 570. We used the torus benchmark from NeMo's example suite. The torus was parameterized with 30 partitions, 1000 synapses, for 1000 ms. The standard deviation for the connectivity probability was set to 64. The call to the torus binary was:

torus --benchmark -t 1000 --cuda -n 30 -m 1000 -s 64 -v 2

We measured the throughput in million synaptic events per second (MSEPs). Here are the results:

Device	Throughput (MSEPs)
Tesla C2070 (ECC on)	437
Tesla C2070 (ECC off)	468
GTX 570	402

Turning off ECC on the Tesla resulted in a speedup of about 7 % (468 vs. 437 MSEPs). The GTX 570 was approximately 14 % slower than the Tesla C2070 without ECC (402 vs. 468 MSEPs). So there is a noticeable speedup when turning off ECC.
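The percentages can be recomputed directly from the table. The following is a minimal sketch; the helper name relative_change and the dictionary layout are our own choices for illustration, and only the three throughput values come from the measurements above.

```python
# Throughput values copied from the results table
# (million synaptic events per second, MSEPs).
throughput = {
    "Tesla C2070 (ECC on)": 437.0,
    "Tesla C2070 (ECC off)": 468.0,
    "GTX 570": 402.0,
}


def relative_change(new, baseline):
    """Return the change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Speedup from disabling ECC on the Tesla.
ecc_speedup = relative_change(
    throughput["Tesla C2070 (ECC off)"], throughput["Tesla C2070 (ECC on)"]
)

# GTX 570 relative to the Tesla without ECC (negative means slower).
gtx_vs_tesla = relative_change(
    throughput["GTX 570"], throughput["Tesla C2070 (ECC off)"]
)

print(f"ECC off vs. ECC on:          {ecc_speedup:+.1f} %")
print(f"GTX 570 vs. Tesla (ECC off): {gtx_vs_tesla:+.1f} %")
```

Note that the result depends on the baseline: relative to the GTX 570, the Tesla without ECC is about 16 % faster, while relative to the Tesla, the GTX 570 is about 14 % slower.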

Interestingly, both the GTX 570 and the Tesla were considerably slower than Andreas' GTX 580. This is likely due to the higher memory bandwidth of the GTX 580: it has a bandwidth of 192 GByte/s, while the GTX 570 has 152 GByte/s and the Tesla C2070 144 GByte/s.
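To put the bandwidth figures in proportion, the ratios can be worked out as below. This is only a sketch: the variable names are ours, and reading the ratio as an expected speedup assumes that the benchmark is purely limited by memory bandwidth, which we did not verify.

```python
# Memory bandwidth figures quoted in the text, in GByte/s.
bandwidth = {
    "GTX 580": 192.0,
    "GTX 570": 152.0,
    "Tesla C2070": 144.0,
}

# Bandwidth of the GTX 580 relative to each card.
# Under the (unverified) assumption of a purely bandwidth-bound workload,
# this ratio would be a rough estimate of the expected throughput advantage.
reference = bandwidth["GTX 580"]
ratios = {card: reference / value for card, value in bandwidth.items()}

for card, ratio in ratios.items():
    print(f"GTX 580 bandwidth / {card} bandwidth = {ratio:.2f}")
```

By this estimate the GTX 580 has roughly a quarter more bandwidth than the GTX 570 and a third more than the Tesla C2070.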

A very important factor when comparing performance between consumer-grade cards and the Tesla cards is that consumer cards only support single-precision floating-point operations directly. Double-precision calculations can be carried out, but take four cycles (instead of one cycle on the Tesla). This restriction is introduced artificially in the driver; the hardware of the consumer cards would support double-precision arithmetic. NeMo uses single-precision calculations (and also a fair amount of fixed-point arithmetic), so it doesn't fully exploit the Tesla's potential. A simulator that uses double precision would likely be much slower on the consumer card than on the Tesla.

Outlook

It would be desirable to benchmark the cards with a program that does double precision computations. GeNN is a candidate, although in its current version only variables internal to the neuron models are double precision. Synapses (that is, weights) are single-precision floats.
