From: Thomas Albers (talbers_at_binghamton.edu)
Date: Sun Apr 29 2012 - 08:33:54 CDT
Hello!
On Fri, Apr 27, 2012 at 3:10 PM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>
> The 2.9 CUDA version is optimized for smp/multicore builds and in general
> the GPU runs more efficiently with a single context. I think the effect you
> are seeing is due to a fortuitous staggering of processors that improves
> overlap, particularly for constant volume simulations. In any case, I would
> suggest trying an smp binary (use +p24 ++ppn 3) and you can always recover
> an approximation of the old behavior with +devices.
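If I follow, on our eight quad-core nodes the suggested launch would look
roughly like this (the nodelist and config file names here are placeholders):

  charmrun ++nodelist nodelist +p24 ++ppn 3 ./namd2 f1atpase.namd

i.e. one process per node with three worker threads, leaving the fourth core
for the communication thread.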
Under those conditions we see on our equipment:
Benchmark time: 24 CPUs 0.0440255 s/step
It seems that with only four cores per node we cannot spare one for the
communication thread: ++ppn 3 leaves just three worker threads per node, and
the run ends up slower than NAMD 2.8 with all four cores doing work.
Thomas
>>> We have a cluster consisting of 8 AMD Phenom II X4 computers, each with a GTX
>>> 460 video card, linked with SDR InfiniBand,
>>
>> ..
>> Some timing results, all with the F1ATPase benchmark:
>>>
>>> NAMD 2.9b2, compiled w/ gcc 4.5.3, 32 cores: 0.065 s/step
>>> NAMD 2.8 Linux-x86_64-ibverbs-CUDA, 32 cores: 0.039 s/step
>>>
>>> be affected. It's only the CUDA version of NAMD 2.9 that shows this
>>> odd scaling behavior. What is going on?
>>
>>
>> What went on is that the method of assigning threads to GPUs changed
>> between NAMD 2.8 and 2.9.
>>
>> NAMD 2.9b3-ibverbs-CUDA, 32 cores, invoked with +devices 0,0,0,0: 0.039
>> s/step
>> NAMD 2.9b3-ibverbs-CUDA, 32 cores, invoked with +devices 0,0: 0.048 s/step
>> NAMD 2.9b3-ibverbs-CUDA, 32 cores, invoked with +devices 0: 0.065 s/step
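(For comparison, recovering the 2.8-style behaviour on a cluster like ours
would mean a non-SMP launch roughly like the following, again with placeholder
file names:

  charmrun ++nodelist nodelist +p32 ./namd2 +devices 0,0,0,0 f1atpase.namd

so that each of the four processes on a node opens its own context on GPU 0.)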
>>
>> I would be interested to hear from the developers what the reason for
>> this change of default behaviour is, and on what kind of hardware it
>> improves performance.
>>
>> Regards,
>> Thomas