Re: Errata to: Update to CUDA error in NAMD 2.7: Increase MAX_EXCLUSIONS: problem persists and CPU-only MD scales poorly

From: Pietro Amodeo (pamodeo_at_icmib.na.cnr.it)
Date: Tue Dec 28 2010 - 05:45:16 CST

Hi Axel,

thanks a lot for your reply.

On Tue, December 28, 2010, 4:46 am, Axel Kohlmeyer wrote:
> hi pietro,
>
> On Mon, Dec 27, 2010 at 3:20 PM, Pietro Amodeo <pamodeo_at_icmib.na.cnr.it>
> wrote:
>> Hi,
>>
>> sorry but the table in my last post is wrong:
>> 1) obviously, the reported ratio is Time(1)/Time(N) and NOT
>> Time(N)/Time(1)!!!!
>> 2) the correct figures are:
>> N     Time(1)/Time(N)
>>  1    1
>>  2    1.9733511924
>>  4    3.5960034869
>>  6    5.1641581203
>>  8    6.5367137981
>> 10   8.0500773076
>> 12   9.1171710303
>>
>> 16   8.8086727989
>>
>> 20   9.6037249284
>> 22  10.103089676
>> 24  10.6848376171
>
> those timings are fairly good.
> i don't know what you are complaining about.
>
> you really have only 12 physical CPU cores on
> your machine and about 10-15% extra speed from
> hyper-threading is quite typical for this kind of setup.
>
> the fact that you don't get perfect scaling can be easily
> explained by two reasons: memory bandwidth contention
> overall and lack of processor affinity that makes the
> contention worse.
>
> memory contention gets worse the larger the system is,
> as that makes CPU caches less efficient. overall, the
> topology and size of the caches also have an impact on
> performance and scaling.

I was aware that the speedup from HT might be poor, and the absolute times
were also quite good. What I was complaining about was only the scaling
from 1 to 12 cores, especially compared with results obtained on dual-CPU
nodes with 8-core Opteron processors, where scaling was almost ideal up to
16 cores (>15x). However, as you clearly explained, memory bandwidth
contention and/or lack of processor affinity may easily account for the
worse scaling, especially since, as I wrote in my last email, the
benchmarks on the Opteron cluster were run on a smaller protein+membrane
system.
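
To put numbers on the scaling discussed above, here is a minimal Python
sketch that turns the corrected Time(1)/Time(N) table into parallel
efficiency and a Karp-Flatt serial-fraction estimate. The speedups are
copied from the table; the function names and the PHYSICAL_CORES constant
(taken from Axel's remark about 12 physical cores) are illustrative
assumptions, not part of NAMD.

```python
# Speedups Time(1)/Time(N) copied from the corrected table in this thread.
speedup = {
    1: 1.0,
    2: 1.9733511924,
    4: 3.5960034869,
    6: 5.1641581203,
    8: 6.5367137981,
    10: 8.0500773076,
    12: 9.1171710303,
    16: 8.8086727989,
    20: 9.6037249284,
    22: 10.103089676,
    24: 10.6848376171,
}

# Assumption: 12 physical cores; larger N relies on hyper-threading.
PHYSICAL_CORES = 12


def efficiency(n, s):
    """Parallel efficiency: achieved speedup as a fraction of ideal (n)."""
    return s / n


def karp_flatt(n, s):
    """Karp-Flatt metric: experimentally determined serial fraction (n > 1)."""
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n)


def ht_gain(table, cores=PHYSICAL_CORES):
    """Relative extra speed from running 2*cores threads instead of cores."""
    return table[2 * cores] / table[cores] - 1.0


def report(table, cores=PHYSICAL_CORES):
    """Return rows (N, speedup, efficiency, serial fraction or None)."""
    rows = []
    for n, s in sorted(table.items()):
        # The serial-fraction estimate only makes sense on physical cores.
        frac = karp_flatt(n, s) if 1 < n <= cores else None
        rows.append((n, s, efficiency(n, s), frac))
    return rows


if __name__ == "__main__":
    print(f"{'N':>3} {'speedup':>8} {'eff':>6} {'serial':>7}")
    for n, s, eff, frac in report(speedup):
        frac_txt = f"{frac:7.4f}" if frac is not None else "      -"
        print(f"{n:>3} {s:8.3f} {eff:6.3f} {frac_txt}")
    print(f"HT gain (24 vs 12 threads): {100 * ht_gain(speedup):.1f}%")
```

For this table the efficiency at 12 cores comes out at roughly 76%, against
more than 93% implied by the >15x on 16 Opteron cores, and the step from 12
to 24 threads adds roughly 17%. A serial fraction that grows with N would
point to contention or overhead rather than a fixed serial part.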

>
> as for your CUDA version problem: that looks like a compile-time
> issue. you'll have to examine the source code and see
> if you can adjust the mentioned parameter. on GPUs the
> memory (and cache) architecture is different from CPUs and
> sometimes one has to choose what works well for most
> typical cases and require a recompilation with changed
> parameters otherwise. due to continued improvements in the CUDA
> programming interface and the CUDA drivers, this situation
> will improve in the future (e.g. with JIT compilation and selection
> of kernels suitable for specific needs).

For the CUDA problem, I was wondering whether the MAX_EXCLUSIONS parameter
can simply be increased, or whether architecture/CUDA issues limit its
upper value. Any info about dependencies involving this parameter would
also be useful. My last question concerned the origin of the error, i.e.
whether it depends just on overall system size or rather on a combination
of atom/molecule numbers/sizes. I'll try to work out the answers from the
code.
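
As a starting point for that search, here is a minimal Python sketch that
lists every line in a source tree mentioning a given macro and flags the
#define lines. Nothing here is NAMD-specific: the helper names, the list of
file extensions, and the default root directory are assumptions for
illustration, and the script only shows where the parameter is defined and
used, not which values are safe.

```python
import os
import re
import sys

# Assumption: extensions typically used for C/C++/CUDA sources and headers.
SOURCE_EXTS = (".h", ".C", ".c", ".cpp", ".cu")


def find_macro(root, name, exts=SOURCE_EXTS):
    """Return (path, line_number, text) for each line that mentions `name`."""
    # Word boundaries avoid matching longer identifiers such as NAME_X.
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in sorted(files):
            if not fname.endswith(exts):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.rstrip()))
    return hits


def is_definition(text, name):
    """True if `text` is a preprocessor #define of `name`."""
    pattern = r"\s*#\s*define\s+" + re.escape(name) + r"\b"
    return re.match(pattern, text) is not None


if __name__ == "__main__":
    # Usage: python find_macro.py [source_root] [macro_name]
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    name = sys.argv[2] if len(sys.argv) > 2 else "MAX_EXCLUSIONS"
    for path, lineno, text in find_macro(root, name):
        tag = "DEF" if is_definition(text, name) else "use"
        print(f"{tag} {path}:{lineno}: {text.strip()}")
```

The "use" lines are the ones to read closely: array declarations or
fixed-size buffers dimensioned with the macro would show whether raising it
runs into a hardware or memory limit.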

cheers,
Pietro

-- 
Dr. Pietro Amodeo, Ph.D.
Istituto di Chimica Biomolecolare del CNR
Comprensorio "A. Olivetti", Edificio 70
Via Campi Flegrei 34
I-80078 Pozzuoli (Napoli) - Italy
Phone      +39-0818675072
Fax        +39-0818041770
Email    pamodeo_at_icmib.na.cnr.it
