AW: AW: 2CPU+1GPU vs 1CPU+2GPU

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Feb 15 2012 - 01:45:32 CST

Next message: Nicholas M Glykos: "Re: AW: AW: 2CPU+1GPU vs 1CPU+2GPU"
Previous message: Wanzhi Qiu: "Graphene pdb/psf files cannot be read by psfgen"
In reply to: Axel Kohlmeyer: "Re: AW: 2CPU+1GPU vs 1CPU+2GPU"
Next in thread: Nicholas M Glykos: "Re: AW: AW: 2CPU+1GPU vs 1CPU+2GPU"
Reply: Nicholas M Glykos: "Re: AW: AW: 2CPU+1GPU vs 1CPU+2GPU"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Axel Kohlmeyer
> Sent: Tuesday, 14 February 2012 15:20
> To: Norman Geist
> Cc: Nicholas M Glykos; Namd Mailing List
> Subject: Re: AW: namd-l: 2CPU+1GPU vs 1CPU+2GPU
>
> On Tue, Feb 14, 2012 at 7:14 AM, Norman Geist
> <norman.geist_at_uni-greifswald.de> wrote:
> > Hi Nicholas,
> >
> > yes, that's what I tried to point out. Nobody seems to really know. But most
> > people use ECC on CPUs in HPC also, there must be a reason for that. In case
>
> this is because there are no alternatives. there is no market
> for data center hardware without. please keep in mind, that
> the bulk of the data center hardware is sold not to researchers,
> but to companies and they *do* need as perfect reliability
> as they can afford.

One can buy most of these machines without ECC, too. And don't researchers
need reliability as well?

>
> > of NAMD I would guess it's a question of system size following your
> > explanations. For really small systems it could matter then, a bigger one
> > can offset it better. ECC corrects flipped bits, and in the binary system
> > such flips can cause little or dramatic change. The question is, can they
> > change results. I think it's a difference if a number is 00000001 (1) or
> > 10000001 (129), to only show a byte number, and these flips can occur
> > everywhere in the RAM, also on forcefield parameters that are stored there.
>
> i congratulate you on being a subject to the recent FUD
> tactics of computer sales droids (and indirectly of nvidia,
> since most of those guys don't know what they are talking
> about and just quote what they get told).
>
> yes, bitflips can have a dramatic effect, but i consider them
> negligible compared to all the *systematic* errors that you
> are including in your calculations without worrying. e.g.
> there are the truncation errors through cutoffs, there are
> the errors through using a multiple time step integrator,
> and using discrete time steps in the first place, and with
> using GPUs, you have a more significant truncation error
> through using single precision math on the GPU.
> compared to that, the worry about bitflips is small change.

I would expect completely random errors to be more insidious than
truncation errors. But maybe that matters more in QM.
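
To make the byte example quoted above concrete, here is a minimal Python sketch. It is not from the thread: the helper names flip_bit_byte and flip_bit_float32 are made up for illustration, and the choice of 1.0 as the test value is an assumption. It inverts single bits in a byte and in an IEEE 754 single-precision number, which is the format the GPU math discussed here would use, and shows that the size of the damage depends entirely on which bit is hit.

```python
import struct


def flip_bit_byte(value: int, bit: int) -> int:
    """Invert one bit (0 = least significant) of an 8-bit unsigned value."""
    return (value ^ (1 << bit)) & 0xFF


def flip_bit_float32(value: float, bit: int) -> float:
    """Invert one bit of value as stored in IEEE 754 single precision.

    Bits 0-22 are the mantissa, bits 23-30 the exponent, bit 31 the sign.
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    # The byte example from the quoted mail: 00000001 becomes 10000001.
    print(f"{flip_bit_byte(0b00000001, 7):08b}")

    # The same kind of flip applied to the float32 value 1.0.
    for bit in (0, 22, 23, 30, 31):
        print(bit, flip_bit_float32(1.0, bit))
```

A flip in the lowest mantissa bit changes 1.0 by about 1e-7, which is the same order as ordinary single-precision rounding, while a flip in the exponent or sign bits changes the value by a factor of two, flips its sign, or turns it into infinity. The sketch says nothing about how often such flips occur, which is the actual point of disagreement in this thread.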

>
> on top of that, in my personal experience (and i do run
> a machine with 32 Tesla C2050s, and am and have been using
> several GeForce and also older C1060 and S1070 GPUs)
> the typical scenario is that either you have very, very rare
> random bit flips (i have not seen an ECC error flagged on
> our hardware for a very long time) or you have a damaged
> memory cell and *that* you can catch quickly, e.g. by
> running cuda_memtest for one iteration (which is what
> people did with older hardware).

As I already mentioned, I didn't investigate the occurrence of those errors.
They could also have arisen when NAMD zombie processes did some strange stuff
while testing. But I can see that there were errors, which one cannot do with
consumer hardware.

I think this "ECC, yes or no" is like "Windows or Linux". There are
advantages and disadvantages. One needs to decide.

;)

>
> however, to run GPUs reliably like this, you need proper
> cooling and *there* is the biggest risk in my personal
> opinion, since by default GPUs are using a fan control
> regimen that keeps the noise down, but not the heat.
>
> i have come up with a little hack, that can operate our
> GPUs at over 20C lower core temperature and thus
> massively increases the reliability.
> http://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
>
> now let's move on to the question at hand
> of the original poster.
>
> i.e. purchasing a small amount of workstation
> type hardware. first of all, i am surprised to
> see an M-series GPU offered. those are
> passively cooled and thus can only be used
> in a properly certified case with proper ventilation
> and in a temperature controlled environment.
> so i would worry, if you can operate that machine
> at all. a C-series or a GeForce type GPU sounds
> more suitable to me. for a workstation, in the current
> situation, i would go for a couple of 3GB GeForce
> GTX 580 and just buy a spare GPU on the side for
> each box right away and then use the money saved
> to crank up the RAM in the machine, so you can
> use them for analysis as well. it is the best bang
> for the buck. just don't take a vendor overclocked
> version and don't take the cheapest variant. those
> cards have such an advantage in speed over any
> Tesla based offering that it is worth it, and you are
> not running hundreds of them, so that any higher
> probability of hardware issues would cost you
> too much time.
>
> if, however, you consider this too much of a risk,
> then i would recommend to not purchase a GPU
> at all, but get a 4-way 8-core AMD Opteron
> machine (resist the temptation of going
> to higher core counts, it is not really worth the
> money). you can put together an extremely powerful
> and affordable NAMD workstation with this hardware,
> and if you want to be really stingy, you can save
> on the memory (i.e. get 32GB RAM) as well and
> perhaps get even more than the two machines
> that were mentioned. that would give you the
> advantage of not having to worry about the GPUs
> at all and not suffering from any current limitations
> that the GPU kernels in NAMD have.

Also a nice idea, NAMD speedups within a workstation are pretty good.
And for a small number of nodes, Gigabit Ethernet is quite sufficient, whereas
GPU nodes would already need expensive InfiniBand or 10 Gbit/s Ethernet.

>
> cheers,
> axel.
>
> > I don't know the project you mentioned, but if it is distributed computing,
> > I would have implemented an error correction there (in the simplest way,
> > double computation on different nodes), as they surely did too, because it
> > can be manipulated.
> >
> > Feel free to correct me.
> >
> > Cheers
> >
> > Norman Geist.
> >
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Nicholas M Glykos
> >> Sent: Tuesday, 14 February 2012 11:04
> >> To: Marcel UJI (IMAP)
> >> Cc: Norman Geist; namd-l_at_ks.uiuc.edu
> >> Subject: Re: AW: namd-l: 2CPU+1GPU vs 1CPU+2GPU
> >>
> >>
> >> Dear Marcel, Norman, List,
> >>
> >> I'll play devil's advocate, bear with me. Measuring (and demonstrating)
> >> memory errors with memtest does nothing to answer the important
> >> question: Do these errors change the average long-term behaviour (and
> >> derived quantities) from the simulations, or do they just add (as white
> >> noise) another source of chaotic behaviour in an already chaotic system?
> >> I would argue that if the memory errors are truly random, then they
> >> cannot be correlated with the aim of any given simulation and, thus,
> >> cannot be held responsible for things working out "incredibly great" or
> >> otherwise. If I were to offer an example in support of this thesis, I
> >> would probably quote the results obtained on folding simulations by the
> >> Shaw group (the Science 2010 paper) using the Anton machine, which to my
> >> knowledge (please do correct me if I'm wrong) does not use ECC memory.
> >> Although I'm not advocating the incorporation of avoidable errors in
> >> calculations, I do feel that solid evidence for the effect of these
> >> errors on the MD-derived quantities is missing.
> >>
> >> My twocents,
> >> Nicholas
> >>
> >>
> >>
> >> On Tue, 14 Feb 2012, Marcel UJI (IMAP) wrote:
> >>
> >> > Yes I have found other sources with similar results (see
> >> > http://www.cs.stanford.edu/people/ihaque/talks/resilience-2010.pdf), so
> >> > I think I will finally go for those Tesla cards.
> >> >
> >> > Thank you all for your help!
> >> >
> >> > Marcel
> >> >
> >> > On 14/02/12 08:18, Norman Geist wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > >
> >> > >
> >> > > I just wanted to add that I was pretty surprised when I first saw the
> >> > > ECC error counters on my Tesla C2050. Well, in fact it's the total of
> >> > > double-bit errors and I never investigated their occurrence, but I
> >> > > would only go without ECC with some stomach ache, because everything
> >> > > that doesn't work or behaves strangely in your simulations, or even
> >> > > what works incredibly great, can be due to artifacts of memory errors.
> >> > > That might sound a little overdone, but it is possible. For what else,
> >> > > except reliability, has ECC been developed? I'm really not sure what
> >> > > influence those errors can really have, but with ECC you have one
> >> > > thing less to check when problems occur.
> >> > >
> >> > >
> >> > >
> >> > > Best wishes
> >> > >
> >> > >
> >> > >
> >> > > Norman Geist.
> >> > >
> >> > >
> >> > >
> >> > > *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> >> > > *On Behalf Of* Ajasja Ljubetic
> >> > > *Sent:* Monday, 13 February 2012 16:09
> >> > > *Cc:* Marcel UJI (IMAP); namd-l_at_ks.uiuc.edu
> >> > > *Subject:* Re: namd-l: 2CPU+1GPU vs 1CPU+2GPU
> >> > >
> >> > >
> >> > >
> >> > > One final thing. I've done some benchmarking with an AMD 6-core
> >> > > desktop and a GTX-570 and it ends up being about equal to (slightly
> >> > > faster than) a 6-core Xeon with an M2070. You can buy a 3GB
> >> > > GTX580 for a fraction of the price of an M-series card, and an AMD
> >> > > CPU (particularly the 3 GHz 6-core Thubans) will be close to half
> >> > > the price of the Intel. While I'm sure the Intel chip is
> >> > > generally superior to the AMD one, it doesn't seem to be a factor
> >> > > when running NAMD. So I would say buy two desktops and save
> >> > > yourself money and also gain performance. I know there is the
> >> > > lack of ECC memory with the GTX series, but I'm really not
> >> > > convinced that is a big issue for MD (maybe someone on the list
> >> > > has a different opinion).
> >> > >
> >> > >
> >> > >
> >> > > I've been running my simulations on several GTX 560 Ti cards for half
> >> > > a year now and it works great! So I would back up this advice.
> >> > >
> >> > >
> >> > >
> >> > > Best regards,
> >> > >
> >> > > Ajasja
> >> > >
> >> > >
> >> > >
> >> > > ~Aron
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Feb 13, 2012 at 6:44 AM, Nicholas M Glykos
> >> > > <glykos_at_mbg.duth.gr> wrote:
> >> > >
> >> > >
> >> > >
> >> > > You will (hopefully) hear from Axel on this, but :
> >> > >
> >> > >
> >> > > > as it would give more speed for our NAMD based simulations
> >> > >
> >> > > Is this an assumption or the result of benchmarking the two hardware
> >> > > configurations with your intended system sizes? For small (atom-wise)
> >> > > systems, you shouldn't expect much improvement by increasing the
> >> > > number of GPUs (and for tiny systems the 1CPU+2GPU may not scale at
> >> > > all).
> >> > >
> >> > > My twocents,
> >> > > Nicholas
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > >
> >> > > Nicholas M. Glykos, Department of Molecular Biology
> >> > > and Genetics, Democritus University of Thrace, University Campus,
> >> > > Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
> >> > > Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Aron Broom M.Sc
> >> > > PhD Student
> >> > > Department of Chemistry
> >> > > University of Waterloo
> >> > >
> >> > >
> >> > >
> >> >
> >> >
> >>
> >> --
> >>
> >>
> >> Nicholas M. Glykos, Department of Molecular Biology
> >> and Genetics, Democritus University of Thrace, University Campus,
> >> Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
> >> Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
> >
> >
>
>
>
> --
> Dr. Axel Kohlmeyer
> akohlmey_at_gmail.com http://goo.gl/1wk0
>
> College of Science and Technology
> Temple University, Philadelphia PA, USA.


This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:12 CST