Re: NAMD job dies on 2-quad core server

From: Gengbin Zheng (gzheng_at_illinois.edu)
Date: Thu Apr 09 2009 - 23:14:31 CDT

"NAMD_2.7b1_Linux-x86" is based on UDP, while "NAMD_2.7b1_Linux-x86-TCP"
is based on reliable TCP network protocol. That is the only difference.

Try the binary "NAMD_2.7b1_Linux-x86" with +netpoll runtime option and
see if that helps. (this basically disable the interrupt and poll the
network aggressively)
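
For example, assuming the same launch line quoted further down in this
thread (8 local processes, config.txt as the input file), the runtime
option would simply be appended to the namd2 arguments:

  ./charmrun ++local +p 8 ./namd2 +netpoll config.txt > config.log &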

Gengbin

vivek.viv.sharma_at_gmail.com wrote:
> Hello Gengbin,
>
> Thanks for your reply.
>
> NAMD now seems to be working fine; all I did was use the
> "NAMD_2.7b1_Linux-x86-TCP" binary. In my upgrade from NAMD 2.6 to
> NAMD 2.7b1, I first tried the other binary, "NAMD_2.7b1_Linux-x86",
> but it showed the same behaviour as NAMD 2.6. I do not know what
> exactly the differences are between "NAMD_2.7b1_Linux-x86" and
> "NAMD_2.7b1_Linux-x86-TCP". Would you please say a few words about
> the difference between the two?
>
> thanks again,
>
> Vivek
>
> On Apr 9, 2009 8:52pm, Gengbin Zheng <gzheng_at_illinois.edu> wrote:
> >
> > Vivek,
> >
> > When NAMD is busy doing communication (sending messages) or doing
> > load balancing, it may appear idle, or only one processor may show
> > as busy for a short period of time. Also check whether your job is
> > running close to the memory capacity (you can see that from "top").
> > The operating system may be busy swapping your NAMD job to/from
> > disk, which also causes idle time.
> >
> > Gengbin
> >
> > vivek.viv.sharma_at_gmail.com wrote:
> >
> > Hello Axel and all,
> >
> > It was my mistake to say in my previous post that the 'job dies'.
> > That statement was in fact based on watching the 'top' command on
> > the console. The job is running fine on all 8 processors. It is not
> > dying, but every now and then 'top' shows that none of the
> > processors is being used by NAMD. The NAMD process seems to go into
> > the background and then comes back running again with all 8
> > processors in use, as observed in 'top'. When I check with 'ps -e',
> > all the NAMD jobs are there. Can anyone please shed some light on
> > why this behaviour is observed, with NAMD jobs appearing, going into
> > the background, and re-appearing in top?
> >
> > Also, am I right in thinking that running jobs this way will take
> > more time than it should? I assume here that when 'top' does not
> > show NAMD running, the simulation is not running (I could be wrong,
> > though; something else might be being done during this time).
> >
> > Axel, thanks for your points; I should have observed more before
> > posting. Secondly, I really admire the beauty of this simple NAMD
> > command, which can be used to run the simulation without much
> > installation work.
> >
> > thanks and regards,
> >
> > Vivek
> >
> > On Apr 6, 2009 7:14pm, Axel Kohlmeyer <akohlmey_at_cmm.chem.upenn.edu>
> > wrote:
> > >
> > > On Mon, 2009-04-06 at 04:45 +0000, vivek.viv.sharma_at_gmail.com wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > We have recently bought a machine with the following configuration:
> > > >
> > > > 2 quad core processors each with 2.33GHz clock rate.
> > > > 8 GB RAM
> > > > 500GB total hard disk
> > > >
> > > > I have simply used the "NAMD_2.6_Linux-i686" binaries and started
> > > > the simulation (membrane protein with membrane, water, ions). The
> > > > simulation starts fine with the command
> > > >
> > > > ./charmrun ++local +p 8 ./namd2 config.txt > config.log &
> > > >
> > > > But after 4390 steps the job dies, without giving any error message.
> > > > Would you please suggest what is happening? Do I need to install
> > >
> > > how should anybody know??? does your input run fine elsewhere? have
> > > you looked at the trajectory? have you looked at the machine logs?
> > > does your os have restrictive limits for interactive use or stack
> > > memory? can you run the same job with fewer processors? how is the
> > > CPU temperature? is the crash reproducible? ...
> > >
> > > this list can go on for much longer. so please keep in mind that
> > > the kind of suggestion you can receive from a mailing list is
> > > directly proportional to the kind and quality of information
> > > you provide. in your case, you just say "it doesn't work", and
> > > only for one specific configuration. that is _very_ little.
> > >
> > > > it (NAMD) from scratch?
> > >
> > > why? first you have to find out what happens.
> > > blind activism never helps!
> > >
> > > > NAMD in the log file clearly shows:
> > > >
> > > > >> Info: Running on 8 processors.
> > > >
> > > > I observed from 'top' that the simulation indeed runs on all 8
> > > > processors, using all of them more or less efficiently.
> > >
> > > there has to be more output, and i am pretty certain that there is
> > > some output that indicates what is going wrong.
> > >
> > > > All your suggestions will be very helpful.
> > >
> > > well, you got a ton of them already. the most important
> > > one is to include more relevant information. there are
> > > many, many cases on this mailing list where people ask
> > > for help with problems, and you can easily derive from
> > > the dialog what information is needed, what _you_
> > > can do beforehand to verify that you are seeing a real
> > > problem, and what information is needed to narrow it down.
> > >
> > > cheers,
> > > axel.
> > >
> > > > thanks and regards,
> > > >
> > > > Vivek
> > > > IMM, India.
> > >
> > > --
> > > =======================================================================
> > > Axel Kohlmeyer akohlmey_at_cmm.chem.upenn.edu http://www.cmm.upenn.edu
> > > Center for Molecular Modeling -- University of Pennsylvania
> > > Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> > > tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> > > =======================================================================
> > > If you make something idiot-proof, the universe creates a better idiot.
