Re: Re: Is clock skew a problem for charm++

From: Sameer Kumar (skumar2_at_cs.uiuc.edu)
Date: Thu Jun 22 2006 - 08:04:11 CDT

Hi Jan,

        If you run the Charm++ test suite megatest, it should be able to
track down data corruption on the network.

        cd charm/tests/charm++/megatest/
        # and then run megatest on the same number of PEs as NAMD, for example:
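
        A rough sketch (the make target and the launcher are assumptions;
        use charmrun for a net build or mpirun for an MPI build, matching
        how your charm was compiled):

        make pgm                                      # build the megatest binary ./pgm
        # MPI build (mpi-linux-gcc):
        mpirun -np 12 -machinefile ~/machines ./pgm
        # net build (net-linux):
        ./charmrun +p12 ./pgm
        # every test should report that it completed; an error or a hang
        # points to messages being corrupted in transit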

Also, what system (molecule) are you running? Have you tried running the
standard ApoA1 benchmark? We could eliminate a few more possibilities
that way.
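
For reference, a typical ApoA1 run would look roughly like the sketch below
(the apoa1/ directory layout is an assumption based on the standard benchmark
download, and the launcher again depends on how NAMD was built):

        # MPI build:
        mpirun -np 12 -machinefile ~/machines ~/bin/namd2 apoa1/apoa1.namd > apoa1.out
        # net build (needs a nodelist file for charmrun):
        charmrun +p12 ~/bin/namd2 apoa1/apoa1.namd > apoa1.out
        # then compare the "Benchmark time:" lines for 1 node vs. several nodes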

                                                sameer.

On Thu, 22 Jun 2006, Jan Saam wrote:

> Hi Gengbin,
>
> thanks for your comments. I used MPICH. As mentioned in one of the other
> mails, the problem is not specific to the MPI version and also occurs in
> the prebuilt versions.
>
> I think we need to find out whether some data corruption is occurring.
> My assumption is that the master simply doesn't get what it
> expects. Since it's not segfaulting, the data format might be correct, but
> maybe the results from different nodes get mixed up?
>
> Is there some kind of test for that?
>
> Jan
>
>
> Gengbin Zheng wrote:
> > Jan Saam wrote:
> >
> >> Hi Gengbin and other listeners,
> >>
> >> the weird story goes on. From the documentation and older threads on
> >> namd-l I conclude that the MPI and net versions of NAMD are equivalent
> >> from the user's point of view, with possibly minor performance
> >> differences depending on the architecture.
> >> I tried both versions and found that both suffer from the same
> >> performance problems mentioned earlier. On top of that, the MPI version
> >> is 3 times slower.
> >>
> >>
> >
> > The MPI version may be slower than the net version, but I don't think a
> > factor of 3 is typical. There may be a problem with your MPI installation.
> >
> >> One thing I noticed about the MPI version of NAMD is that it spawns two
> >> processes on each node (while only one appears to consume significant
> >> resources according to "top"):
> >>
> >> jan 22216 22214  0 21:02 ? 00:00:00 /usr/sbin/sshd
> >> jan 22217 22216 81 21:02 ? 00:08:53 /home/jan/bin/namd2 BPU1 32887 -p4amslave -p4yourname BPU5 -p4rmrank 3
> >> jan 22246 22217  0 21:02 ? 00:00:00 /home/jan/bin/namd2 BPU1 32887 -p4amslave -p4yourname BPU5 -p4rmrank 3
> >>
> >> Should it be like that?
> >>
> >>
> >>
> >
> > Some MPI implementations such as LAM do spawn pthreads; that is
> > probably the reason you saw several instances in top.
> > If it is LAM, try MPICH. We have had bad luck with LAM in the past.
> >
> > Gengbin
> >
> >> I started namd on 12 processors:
> >> mpirun -v -np 12 -leave_pg -machinefile ~/machines ~/bin/namd2 eq2.namd > eq2.out &
> >>
> >> (My machinefile contains 12 valid machines)
> >>
> >> Jan
> >>
> >>
> >> Gengbin Zheng wrote:
> >>
> >>
> >>> Hi Jan,
> >>>
> >>> Clock skew may cause misleading time output, but I doubt it is the
> >>> case here (the queens program) because the time was printed from the same
> >>> processor (0).
> >>> When you ran the program, did it really take 7 minutes of wallclock time?
> >>> Also, have you tried the pingpong test from charm/tests/charm++/pingpong
> >>> to test network latency?
> >>>
> >>> Gengbin
> >>>
> >>> Jan Saam wrote:
> >>>
> >>>
> >>>> I forgot to say that I already checked that the problem is not ssh
> >>>> taking forever to make a connection.
> >>>> This is at least shown by this simple test:
> >>>> time ssh BPU5 pwd
> >>>> /home/jan
> >>>>
> >>>> real 0m0.236s
> >>>> user 0m0.050s
> >>>> sys 0m0.000s
> >>>>
> >>>> Jan
> >>>>
> >>>>
> >>>> Jan Saam wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I'm experiencing some weird performance problems with NAMD or the
> >>>>> charm++ library on a Linux cluster:
> >>>>> when I use NAMD or a simple charm++ demo program on one node
> >>>>> everything is fine, but when I use more than one node each step takes
> >>>>> _very_ much longer!
> >>>>>
> >>>>> Example:
> >>>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
> >>>>>
> >>>>> running
> >>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
> >>>>>
> >>>>>
> >>>>> on 1 LINUX ch_p4 processors
> >>>>> Created
> >>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
> >>>>>
> >>>>>
> >>>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
> >>>>> End of program
> >>>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
> >>>>> running
> >>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
> >>>>>
> >>>>>
> >>>>> on 2 LINUX ch_p4 processors
> >>>>> Created
> >>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
> >>>>>
> >>>>>
> >>>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
> >>>>> End of program
> >>>>>
> >>>>> The same is true when I build the net-linux version instead of
> >>>>> mpi-linux, so the problem is probably independent of MPI.
> >>>>>
> >>>>> One thing I noticed is that there is a clock skew of several minutes
> >>>>> between the nodes. Could that be part of my problem (unfortunately I
> >>>>> don't have the rights to simply synchronize the clocks)?
> >>>>>
> >>>>> Does anyone have an idea what the problem could be?
> >>>>>
> >>>>> Many thanks,
> >>>>> Jan
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >
>
> --
> ---------------------------
> Jan Saam
> Institute of Biochemistry
> Charite Berlin
> Monbijoustr. 2
> 10117 Berlin
> Germany
>
> +49 30 450-528-446
> saam_at_charite.de
>
