viarecv.c: assertion `rhandle->bytes_copied_to_user == rhandle->len' failed on InfiniBand

From: Rene Salmon (rsalmon_at_tulane.edu)
Date: Fri Mar 23 2007 - 17:44:59 CDT

Hello,

We have compiled NAMD-2.6 with MVAPICH to run over InfiniBand, following the
instructions on the NAMD wiki here:

http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnInfiniBand

NAMD seems to work nicely and scale on 2, 4, 8, and 16 CPUs, but once we try to
run on 32 CPUs or more we start getting this error message:

namd2: viarecv.c:724: viadev_eager_pull_pack: Assertion `rhandle->bytes_copied_to_user == rhandle->len' failed.
namd2: viarecv.c:724: viadev_eager_pull_pack: Assertion `rhandle->bytes_copied_to_user == rhandle->len' failed.
namd2: viarecv.c:724: viadev_eager_pull_pack: Assertion `rhandle->bytes_copied_to_user == rhandle->len' failed.
namd2: viarecv.c:724: viadev_eager_pull_pack: Assertion `rhandle->bytes_copied_to_user == rhandle->len' failed.
[mpirund] rank 18 has got signal 6
[mpirund] rank 19 has got signal 6
.
.
.
[mpirund] rank 28 has got signal 6
Timeout for rank 20 hostname 'compute-01-01-ib'. Job is not finalized there.
Cleaning up all processes ...
Some rank on 'compute-01-13-ib' exited without finalize.
done.

Any ideas as to what might cause this?

Thank you
Rene

-- 
        Rene Salmon
        Tulane University
        Center for Computational Science
        http://www.ccs.tulane.edu
        rsalmon_at_tulane.edu
        Tel 504-862-8393
        Fax 504-862-8392
