IA64 and MPICH and mpiexec runtime shmem errors

From: Randy Crawford (rand_at_rice.edu)
Date: Thu Aug 05 2004 - 14:38:06 CDT

When I run NAMD 2.5 on an Itanium2 cluster using MPICH 1.2.5.2 and OSC's mpiexec
(instead of mpirun), I get the following error when running a 2- or 4-process
test case:

p2_15517: (38.889341) xx_shmalloc: returning NULL; requested 65584
p2_15517: (38.889341) p4_shmalloc returning NULL; request = 65584 bytes
You can increase the amount of memory by setting the environment variable
P4_GLOBMEMSIZE (in bytes); the current size is 4194304
p2_15517: p4_error: alloc_p4_msg failed: 0
CHARMDEBUG> Processor 3 has PID 15518
CHARMDEBUG> Processor 1 has PID 13334
bm_list_13335: (39.139197) net_send: could not write to fd=5, errno =32
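
For context, my reading of the first failure is that the 65584-byte request is
not large by itself; it fails because the fixed pool that p4 carves its
allocations from (sized by P4_GLOBMEMSIZE, 4 MB by default) is already used up.
Here is a simplified sketch of that behavior -- my own toy allocator, not MPICH
code, and pool_init/pool_alloc are made-up names:

/* Toy stand-in for a fixed-pool allocator like p4's xx_shmalloc (not
 * MPICH source).  The real pool lives in SysV shared memory via
 * shmget/shmat; plain malloc stands in for it here. */
#include <stdio.h>
#include <stdlib.h>

static char  *pool;
static size_t pool_size, pool_used;

static void pool_init(size_t bytes) {
    pool = malloc(bytes);
    pool_size = bytes;
    pool_used = 0;
}

static void *pool_alloc(size_t bytes) {
    if (pool == NULL || pool_used + bytes > pool_size)
        return NULL;                    /* "xx_shmalloc: returning NULL" */
    void *p = pool + pool_used;
    pool_used += bytes;
    return p;
}

int main(void) {
    const char *env = getenv("P4_GLOBMEMSIZE");
    pool_init(env ? (size_t)atoll(env) : 4194304);   /* default is 4 MB */

    /* Message buffers pile up until the pool is gone, so even a modest
     * 65584-byte request comes back NULL. */
    size_t total = 0;
    while (pool_alloc(65584) != NULL)
        total += 65584;
    printf("pool exhausted after %zu bytes; next request -> NULL\n", total);
    return 0;
}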

After I enlarge P4_GLOBMEMSIZE from 4 MB to 2000 MB, I get:

p0_11034: p4_error: exceeding max num of P4_MAX_SYSV_SHMIDS: 256
p0_11043: (0.007812) send_message: to=1; invalid conn type=5
p0_11043: p4_error: subtree_broadcast_p4 failed, type=: 1010101010
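
My guess -- and it is only a guess -- is that p4 builds that global pool out of
multiple fixed-size SysV segments, so a very large P4_GLOBMEMSIZE needs
correspondingly many shmids. Rough arithmetic, with an assumed per-segment size
since I don't know the real p4 chunk size:

/* Back-of-the-envelope: how many SysV segments a 2000 MB pool would
 * need if p4 allocates it in fixed chunks.  seg_size is an assumption,
 * not the real p4 value. */
#include <stdio.h>

int main(void) {
    const long long globmem  = 2000LL * 1024 * 1024;  /* P4_GLOBMEMSIZE     */
    const long long seg_size = 4LL * 1024 * 1024;     /* assumed chunk size */
    const int       cap      = 256;                   /* P4_MAX_SYSV_SHMIDS */

    long long segs = (globmem + seg_size - 1) / seg_size;
    printf("segments needed: %lld, cap: %d -> %s\n",
           segs, cap, segs > cap ? "over the limit" : "within the limit");
    return 0;
}

That is why I next tried raising the cap.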

After recompiling MPICH to allow 1024 P4_MAX_SYSV_SHMIDS, I get:

p0_2109: (0.003906) send_message: to=1; invalid conn type=5
p0_2109: p4_error: subtree_broadcast_p4 failed, type=: 1010101010

I've tried two different builds of MPICH -- one that uses shmem and one that
does not. Neither can run NAMD on more than one (dual-CPU) node. I've also
tried a Myrinet build of MPICH (which likewise uses shmem within a single
node), with the same results.

The cluster runs Red Hat Enterprise Linux 2.1 with gcc 3.3.3, over Gigabit
Ethernet and Myrinet. The system limit on shared memory segment size is set to
unlimited.
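
In case it's relevant, here is how I'm checking the kernel's SysV shm limits
(simple reads under /proc; shmmax is the per-segment byte limit, shmmni the
system-wide segment count):

/* Print the kernel SysV shared memory limits from /proc. */
#include <stdio.h>

static void show(const char *path) {
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fgets(buf, sizeof buf, f) != NULL)
            printf("%-28s %s", path, buf);
        fclose(f);
    }
}

int main(void) {
    show("/proc/sys/kernel/shmmax");   /* max bytes per segment        */
    show("/proc/sys/kernel/shmmni");   /* max number of segments       */
    show("/proc/sys/kernel/shmall");   /* total shm pages, system-wide */
    return 0;
}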

Any idea what's going on? Does the problem lie with mpiexec?

Thanks,

    Randy

-- 
Randy Crawford   http://www.ruf.rice.edu/~rand   rand AT rice DOT edu
