NAMD Wiki: NamdOnIA64


Itanium, Itanium 2, etc.

See also NamdOnAltix


Nancy Tran from NCSA reported a crash at the end of the alanin example on their Linux Itanium machines. The crash turned out to be related to http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=82433 and should be fixed in the next release of glibc 2.3.


There are performance issues with the Intel 8.1 and newer compilers. The 7.x compilers work fine, and 8.0 appears to be OK as well, though I have not tested it as thoroughly.

Here are my results from the NCSA Altix (1.6 GHz Itanium 2 CPUs):

Compiler versions tested:

Version 8.0   Build 20040617 Package ID: l_cc_pc_8.0.066_pl069.1
Version 8.1    Build 20041123 Package ID: l_cc_pc_8.1.026
Version 9.0  Beta  Build 20050120 Package ID: l_cc_b_9.0.012

An old binary built with the 7.1 compiler and a new binary built with the 8.0 compiler both run the standard apoa1 benchmark at approximately 2.1 s/step. A new binary built with the 8.1 compiler runs the same benchmark at 4.9 s/step.

To diagnose (or test for) this slowdown, remove obj/ComputeNonbondedStd.o, which contains the normal inner loop functions, and rebuild it, saving the compiler output to a log file. The output will contain software pipelining diagnostics.

Search for "Software pipeliner" and you'll find comments like this:

Swp report for loop at line 594 in
_ZN20ComputeNonbondedUtil19calc_self_fullelectEP9nonbonded
in file src/ComputeNonbondedBase.h

        Resource II   =        1
        Recurrence II =        2
        Minimum II    =        2
        Last attempted II  =   3

        Estimated GCS II   =   2

        Software pipeliner estimated that it is more
        profitable to schedule the loop using the global
        acyclic scheduler than to pipeline the loop

With the 8.0 compiler these messages only show up for line 594, which isn't critical (it's a one-line for loop that patches up the exclusion checksum). The same line numbers keep showing up because the same file is re-included multiple times and compiled into different functions depending on preprocessor macros.
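For readers unfamiliar with that trick, here is a minimal sketch of the pattern; the file names, macro names, and loop body below are illustrative placeholders, not the actual NAMD source:

    // ---- KernelBase.h (illustrative): no include guard, meant to be included repeatedly ----
    void KERNEL_NAME(nonbonded* nb) {
        for (int i = 0; i < nb->n; ++i) {          // same source line in every inclusion,
            nb->f[i] += KERNEL_SCALE * nb->x[i];   // so every SWP report cites that line
        }
    }

    // ---- Kernels.C (illustrative): builds several functions from the one shared loop body ----
    struct nonbonded { const double* x; double* f; int n; };

    #define KERNEL_NAME  calc_pair
    #define KERNEL_SCALE 1.0
    #include "KernelBase.h"            // first specialization
    #undef  KERNEL_NAME
    #undef  KERNEL_SCALE

    #define KERNEL_NAME  calc_self_fullelect
    #define KERNEL_SCALE 2.0
    #include "KernelBase.h"            // second specialization, same loop, same line numbers
    #undef  KERNEL_NAME
    #undef  KERNEL_SCALE

This is why the reports keep citing the same source lines in ComputeNonbondedBase.h and ComputeNonbondedBase2.h even though they describe different generated functions.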

The critical loop is found by searching for "line 14":

Swp report for loop at line 14 in _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded
in file src/ComputeNonbondedBase2.h

        Resource II   =        22
        Recurrence II =        10
        Minimum II    =        22
        Scheduled II  =        24

        Estimated GCS II   =   86

        Percent of Resource II needed by arithmetic ops     =  55%
        Percent of Resource II needed by memory ops         =  55%
        Percent of Resource II needed by floating point ops = 100%

        Number of stages in the software pipeline =    5

That was the 8.0 compiler, and all is well (software pipelining worked and at least one resource is being used at maximum capacity). With the 8.1 and 9.0 compilers I get:

Swp report for loop at line 14 in _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded
in file src/ComputeNonbondedBase2.h

        Resource II   =        21
        Recurrence II =        21
        Minimum II    =        21
        Last attempted II  =   112

        Estimated GCS II   =   86

        Software pipeliner estimated that it is more
        profitable to schedule the loop using the global
        acyclic scheduler than to pipeline the loop

An estimated GCS II of 86 cycles per iteration is a lot more than the scheduled II of 24 from the 8.0 compiler, roughly a 3.6x difference for this loop alone, and certainly enough to account for the factor of 2.3 slowdown observed with the 8.1 compiler.

-JimPhillips


There is a small workaround that brings performance back to >90% of the icc 8.0 performance. Adding #pragma swp at line 11 in ComputeNonbondedBase2.h gives the following pipeline report:

Swp report for loop at line 15 in _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded
in file src/ComputeNonbondedBase2.h

Resource II   =        21
Recurrence II =        21
Minimum II    =        21
Scheduled II  =        112

Estimated GCS II   =   85

Percent of Resource II needed by arithmetic ops     =  57%
Percent of Resource II needed by memory ops         =  57%
Percent of Resource II needed by floating point ops = 100%

Number of stages in the software pipeline =    2
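For concreteness, here is a minimal sketch of where the directive goes; the function, variables, and loop body are hypothetical stand-ins for the real inner loop, not the actual NAMD code:

    // Illustrative placement only; the Intel icc directive applies to the loop that follows.
    void inner_loop(int npairs, const double* r2, double* f) {
    #pragma swp                      // request software pipelining of the next loop
        for (int k = 0; k < npairs; ++k) {
            f[k] += 1.0 / r2[k];     // stand-in for the nonbonded force evaluation
        }
    }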

The performance measured on 64 CPUs (16 nodes) with the apoa1 benchmark is below.

With the icc 8.0 compiler and blocking receives enabled I get:

Benchmark time: 64 CPUs 0.0884288 s/step 1.02348 days/ns 92640 kB memory
Benchmark time: 64 CPUs 0.0840028 s/step 0.972255 days/ns 92640 kB memory
Benchmark time: 64 CPUs 0.0852305 s/step 0.986464 days/ns 92640 kB memory

With the icc 8.1 compiler, blocking receives enabled, and no software pipelining:

Benchmark time: 64 CPUs 0.191893 s/step 2.22098 days/ns 109208 kB memory
Benchmark time: 64 CPUs 0.20277 s/step 2.34688 days/ns 109208 kB memory
Benchmark time: 64 CPUs 0.199111 s/step 2.30453 days/ns 109208 kB memory

With the icc 8.1 compiler, blocking receives enabled, and software pipelining ON:

Benchmark time: 64 CPUs 0.0908668 s/step 1.0517 days/ns 108176 kB memory
Benchmark time: 64 CPUs 0.090242 s/step 1.04447 days/ns 108176 kB memory
Benchmark time: 64 CPUs 0.0906576 s/step 1.04928 days/ns 108176 kB memory

--Brian Bennion

I tried #pragma ivdep as Jim suggested; the data are below.

Benchmark time: 64 CPUs 0.115189 s/step 1.33321 days/ns 109064 kB memory
Benchmark time: 64 CPUs 0.105449 s/step 1.22047 days/ns 109064 kB memory
Benchmark time: 64 CPUs 0.102284 s/step 1.18384 days/ns 109064 kB memory

So compared to #pragma swp and to icc 8.0, the timings are worse, but not by much. The relevant part of the compile log states:

Swp report for loop at line 16 in _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded
in file src/ComputeNonbondedBase2.h

Resource II   =        21
Recurrence II =        10
Minimum II    =        21
Scheduled II  =        24

Estimated GCS II   =   86

Percent of Resource II needed by arithmetic ops     =  57%
Percent of Resource II needed by memory ops         =  57%
Percent of Resource II needed by floating point ops = 100%

Number of stages in the software pipeline =    9

So the number of stages has increased by 7; I'm not sure what that means.

--Brian Bennion

Adding the compile switch -ivdep_parallel to the compile options gave the following benchmarks for 16 nodes (64 CPUs). This switch tells the compiler that loops marked with #pragma ivdep have no loop-carried memory dependencies at all.

No swp directive is present in this compilation.

Benchmark time: 64 CPUs 0.0494337 s/step 0.572149 days/ns 109368 kB memory
Benchmark time: 64 CPUs 0.0484644 s/step 0.56093 days/ns 109368 kB memory
Benchmark time: 64 CPUs 0.0483612 s/step 0.559736 days/ns 109368 kB memory

So the speedup is dramatic, at least for 64 CPUs; other tests are pending. The compile log is shown below:

==========================================================================
SWP REPORT LOG OPENED ON Tue Mar 29 09:19:02 2005
==========================================================================


Swp report for loop at line 16 in _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded
in file src/ComputeNonbondedBase2.h

Resource II   =        21
Recurrence II =        10
Minimum II    =        21
Scheduled II  =        24

Estimated GCS II   =   86

Percent of Resource II needed by arithmetic ops     =  57%
Percent of Resource II needed by memory ops         =  57%
Percent of Resource II needed by floating point ops = 100%

Number of stages in the software pipeline =    9

Nothing changed here compared to the report without -ivdep_parallel.


I added the swp directive after the ivdep directive and before the critical loop in ComputeNonbondedBase2.h, as sketched below.
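A minimal sketch of the resulting ordering, again with a hypothetical stand-in loop rather than the real NAMD source:

    // Illustrative only; in NAMD the real loop lives in ComputeNonbondedBase2.h.
    void inner_loop(int npairs, const double* r2, double* f) {
    #pragma ivdep                    // assert no loop-carried dependencies (Jim's suggestion)
    #pragma swp                      // then request software pipelining of the next loop
        for (int k = 0; k < npairs; ++k) {
            f[k] += 1.0 / r2[k];     // stand-in for the nonbonded force evaluation
        }
    }

The resulting benchmarks: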

Info: Benchmark time: 64 CPUs 0.0491934 s/step 0.569368 days/ns 106696 kB memory
Info: Benchmark time: 64 CPUs 0.0479566 s/step 0.555053 days/ns 106696 kB memory
Info: Benchmark time: 64 CPUs 0.0477935 s/step 0.553166 days/ns 106696 kB memory

So it is slightly better than with ivdep alone. The compile report shows:

Swp report for loop at line 20 in _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded
in file src/ComputeNonbondedBase2.h

Resource II   =        21
Recurrence II =        10
Minimum II    =        21
Scheduled II  =        24

Estimated GCS II   =   86

Percent of Resource II needed by arithmetic ops     =  57%
Percent of Resource II needed by memory ops         =  57%
Percent of Resource II needed by floating point ops = 100%

Number of stages in the software pipeline =    9


Next are the -mcpu and -parallel switches. There was no change in runtime with -mcpu=itanium2. Adding -tpp2 also made no change. Adding -parallel brought back the original poor performance.