Re: Failure to run namd-cuda with gtx-470

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Jul 07 2011 - 01:57:25 CDT

Solved, but only by a workaround. The message "GPU has fallen off the
bus" (which simply means that the gtx-470 cards are no longer seen by
the system) deserves attention and a deep-seated solution.
Unfortunately that cannot come from me, a chemist who only knows his
own job. Unless I am the only one running into trouble with cuda,
which seems to be the case.
chiendarret
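
For anyone who wants to repeat the workaround described in the quoted
message below (persistence mode, then temperature checks from an
ssh-linked desktop), here is a minimal sketch. It assumes that
"persistent mode" corresponds to `nvidia-smi -pm 1` and that the script
is run as root on the GPU machine over ssh. The names `run` and `watch`
and the 60-second interval are illustrative only, not something taken
from the original setup.

```python
import subprocess
import time

# Assumption: "persistent mode" in the message means `nvidia-smi -pm 1` (needs root).
PERSISTENCE_CMD = ["nvidia-smi", "-pm", "1"]
# The temperature query quoted in the message.
TEMPERATURE_CMD = ["nvidia-smi", "-q", "-d", "TEMPERATURE"]


def run(cmd):
    """Run a command and return its stdout as text; raises if the command fails."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def watch(samples, interval_s=60.0, runner=run, sleep=time.sleep):
    """Enable persistence mode once, then collect `samples` temperature reports."""
    runner(PERSISTENCE_CMD)
    reports = []
    for i in range(samples):
        reports.append(runner(TEMPERATURE_CMD))
        if i < samples - 1:
            sleep(interval_s)  # wait between queries, but not after the last one
    return reports
```

Calling `watch(3)` in an ssh session would return three raw text
reports. The `runner` and `sleep` parameters exist only so the loop can
be tried without a GPU present.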

On Wed, Jul 6, 2011 at 11:18 PM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
> Replying as the author:
>
> Solved by setting persistence mode and monitoring from an ssh-linked
> desktop. From its own terminal (no X server was started) the gtx470
> machine appears to be hung (keyboard not sensed, terminal blinking
> at a stage long before the namd md run had started), but it is
> actually working normally (top -i: 6 processors;
> nvidia-smi -q -d TEMPERATURE reports 90, 85) and writes a correct
> *.log. To avoid software problems on the gtx-470 machine, shutdown
> will have to be issued from the desktop. I wonder whether there is a
> problem with the latest nvidia driver on the gtx470 (the problems
> started after upgrading to the latest driver); however, I don't mind
> as long as I can work. chiendarret
>
> On Wed, Jul 6, 2011 at 5:44 PM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
>> I used to be able to run namd-cuda md simulations without problems
>> after running (as root)
>>
>> nvidia-smi -L
>>
>> but the situation has gradually worsened. After a successful two-day
>> simulation, this afternoon a similar simulation does not start at all
>> and the computer no longer senses the keyboard, as if it were
>> hung. Looking at /var/log/messages of the gig64 machine from an
>> ssh-linked desktop, kernel problems are indicated, as reported
>> in part below.
>>
>> ******************
>> Jul  6 17:26:42 gig64 kernel: [  230.179942] NVRM: GPU at 0000:01:00.0
>> has fallen off the bus.
>> Jul  6 17:28:04 gig64 kernel: [  312.424094] Modules linked in:
>> powernow_k8 mperf cpufreq_conservative cpufreq_powersave cpufreq_stats
>> cpufreq_userspace fuse nfsd exportfs nfs lockd fscache nfs_acl
>> auth_rpcgss sunrpc ext2 it87 hwmon_vid loop firewire_sbp2
>> snd_hda_codec_hdmi nvidia(P) snd_hda_codec_realtek snd_hda_intel
>> snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device evdev
>> pcspkr k10temp snd i2c_piix4 soundcore edac_core edac_mce_amd i2c_core
>> parport_pc snd_page_alloc parport wmi button processor thermal_sys
>> ext3 jbd mbcache dm_mod raid1 md_mod usbhid hid sg sr_mod sd_mod cdrom
>> crc_t10dif ata_generic ohci_hcd xhci_hcd pata_atiixp pata_jmicron ahci
>> libahci libata ehci_hcd firewire_ohci usbcore firewire_core scsi_mod
>> crc_itu_t floppy nls_base r8169 mii [last unloaded: scsi_wait_scan]
>> Jul  6 17:28:04 gig64 kernel: [  312.424204] CPU 4
>> Jul  6 17:28:04 gig64 kernel: [  312.424208] Modules linked in:
>> powernow_k8 mperf cpufreq_conservative cpufreq_powersave cpufreq_stats
>> cpufreq_userspace fuse nfsd exportfs nfs lockd fscache nfs_acl
>> auth_rpcgss sunrpc ext2 it87 hwmon_vid loop firewire_sbp2
>> snd_hda_codec_hdmi nvidia(P) snd_hda_codec_realtek snd_hda_intel
>> snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device evdev
>> pcspkr k10temp snd i2c_piix4 soundcore edac_core edac_mce_amd i2c_core
>> parport_pc snd_page_alloc parport wmi button processor thermal_sys
>> ext3 jbd mbcache dm_mod raid1 md_mod usbhid hid sg sr_mod sd_mod cdrom
>> crc_t10dif ata_generic ohci_hcd xhci_hcd pata_atiixp pata_jmicron ahci
>> libahci libata ehci_hcd firewire_ohci usbcore firewire_core scsi_mod
>> crc_itu_t floppy nls_base r8169 mii [last unloaded: scsi_wait_scan]
>> Jul  6 17:28:04 gig64 kernel: [  312.424302]
>> Jul  6 17:28:04 gig64 kernel: [  312.424309] Pid: 2916, comm: namd2
>> Tainted: P           O 2.6.38-2-amd64 #1 Gigabyte Technology Co., Ltd.
>> GA-890FXA-UD5/GA-890FXA-UD5
>> Jul  6 17:28:04 gig64 kernel: [  312.424322] RIP:
>> 0010:[<ffffffffa07d4f74>]  [<ffffffffa07d4f74>]
>> _nv015265rm+0x252/0x260 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.424951] RSP:
>> 0018:ffff88042075fc88  EFLAGS: 00000297
>> Jul  6 17:28:04 gig64 kernel: [  312.424958] RAX: 00000000ffffffff
>> RBX: ffff88042b522000 RCX: 0000000000000019
>> Jul  6 17:28:04 gig64 kernel: [  312.424964] RDX: 00000000ffffffff
>> RSI: 0000000000005499 RDI: ffff88042b522028
>> Jul  6 17:28:04 gig64 kernel: [  312.424971] RBP: ffff8804217f2c88
>> R08: ffff8804217f2c98 R09: 0000000000000000
>> Jul  6 17:28:04 gig64 kernel: [  312.424977] R10: 0000000000000246
>> R11: 0000000000000028 R12: ffffffff8100a30e
>> Jul  6 17:28:04 gig64 kernel: [  312.424983] R13: 0000000000000000
>> R14: 0000000000000246 R15: 0000000000000028
>> Jul  6 17:28:04 gig64 kernel: [  312.424991] FS:
>> 00007f2f355a1720(0000) GS:ffff8800bfb00000(0000)
>> knlGS:0000000000000000
>> Jul  6 17:28:04 gig64 kernel: [  312.424998] CS:  0010 DS: 0000 ES:
>> 0000 CR0: 000000008005003b
>> Jul  6 17:28:04 gig64 kernel: [  312.425004] CR2: 00007f2f33947000
>> CR3: 000000041ef84000 CR4: 00000000000006e0
>> Jul  6 17:28:04 gig64 kernel: [  312.425010] DR0: 0000000000000000
>> DR1: 0000000000000000 DR2: 0000000000000000
>> Jul  6 17:28:04 gig64 kernel: [  312.425016] DR3: 0000000000000000
>> DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Jul  6 17:28:04 gig64 kernel: [  312.425024] Process namd2 (pid: 2916,
>> threadinfo ffff88042075e000, task ffff88041e5cd7c0)
>> Jul  6 17:28:04 gig64 kernel: [  312.425065]  ffff88042b522000
>> 0000000000000045 0000000000000000 0000000000000003
>> Jul  6 17:28:04 gig64 kernel: [  312.425077]  0000000000000000
>> ffffffffa045b0dc ffff88042b522000 ffff88042b057000
>> Jul  6 17:28:04 gig64 kernel: [  312.425088]  ffff88042fb30000
>> ffffffffa045b2e1 0000000000000002 0000000000000003
>> Jul  6 17:28:04 gig64 kernel: [  312.425494]  [<ffffffffa045b0dc>] ?
>> _nv002890rm+0x8b/0x9c [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.425853]  [<ffffffffa045b2e1>] ?
>> _nv005068rm+0x1f4/0x20a [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.426368]  [<ffffffffa069a5d7>] ?
>> _nv010159rm+0x97b/0x9a9 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.426881]  [<ffffffffa069303a>] ?
>> _nv010153rm+0x314/0x3eb [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.427194]  [<ffffffffa03dc728>] ?
>> _nv002567rm+0x8c6/0x9a8 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.427511]  [<ffffffffa03ebe35>] ?
>> _nv002030rm+0xac/0xf0 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.427827]  [<ffffffffa03ebe00>] ?
>> _nv002030rm+0x77/0xf0 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a20265>] ?
>> _nv002424rm+0x5b5/0x751 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a1a8b1>] ?
>> rm_ioctl+0x30/0x10a [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a390e9>] ?
>> nv_kern_ioctl+0x31a/0x381 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a3918d>] ?
>> nv_kern_unlocked_ioctl+0x1c/0x20 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff81104b0b>] ?
>> do_vfs_ioctl+0x467/0x4b4
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff810d60fd>] ?
>> do_brk+0x2ca/0x326
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff81104ba3>] ?
>> sys_ioctl+0x4b/0x70
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff81009952>] ?
>> system_call_fastpath+0x16/0x1b
>> Jul  6 17:28:04 gig64 kernel: [  312.428001] Call Trace:
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa045b0dc>] ?
>> _nv002890rm+0x8b/0x9c [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa045b2e1>] ?
>> _nv005068rm+0x1f4/0x20a [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa069a5d7>] ?
>> _nv010159rm+0x97b/0x9a9 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa069303a>] ?
>> _nv010153rm+0x314/0x3eb [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa03dc728>] ?
>> _nv002567rm+0x8c6/0x9a8 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa03ebe35>] ?
>> _nv002030rm+0xac/0xf0 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa03ebe00>] ?
>> _nv002030rm+0x77/0xf0 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a20265>] ?
>> _nv002424rm+0x5b5/0x751 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a1a8b1>] ?
>> rm_ioctl+0x30/0x10a [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a390e9>] ?
>> nv_kern_ioctl+0x31a/0x381 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffffa0a3918d>] ?
>> nv_kern_unlocked_ioctl+0x1c/0x20 [nvidia]
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff81104b0b>] ?
>> do_vfs_ioctl+0x467/0x4b4
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff810d60fd>] ?
>> do_brk+0x2ca/0x326
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff81104ba3>] ?
>> sys_ioctl+0x4b/0x70
>> Jul  6 17:28:04 gig64 kernel: [  312.428001]  [<ffffffff81009952>] ?
>> system_call_fastpath+0x16/0x1b
>> root_at_gig64:/var/log#
>> Message from syslogd_at_gig64 at Jul  6 17:29:28 ...
>>  kernel:[  396.424694] Stack:
>>
>> Message from syslogd_at_gig64 at Jul  6 17:29:28 ...
>>  kernel:[  396.424762] Call Trace:
>>
>> Message from syslogd_at_gig64 at Jul  6 17:29:28 ...
>>  kernel:[  396.428001] Code: be a0 e8 85 65 60 00 b8 00 00 00 00 e8 9d
>> 65 60 00 85 c0 74 10 e8 84 28 63 00 0f 1f 00 eb 06 89 77 68 89 4f 6c
>> 48 83 c5 10 5b c3 <41> 54 53 48 83 ec 08 48 83 ed 08 41 89 f4 39 77 68
>> 73 17 39 77
>> ******************
>>
>> Such messages from syslogd_at_gig64 continue at a slow pace.
>>
>>
>> Thanks for any advice
>>
>> francesco pietra
>>
>>
>
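
The check of /var/log/messages described in the first message can also
be done with a small script. This is only a sketch: the function name
`find_fallen_gpus` is hypothetical, and it assumes that each kernel
message sits on a single line in the real log file (the excerpt above
is wrapped by the mail client).

```python
import re

# Matches the NVRM message from the log excerpt, e.g.
# "NVRM: GPU at 0000:01:00.0 has fallen off the bus."
FALLEN_RE = re.compile(
    r"NVRM: GPU at (?P<pci>[0-9a-fA-F:.]+) has fallen off the bus"
)


def find_fallen_gpus(lines):
    """Return (line_number, pci_address) for each 'fallen off the bus' message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = FALLEN_RE.search(line)
        if match:
            hits.append((number, match.group("pci")))
    return hits
```

To scan the real log, pass an open file, for example
`find_fallen_gpus(open("/var/log/messages"))`, run as a user allowed to
read that file.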
