Re: CHARMRUN ERROR

From: Scott Brozell (srb_at_osc.edu)
Date: Fri May 19 2017 - 11:17:18 CDT

Hi,

Certainly, searching the internet and reading Wikipedia for background is a
common approach. If one follows a particular forum or mailing list
daily for a while, one can build up a knowledge base.

Increasingly time-intensive approaches would be:
1. training courses -
lots of centers provide various materials:
https://www.osc.edu/supercomputing
http://www.alcf.anl.gov/

2. howtos, user guides, and textbooks -
http://tldp.org/
https://en.wikipedia.org/wiki/System_administrator#References

3. take traditional courses or build a cluster yourself.

happy computing,
scott

On Thu, May 18, 2017 at 06:38:26PM +0000, Zeki Zeybek wrote:
> Thank you so much, Scott, for your detailed answer. I will adjust it as you suggested. Meanwhile, can you tell me how I can educate myself about this subject? Whenever I come across a problem, I just google it, read the forums, filter out the irrelevant issues, and blindly apply the code people shared. However, I don't really have a broad understanding of what is going on. How can I improve my knowledge of systems like clusters, ssh, supercomputers, etc.? Should I just pick a relevant book? I am not a technophobic person, but to be honest, handling things in this environment is giving me a bit of a hard time.
>
> ________________________________
> From: Scott Brozell <srb_at_osc.edu>
> Sent: Thursday, May 18, 2017 9:28:52 PM
> To: namd-l_at_ks.uiuc.edu; Zeki Zeybek
> Subject: Re: namd-l: CHARMRUN ERROR
>
> Hi,
>
> Presumably your cluster is on a trusted network, nevertheless:
>
> 1. I would not use an automatic workaround. Instead apply the
> scientific method - keep a record of these instances and report
> them to your cluster support staff. These are unusual events
> in my experience. Even in the most likely case that there is
> nothing suspicious going on, your cluster should have a policy
> and notification mechanism for the underlying issue (which is
> possibly merely cluster maintenance).
>
> 2. If you use this automatic workaround, then make the pattern
> more specific to your cluster's hostname, i.e., replace the asterisk
> with yourhost.org.
>
> scott
>
> On Thu, May 18, 2017 at 07:27:27AM +0000, Zeki Zeybek wrote:
> > I figured out a cruder way of handling the problem. Create a new file named exactly "config" (the file name must be config) in the .ssh directory under your main account directory, not scratch, i.e. clustername/home/accountName/.ssh.
> >
> >
> > Add the following to the config file:
> >
> >
> > Host *
> > StrictHostKeyChecking no
> >
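For reference, a narrower variant of the fragment above, restricted to the cluster's own hosts, might be generated as follows. This is only a sketch: yourhost.org is a placeholder, the "*." wildcard prefix and the helper name write_narrow_config are assumptions, and the output goes to a scratch file so it can be reviewed before anything is copied into ~/.ssh/config.

```shell
#!/bin/sh
# Sketch: write a host-restricted ssh config fragment to a file for review.
# write_narrow_config is a hypothetical helper, not part of OpenSSH.

write_narrow_config() {
    # $1: Host pattern (e.g. '*.yourhost.org'), $2: output file
    printf 'Host %s\n    StrictHostKeyChecking no\n' "$1" > "$2"
}

out=$(mktemp)
write_narrow_config '*.yourhost.org' "$out"   # pattern is a placeholder
cat "$out"                                    # inspect before merging by hand
rm -f "$out"
```

With a pattern like this, host key checking stays enabled for every host that does not match it.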
> >
> > ________________________________
> > From: Zeki Zeybek
> > Sent: 12 May 2017 10:05:13
> > To: Boonstra, S.; namd-l_at_ks.uiuc.edu
> > Subject: Re: namd-l: CHARMRUN ERROR
> >
> > Thank you for your help and also for explaining the cause behind the problem. Interestingly, the problem somehow resolved itself: I tried to start the simulation again after an hour or so and it worked like a charm. Once again, thank you for the insight into the issue.
> >
> > ________________________________
> > From: Boonstra, S. <s.boonstra_at_rug.nl>
> > Sent: Thursday, May 11, 2017 11:03:38 PM
> > To: namd-l_at_ks.uiuc.edu; Zeki Zeybek
> > Subject: Re: namd-l: CHARMRUN ERROR
> >
> > Hi Zeki,
> >
> > I dealt with the same problem on our cluster just yesterday.
> >
> > Possibly, the RSA fingerprint of the node(s) has changed.
> > See also http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2013-2014/2465.html
> > and
> > https://askubuntu.com/questions/45679/ssh-connection-problem-with-host-key-verification-failed-error
> >
> > You can renew the fingerprints (they end up in .ssh/known_hosts) of all the nodes (or nodes in $server_list)
> > with a (bash) script like
> >
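> > # List the target nodes (Slurm-specific; adjust the grep pattern to your cluster)
> > server_list=$(sinfo -N --format="%N" | sort -u | grep 'tcn1[67]')
> > for h in $server_list; do
> >     printf '%s ' "$h"                            # progress output
> >     ip=$(dig +search +short "$h")
> >     ssh-keygen -R "$h"                           # drop the stale entries
> >     ssh-keygen -R "$ip"
> >     ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts   # record the current keys
> >     ssh-keyscan -H "$h" >> ~/.ssh/known_hosts
> > done
> > echo                                             # newline after the progress output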
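Before running a loop like the one above, it can help to preview what it would do to known_hosts. The following is a minimal dry-run sketch, not part of the original script: plan_refresh is a hypothetical helper, and tcn16 and tcn17 are example node names assumed to match the grep pattern.

```shell
#!/bin/sh
# Sketch: print the known_hosts refresh commands for each host
# instead of executing them, so the list can be reviewed first.

plan_refresh() {
    # Arguments: one or more host names (or IP addresses)
    for h in "$@"; do
        printf 'ssh-keygen -R %s\n' "$h"
        printf 'ssh-keyscan -H %s >> ~/.ssh/known_hosts\n' "$h"
    done
}

plan_refresh tcn16 tcn17   # example node names (assumed)
```

The printed lines only describe what would happen; nothing in known_hosts is changed until the real commands are run.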
> >
> >
> > On Thu, May 11, 2017 at 9:39 AM, Zeki Zeybek <zeki.zeybek_at_bilgiedu.net> wrote:
> >
> > Hi!
> >
> >
> > Everything had been running smoothly until today. I did not change anything in the script or in the config file. The error output is below.
> >
> > (sardalya is the name of the partition whose nodes I am trying to use.)
> >
> >
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > Charmrun> Error 255 returned from remote shell (sardalya78:0)
> > Charmrun> Reconnection attempt 1 of 3
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > Charmrun> Error 255 returned from remote shell (sardalya79:1)
> > Charmrun> Reconnection attempt 1 of 3
> > Charmrun> Error 255 returned from remote shell (sardalya80:2)
> > Charmrun> Reconnection attempt 1 of 3
> > Charmrun> Error 255 returned from remote shell (sardalya81:3)
> > Charmrun> Reconnection attempt 1 of 3
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > Charmrun> Error 255 returned from remote shell (sardalya78:0)
> > Charmrun> Reconnection attempt 2 of 3
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > Charmrun> Error 255 returned from remote shell (sardalya79:1)
> > Charmrun> Reconnection attempt 2 of 3
> > Charmrun> Error 255 returned from remote shell (sardalya80:2)
> > Charmrun> Reconnection attempt 2 of 3
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed.
> > Host key verification failed.
> > Charmrun> Error 255 returned from remote shell (sardalya81:3)
> > Charmrun> Reconnection attempt 2 of 3
> > Charmrun> Error 255 returned from remote shell (sardalya78:0)
> > Charmrun> Reconnection attempt 3 of 3
> > ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> > Host key verification failed
> > Charmrun> Error 255 returned from remote shell (sardalya81:3)
> > Charmrun> Reconnection attempt 3 of 3
> > Charmrun> Error 255 returned from remote shell (sardalya78:0)
> > Charmrun> Too many reconnection attempts; bailing out
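
To keep a record of such failures for the cluster support staff, the affected nodes can be pulled out of a log like the one above. This is a minimal sketch: failed_hosts is a hypothetical helper name, and the sample lines fed to it are copied from the output above.

```shell
#!/bin/sh
# Sketch: summarize which nodes charmrun reported as failing.

failed_hosts() {
    # Reads a charmrun log on stdin and matches lines such as
    #   Charmrun> Error 255 returned from remote shell (sardalya78:0)
    # Prints one "host count" pair per node, sorted by host name.
    sed -n 's/.*Charmrun> Error [0-9]* returned from remote shell (\([^:)]*\):[0-9]*).*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Sample input taken from the error output in this thread.
printf '%s\n' \
    'Charmrun> Error 255 returned from remote shell (sardalya78:0)' \
    'Host key verification failed.' \
    'Charmrun> Error 255 returned from remote shell (sardalya79:1)' \
    'Charmrun> Error 255 returned from remote shell (sardalya78:0)' \
    | failed_hosts
```

In practice the function would read the job's own output file on stdin; appending the summary together with the date to a notes file gives a simple record to send along with a report.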

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2018 - 23:20:18 CST