Hadoop HA: automatic failover configured, but the standby NN doesn't become active until the killed NN is started again

I am using Hadoop 2.6.0-cdh5.6.0 with HA configured. The active (NN1) and standby (NN2) NameNodes are both displayed. When I send a kill signal to the active NameNode (NN1), the standby (NN2) does not become active until I start NN1 again; once NN1 comes back up, it takes the standby state and NN2 becomes active. I have not configured the "ha.zookeeper.session-timeout.ms" parameter, so I assume it falls back to the default of 5 seconds, and I wait longer than that before checking which NN is active and which is standby.
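For reference, this is how I check the states after killing NN1 (nn1 and nn2 are the NameNode IDs from dfs.ha.namenodes.mycluster in the hdfs-site.xml below):

  hdfs haadmin -getServiceState nn1
  hdfs haadmin -getServiceState nn2

  # automatic failover only works if a ZKFC daemon is running
  # next to each NameNode
  jps | grep DFSZKFailoverController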

My core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster/</value>
  </property>
  <property>
    <name>hadoop.proxyuser.mapred.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.mapred.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>172.17.5.107:2181,172.17.3.88:2181,172.17.5.128:2181</value>
  </property>
</configuration>

My hdfs-site.xml

<configuration>
  <property>
   <name>dfs.permissions.superusergroup</name>
   <value>hadoop</value>
  </property>
  <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:///data/1/dfs/nn</value>
  </property>
  <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:///data/1/dfs/dn</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>172.17.5.107:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>172.17.3.88:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>172.17.5.107:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>172.17.3.88:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://172.17.5.107:8485;172.17.3.88:8485;172.17.5.128:8485/mycluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/1/dfs/jn</value>
  </property>
</configuration>

My zoo.cfg

maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
# the directory where the transaction logs are stored.
dataLogDir=/var/lib/zookeeper
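A quick way to confirm the ZKFCs are registered with this quorum is to inspect the HA znodes. This is a sketch assuming the standard znode layout Hadoop creates after hdfs zkfc -formatZK has been run once; the server address is one of the quorum members from ha.zookeeper.quorum above, and on CDH the zkCli.sh script is also available as the zookeeper-client wrapper:

  zkCli.sh -server 172.17.5.107:2181

  # inside the ZooKeeper shell:
  ls /hadoop-ha/mycluster
  get /hadoop-ha/mycluster/ActiveStandbyElectorLock

  # the lock znode should disappear shortly after the active NN is
  # killed (session expiry), which is what triggers the standby's
  # election; if it lingers, the ZK session timeout has not elapsed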

Answers

The problem was with sshfence: if fencing fails, the ZKFC will not promote the standby. Either grant the hdfs user (which runs the ZKFC) read access to the fencing private key, or configure sshfence to connect as root:

  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence(root)</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/var/lib/hadoop-hdfs/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/1/dfs/jn</value>
  </property>
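To verify the fence path by hand (a sketch; the IP, RPC port, and key path are the ones from this cluster): the ZKFC runs as the hdfs user, so that user must own a readable key and must be able to complete the SSH as root to the other NameNode. sshfence then uses fuser on the target to kill whatever is listening on the NN RPC port:

  # make the key usable by the hdfs user that runs the ZKFC
  chown hdfs:hdfs /var/lib/hadoop-hdfs/.ssh/id_rsa
  chmod 600 /var/lib/hadoop-hdfs/.ssh/id_rsa

  # dry-run the fence from the node hosting nn2 against nn1 (swap the
  # IP to test the other direction); if this prompts for a password or
  # fails, the ZKFC's fence fails too and the standby never goes active
  sudo -u hdfs ssh -i /var/lib/hadoop-hdfs/.ssh/id_rsa \
    root@172.17.5.107 'fuser -v -n tcp 8020'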
