Best practices for Namenode and Datanode restarts
Problems
Following are some problems we might come across while working with a large setup of Hadoop clusters:
- Namenode restarts take a long time (progress is visible at http://nn-host:50070/dfshealth.html#tab-startup-progress)
- Namenode stays in safemode for a long time after a restart
Best practices for Namenode & Datanode restarts
DO NOT restart all services at once. Instead, do the following in order:
- Go to the standby namenode first, and restart it
- Then restart the active namenode
- Do a rolling restart of the datanodes. Restart 2 datanodes at a time and leave 3-4 minutes between restart batches. It is safer that way, as running jobs should not be impacted: with a replication factor of 3, at least one replica of every block stays available while 2 datanodes are down.
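The rolling datanode restart described above can be sketched as a small shell script. This is only an illustration under assumptions: `restart_datanode` is a hypothetical helper that assumes passwordless ssh and the Hadoop 2 style `hadoop-daemon.sh` script on the remote PATH, and `BATCH_SIZE` and `PAUSE_SECONDS` are made-up variable names. Replace the helper with whatever your cluster manager (Ambari, Cloudera Manager, systemd, etc.) uses.

```shell
# Rolling datanode restart sketch: a few hosts per batch, pause between batches.
BATCH_SIZE="${BATCH_SIZE:-2}"          # datanodes restarted per batch
PAUSE_SECONDS="${PAUSE_SECONDS:-210}"  # 3.5 minutes between batches

# Hypothetical helper (assumption): restart one datanode over ssh.
restart_datanode() {
  ssh "$1" 'hadoop-daemon.sh stop datanode && hadoop-daemon.sh start datanode'
}

# Usage: rolling_restart dn-host-1 dn-host-2 dn-host-3 ...
rolling_restart() {
  while [ "$#" -gt 0 ]; do
    count=0
    batch=""
    # Take up to BATCH_SIZE hosts off the argument list.
    while [ "$#" -gt 0 ] && [ "$count" -lt "$BATCH_SIZE" ]; do
      batch="$batch $1"
      shift
      count=$((count + 1))
    done
    echo "Restarting batch:$batch"
    for host in $batch; do
      restart_datanode "$host" || { echo "Restart failed on $host" >&2; return 1; }
    done
    # Pause only if more batches remain.
    if [ "$#" -gt 0 ]; then
      sleep "$PAUSE_SECONDS"
    fi
  done
}
```

The script stops at the first failed restart so that a broken datanode is noticed before the next batch goes down.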
Faster namenode startup
Startup is usually slow when the namenode has a large number of edit logs to load. It is recommended to save the namespace periodically (once a month or so) so that a fresh fsimage is built and fewer edits have to be replayed at startup. Make sure no jobs are running, because HDFS is read-only while the namenode is in safemode.
# For all namenodes
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# For specific namenode in case of HA (start with Standby first. Port is usually 8020 or 9000)
hdfs dfsadmin -fs hdfs://<namenode-host>:<port> -safemode enter
hdfs dfsadmin -fs hdfs://<namenode-host>:<port> -saveNamespace
hdfs dfsadmin -fs hdfs://<namenode-host>:<port> -safemode leave
Exiting namenode safemode manually
DO NOT try to force the namenode out of safemode manually using the command below 😀
hdfs dfsadmin -safemode leave
This could result in missing or under-replicated blocks on the namenode. Instead, go to the namenode UI, check for the datanodes that have not reported their blocks to the namenode, and restart them individually. An easy way to find those datanodes is from the number of blocks reported in the UI: they will be the ones with an oddly low number of blocks.
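Spotting the "oddly low" datanodes can be scripted once the per-datanode block counts are in a text file. The sketch below assumes input lines of the form `<datanode-host> <block-count>` (for example, copied from the Datanodes tab of the namenode UI). The function name `low_block_datanodes`, the `THRESHOLD_PCT` variable, and the cutoff of 50% of the median are all illustrative assumptions, not anything Hadoop defines.

```shell
# Reads "<host> <block-count>" lines on stdin and prints the hosts whose
# block count is below THRESHOLD_PCT percent of the median count.
low_block_datanodes() {
  sort -k2,2n | awk -v pct="${THRESHOLD_PCT:-50}" '
    { host[NR] = $1; blocks[NR] = $2 }
    END {
      if (NR == 0) exit
      # Input is sorted by count, so the middle row holds the median.
      median = blocks[int((NR + 1) / 2)]
      for (i = 1; i <= NR; i++)
        if (blocks[i] * 100 < median * pct) print host[i], blocks[i]
    }'
}
```

For example, `low_block_datanodes < block_counts.txt` lists the candidates to restart one at a time, where `block_counts.txt` is a hypothetical file holding the copied counts.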
Switching Namenodes
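Use the command below instead of bouncing the active namenode. The first argument is the current active namenode (nn1 here) and the second is the standby that should take over (nn2 here):
hdfs haadmin -failover nn1 nn2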
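Before failing over, it helps to confirm which namenode is actually active. A minimal sketch, assuming the namenode IDs are `nn1` and `nn2` (placeholders for whatever is configured in `dfs.ha.namenodes.<nameservice>`) and using `hdfs haadmin -getServiceState` to read each state. The wrapper function `failover_to_standby` is a made-up name for illustration.

```shell
# Usage: failover_to_standby nn1 nn2
# Looks up which ID is active and which is standby, then fails over.
failover_to_standby() {
  active=""
  standby=""
  for nn in "$@"; do
    state="$(hdfs haadmin -getServiceState "$nn")" || return 1
    case "$state" in
      active)  active="$nn" ;;
      standby) standby="$nn" ;;
    esac
  done
  if [ -z "$active" ] || [ -z "$standby" ]; then
    echo "Need one active and one standby namenode" >&2
    return 1
  fi
  # haadmin -failover takes the current active first, then the target standby.
  hdfs haadmin -failover "$active" "$standby"
}
```

The function refuses to act unless it finds exactly the pair it needs, which avoids issuing a failover while one namenode is still starting up.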
HTH