Cleaning up an HDFS directory that has too many files and directories
At times a directory on HDFS accumulates so many inodes (files and directories) that it becomes really hard to delete. In some cases even listing it leads to out-of-memory (OOM) errors such as the following:
INFO retry.RetryInvocationHandler: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: GC overhead limit exceeded, while invoking ClientNamenodeProtocolTranslatorPB.getListing over namenode-server.domain.tld/x.y.x.a:8020. Trying to failover immediately
The following set of shell commands helped me pull out the list of files and delete them from HDFS:
export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx16000m"

hdfs dfs -ls /tmp/hive/hive | awk '/2018-[0-9][0-9]-/{print $8}' | paste -d ' ' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > /tmp/del_hive_01.txt

cat /tmp/del_hive_01.txt | paste -d ' ' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > /tmp/del_hive_02.txt

awk -F'\t' '{print "hdfs dfs -rm -r -f -skipTrash ", $1 }' /tmp/del_hive_02.txt > /tmp/del_hive_03.txt

sh /tmp/del_hive_03.txt
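Before running anything destructive, the pipeline can be sanity-checked locally. This is a minimal sketch: two fabricated `hdfs dfs -ls`-style lines stand in for the real listing (the paths are made up, and `paste` batches only 2 per line here instead of 35), and it prints the delete commands that would be generated instead of running them:

```shell
# Two fake listing lines stand in for `hdfs dfs -ls /tmp/hive/hive` output.
# awk keeps only the path column ($8); paste folds 2 paths per line here
# (35 in the real run); the final awk wraps each batch in a delete command.
printf 'drwxr-xr-x - hive hdfs 0 2018-03-01 10:00 /tmp/hive/hive/dir1\ndrwxr-xr-x - hive hdfs 0 2018-03-02 11:00 /tmp/hive/hive/dir2\n' \
  | awk '/2018-[0-9][0-9]-/{print $8}' \
  | paste -d ' ' - - \
  | awk -F'\t' '{print "hdfs dfs -rm -r -f -skipTrash", $1}'
# Prints: hdfs dfs -rm -r -f -skipTrash /tmp/hive/hive/dir1 /tmp/hive/hive/dir2
```

Note that the last awk uses `-F'\t'` on purpose: the batched lines contain spaces but no tabs, so `$1` is the whole line of paths, not just the first path.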
The export was required to increase the client heap; without it, even listing the directory or deleting a single file/dir by wildcard failed with the GC error above.
paste was used to batch multiple directories into each rm command; deleting them one at a time would have taken ages to clean up the dir.
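As a standalone illustration of that batching, `paste` with N `-` operands folds every N consecutive lines of stdin into one line (the paths here are made up):

```shell
# Fold four input lines into one space-separated line.
printf '/a\n/b\n/c\n/d\n' | paste -d ' ' - - - -
# Prints: /a /b /c /d
```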
You may change the awk filter pattern based on the files you are trying to clean up.
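For instance, with the same listing format, only the regex needs to change to target a different month (the sample lines and paths below are made up for illustration):

```shell
# Keep only paths whose listing date falls in April 2018.
printf 'drwxr-xr-x - hive hdfs 0 2018-03-15 09:00 /tmp/hive/hive/mar_dir\ndrwxr-xr-x - hive hdfs 0 2018-04-02 09:00 /tmp/hive/hive/apr_dir\n' \
  | awk '/2018-04-/{print $8}'
# Prints: /tmp/hive/hive/apr_dir
```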
HTH