hadoop - Finding directories older than N days in HDFS

Question

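Can hadoop fs -ls be used to find all directories that are older than N days (counted back from the current date)?

I am trying to write a cleanup routine that finds and deletes all directories on HDFS (matching a pattern) that were created more than N days before the current date.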

score 17 · Accepted Answer

This script lists all directories that are older than [days]:

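#!/bin/bash
# List HDFS directories whose modification date is more than [days] days old.
usage="Usage: $0 [days]"

if [ ! "$1" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)
# -lsr is deprecated in Hadoop 2 and later; -ls -R is the equivalent form.
hadoop fs -ls -R | grep "^d" | while read -r f; do
  # Field 6 of each listing line is the modification date (YYYY-MM-DD).
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done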
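Since the question also asks about deleting, here is a minimal sketch of the same age calculation written as a filter that prints only the directory paths, so the output could later be fed to a delete command. The function name older_dirs and the NOW_EPOCH override are my own additions for illustration (NOW_EPOCH just pins "now" so the example is reproducible); GNU date is assumed, as in the script above.

```shell
#!/bin/bash
# Sketch: read `hadoop fs -ls -R`-style lines on stdin and print the paths
# of directories that are more than $1 days old.
# older_dirs and NOW_EPOCH are hypothetical names, not part of Hadoop.
older_dirs() {
  local days=$1 now=${NOW_EPOCH:-$(date +%s)}
  local perms repl owner group size day time path age
  # The last variable (path) receives the rest of the line, spaces included.
  while read -r perms repl owner group size day time path; do
    [[ $perms == d* ]] || continue   # keep directories only
    age=$(( (now - $(date -d "$day $time" +%s)) / 86400 ))
    if (( age > days )); then
      printf '%s\n' "$path"
    fi
  done
}
```

It would be used as `hadoop fs -ls -R /some/path | older_dirs 30`, where /some/path is a placeholder. Review the printed paths before passing them to anything destructive.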

score 7 · Accepted Answer

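If you happen to be using the CDH distribution of Hadoop, it ships with a very useful HdfsFindTool command, which behaves like the Linux find command.

If CDH is installed in the default parcels location, you would run it like this: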

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
org.apache.solr.hadoop.HdfsFindTool -find PATH -mtime +N

Replace PATH with the path to search and N with the number of days.
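To restrict the search to directories matching a name pattern, as the question asks, something like the following sketch may work. It assumes HdfsFindTool accepts -type d and -name the way Linux find does; please check the tool's help output on your CDH version before relying on it. The wrapper name find_old_dirs is hypothetical.

```shell
#!/bin/bash
# Sketch: list directories under a path that match a name pattern and were
# modified more than N days ago. find_old_dirs is a hypothetical helper.
# Assumes HdfsFindTool supports -type and -name like Linux find.
find_old_dirs() {
  local path=$1 pattern=$2 days=$3
  hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool \
    -find "$path" -type d -name "$pattern" -mtime "+$days"
}
```

For example, `find_old_dirs /tmp/hive 'job_*' 7`, where both the path and the pattern are placeholders.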

score 4 · Accepted Answer

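On a real cluster, a recursive ls is not a good idea, since it has to walk the whole namespace through the NameNode. If you have admin rights, working from the fsimage is a better fit.

I modified the script above to illustrate the idea.

First, fetch the fsimage: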

curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump

Convert it to text (the same output that lsr gives):

hdfs oiv -i img.dump -o fsimage.txt

The script:

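#!/bin/bash
# List directories from an fsimage dump that are more than [days] days old.
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)
# Download the latest fsimage from the NameNode and convert it to text.
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
hdfs oiv -i img.dump -o fsimage.txt
grep "^d" fsimage.txt | while read -r f; do
  # Field 6 of each line is the modification date (YYYY-MM-DD).
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done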

score 3 · Accepted Answer

hdfs dfs -ls /hadoop/path/*.txt|awk '$6 < "2017-10-24"'

Answered on 2017-10-24T09:46:50.993
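The one-liner works because dates in YYYY-MM-DD form sort the same way as plain strings. A small sketch of the same comparison with the cutoff computed from N instead of hard-coded, and restricted to directories as the question asks. The helper name dirs_before is made up for this example, and GNU date is assumed for the "days ago" syntax.

```shell
#!/bin/bash
# Sketch: keep only directory lines whose date field ($6) sorts before
# the cutoff. dirs_before is a hypothetical helper; it reads
# `hdfs dfs -ls` output on stdin.
dirs_before() {
  local cutoff=$1
  # The "Found N items" header does not start with "d", so it is dropped too.
  awk -v cutoff="$cutoff" '$1 ~ /^d/ && $6 < cutoff'
}

# Example (not run here): directories last modified more than 30 days ago.
# hdfs dfs -ls /hadoop/path | dirs_before "$(date -d '30 days ago' +%F)"
```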

score 2 · Accepted Answer

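I had neither HdfsFindTool nor the fsimage from curl, and I did not much like piping ls into grep with a while loop that calls date, awk, hadoop, and awk again. But I appreciated the answers.

I felt it could be done with just one ls, one awk, and maybe an xargs.

I also added options to list the files or summarize their total size before choosing to delete them, as well as to pick a specific directory. In the end I left directories alone and concerned myself only with files.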

#!/bin/bash
USAGE="Usage: $0 [N days] (list|size|delete) [path, default /tmp/hive]"
if [ ! "$1" ]; then
  echo "$USAGE"
  exit 1
fi
# Cutoff in the same "YYYY-MM-DD HH:MM" form that hdfs dfs -ls prints,
# so it can be compared as a plain string.
AGO="$(date --date "$1 days ago" "+%F %R")"

echo "# Will search for files older than $AGO"
if [ ! "$2" ]; then
  echo "$USAGE"
  exit 1
fi
INPATH="${3:-/tmp/hive}"

echo "# Will search under $INPATH"
case $2 in
  list)
    hdfs dfs -ls -R "$INPATH" |\
      awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago'
  ;;
  size)
    hdfs dfs -ls -R "$INPATH" |\
      awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago {
           sum += $5 ; cnt += 1} END {
           print cnt, "Files with total", sum, "Bytes"}'
  ;;
  delete)
    # $8 is the path; paths containing whitespace are not handled here.
    hdfs dfs -ls -R "$INPATH" |\
      awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago {print $8}' | \
      xargs hdfs dfs -rm -skipTrash
  ;;
  *)
    echo "$USAGE"
    exit 1
  ;;
esac

I hope others find this useful.
