shell - Get 25% of a file's lines

Question

I'm trying to display a random 25% of a file's lines.

Here is my script:

file=$1
nb_lignes=$(wc -l $file | cut -d " " -f1)
num_lines_to_get=$((25*${nb_lignes}/100)) 
for (( i=0; i < $num_lines_to_get; i++))
do
line=$(head -$((${RANDOM} % $nb_lignes)) $file | tail -1)
echo "$line"
done

I run it like this:

./script.sh file

The file is:

xxxxxxxx-54.yyyyy
xxxxxxxx-55.yyyyy
xxxxxxxx-60.yyyyy
xxxxxxxx-66.yyyyy

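My question is how to exclude 54 and 55. That is, I want 25% of this list, not counting the two lines for 54 and 55, and I want to specify them on the command line like this: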

./script.sh file 54 55

Thank you.

score 4 · Accepted Answer

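You can't calculate 25% unless you know how many lines represent 100%, so every solution will be either (1) single-pass, storing the file in memory, or (2) multi-pass, in order to collect the line count. I don't know how long the files you're processing are, but I prefer the second option regardless, so that's how I'll answer.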
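For comparison, option (1) could look roughly like the sketch below: a single awk pass that keeps every line in memory and then picks a quarter of them at random. The function name `sample_quarter` is made up for illustration, and this is a sketch under those assumptions rather than a tuned implementation.

```shell
# sample_quarter FILE: print a random 25% of FILE's lines (rounded down).
# Hypothetical helper; reads the file once and holds all lines in memory.
sample_quarter() {
    awk '
        BEGIN { srand() }
        { line[NR] = $0 }                 # store every line
        END {
            want = int(NR * 0.25)
            # Partial Fisher-Yates shuffle: pick `want` distinct lines.
            for (i = 1; i <= want; i++) {
                j = i + int(rand() * (NR - i + 1))
                tmp = line[i]; line[i] = line[j]; line[j] = tmp
                print line[i]
            }
        }
    ' "$1"
}
```

Calling `sample_quarter input.txt` on an 8-line file prints 2 distinct lines. Since `srand()` seeds from the clock, two calls within the same second may return the same selection.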

If you're running Linux, you probably have the GNU versions of most tools. One solution might be:

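#!/bin/bash
# bash, not sh: the ${var// /|} substitution below is a bashism.

# 25% of the total line count, truncated to an integer.
linecount=$(awk 'END{printf("%d", NR * 0.25)}' input.txt)
exclude="$*"
# Drop lines containing any excluded number as a whole word, then sample.
grep -Evw "${exclude// /|}" input.txt | shuf -n "$linecount"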

Or:

#!/bin/bash

linecount=$(awk 'END{printf("%d", NR * 0.25)}' input.txt)
exclude="$*"
# sort -R orders by a random hash, so identical lines end up adjacent.
grep -Evw "${exclude// /|}" input.txt | sort -R | head -n "$linecount"

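This solution assumes that the "xxxxxx" and "yyyyy" strings don't contain word-delimited versions of the numbers you want to skip. If they might, you should probably give us more details, such as actual sample data.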
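If the numbers to skip always sit between a hyphen and a dot, as in the sample data, the pattern can be tied to that position so the same digits elsewhere in a line are left alone. The sketch below also takes 25% of the lines that remain after filtering, which is one reading of the question. The function name `sample_excluding`, the `-NN.` layout, and the requirement of at least one number to exclude are all assumptions; `shuf` is the GNU tool.

```shell
# sample_excluding FILE NUM...: drop lines containing "-NUM." for any NUM,
# then print a random 25% of the remaining lines (rounded down).
# Hypothetical helper; assumes GNU shuf and at least one NUM argument.
sample_excluding() {
    file=$1
    shift
    # Join the numbers with "|" to build an alternation such as 54|55.
    alternation=$(printf '%s|' "$@")
    alternation=${alternation%|}
    pattern="-($alternation)\\."

    # Two passes: count the survivors first, then sample from them.
    remaining=$(grep -Evc -e "$pattern" "$file")
    grep -Ev -e "$pattern" "$file" | shuf -n "$((remaining / 4))"
}
```

For example, `sample_excluding input.txt 54 55` on a 10-line file where two lines match leaves 8 candidates and prints 2 of them.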

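If you're on FreeBSD or OSX, then sort doesn't have a -R option and shuf isn't included, but you can still get this done. Your system has a tool called jot, which can produce random numbers within a range. So this is a bit awkward, but it works: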

#!/bin/bash
# bash, not sh: [[ ... =~ ... ]] and ${var// /|} are bashisms.

# `awk` is a little heavier than `wc`, but you don't need to parse its output.
lines=$(awk 'END{printf("%d", NR * 0.25)}' input.txt)

exclude="$*"
# Matches e.g. "-54." or "-55." when run with arguments 54 55.
pattern="-(${exclude// /|})\."

# First, put a random number at the beginning of each line.
while IFS= read -r line; do
  # skip lines that match our exclusion list
  if [[ $line =~ $pattern ]]; then
    continue
  fi
  echo "$(jot -r 1 1 10000000) $line"
done < input.txt > stage1.txt

# Next, sort by the random number.
sort -n stage1.txt > stage2.txt

# Last, remove the number from the start of each line.
sed -E 's/^[0-9]+ //' stage2.txt > stage3.txt

# Show our output
head -n "$lines" stage3.txt

# Clean up
rm stage1.txt stage2.txt stage3.txt

If you like, you can combine some of these steps to avoid staging the data in separate files.

#!/bin/bash

lines=$(awk 'END{printf("%d", NR * 0.25)}' input.txt)

exclude="$*"
pattern="-(${exclude// /|})\."

while IFS= read -r line; do
  if [[ $line =~ $pattern ]]; then
    continue
  fi
  echo "$(jot -r 1 1 10000000) $line"
done < input.txt | sort -n | sed -E 's/^[0-9]+ //' | head -n "$lines"

# no clean-up required
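Where neither `shuf` nor `jot` is at hand, awk's own `rand()` can supply the random key, so the whole decorate, sort, undecorate pipeline needs only POSIX tools. This is a minimal sketch, assuming the same `-NN.` line layout as the sample data and at least one number to exclude; the function name `sample_portable` is hypothetical.

```shell
# sample_portable FILE NUM...: like the loop above, but the random key
# comes from awk instead of jot. Hypothetical helper, POSIX tools only.
sample_portable() {
    file=$1
    shift
    alternation=$(printf '%s|' "$@")
    # [.] instead of \. because awk -v interprets backslash escapes.
    pattern="-(${alternation%|})[.]"
    lines=$(awk 'END { printf("%d", NR * 0.25) }' "$file")

    awk -v pat="$pattern" '
        BEGIN { srand() }
        $0 ~ pat { next }                     # skip excluded lines
        { printf("%.8f %s\n", rand(), $0) }   # decorate with a random key
    ' "$file" |
        sort -n |                             # order by the key
        cut -d " " -f 2- |                    # strip the key again
        head -n "$lines"
}
```

As in the scripts above, the 25% is taken from the total line count, not from the count after exclusion.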

score 2 · Accepted Answer

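You can use a combination of unix tools. shuf is a good one, as are wc and awk. Count the lines the same way, adjust the count to account for the lines being ignored, and then print a random selection of the rest.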

num_lines=$(wc -l "$file" | cut -f1 -d' ')
high=55
low=54

# Check the higher line number first (see the note below).
if [ "$num_lines" -ge "$high" ]; then num_lines=$((num_lines - 1)); fi
if [ "$num_lines" -ge "$low" ]; then num_lines=$((num_lines - 1)); fi

awk -v low="$low" -v high="$high" 'NR != low && NR != high' "$file" \
    | shuf -n "$((num_lines / 4))"

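Note that the order of the if statements matters, so that the right number of subtractions happens. If the file has 54 lines, only one line is skipped, so there should be only one subtraction. If it has 55 lines, two lines are skipped, and this ordering is required or the second subtraction would not happen.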
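To see the ordering at work, the two checks can be wrapped in a small function and fed a few line counts. The function name `remaining_lines` is made up for this illustration; `low` and `high` are the values from the snippet above.

```shell
# remaining_lines COUNT: how many lines are left once lines 54 and 55
# are skipped, using the same two checks in the same order.
remaining_lines() {
    n=$1
    high=55
    low=54
    if [ "$n" -ge "$high" ]; then n=$((n - 1)); fi
    if [ "$n" -ge "$low" ]; then n=$((n - 1)); fi
    echo "$n"
}

remaining_lines 53    # prints 53: neither line exists
remaining_lines 54    # prints 53: only line 54 is skipped
remaining_lines 55    # prints 53: both lines are skipped
remaining_lines 100   # prints 98
```

With the two checks swapped, a 55-line file would yield 54, because after the first decrement the count no longer reaches 55.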

Note that if you want the lines in their original order, you can use the following in place of the final awk .. | shuf .. pipeline.

awk -v low="$low" -v high="$high" 'NR != low && NR != high { print NR, $0 }' "$file" \
    | shuf -n "$((num_lines / 4))" | sort -n | cut -f2- -d' '

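(It first tags each line with its line number, sorts on that number after sampling, and then strips the number off again, i.e. a Schwartzian Transform.)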
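The stages are easier to follow on their own. This is a sketch, assuming GNU `shuf`; the function name `sample_in_order` is hypothetical, and it takes a fixed number of lines instead of a percentage to keep the example short.

```shell
# sample_in_order FILE COUNT: print COUNT random lines of FILE,
# keeping them in their original relative order.
# Hypothetical helper; assumes GNU shuf.
sample_in_order() {
    awk '{ print NR, $0 }' "$1" |   # decorate: prefix the line number
        shuf -n "$2" |              # sample in random order
        sort -n |                   # restore the original order
        cut -d ' ' -f 2-            # undecorate: drop the line number
}
```

For instance, `sample_in_order input.txt 5` prints five lines that appear in the same relative order as in `input.txt`.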
