linux - Bash - Swapping values in a column

Question

I have some CSV/tabular data in a file, like this:

1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9

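(They aren't always numbers, just random comma-separated values. Single-digit numbers are easier to use as an example, though.)

I want to randomly shuffle 40% of any one column. As an example, say the third one. So perhaps 3 and 1 get swapped with each other. The third column is now: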

1 << Came from the last position
8
5
7
3 << Came from the first position

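I'm trying to do this in place on the file, from a bash script I'm working on, and I'm not having much luck. I've been wandering down some pretty crazy and fruitless grep rabbit holes, which makes me think I'm going the wrong way (the constant failure is what's tipping me off).

I tagged this question with a litany of things because I'm not entirely sure which tool I should even be using.

Edit: I will probably end up accepting Rubens's answer, however quirky it is, because it directly includes the swapping concept (which I suppose I could have emphasized more in the original question), and it lets me specify the percentage of the column to swap. It also happens to work, which is always a plus.

For anyone who doesn't need that and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).

A word of warning about Rubens's solution, however. I took this: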

for (i = 1; i <= NF; ++i) {
  delim = (i != NF) ? "," : "";
  ...
}
printf "\n";

removed the printf "\n"; and moved the newline up, like this:

for (i = 1; i <= NF; ++i) {
  delim = (i != NF) ? "," : "\n";
  ...
}

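because with just "" in the else case, awk wrote a broken character (\00) at the end of every line. At one point it even managed to replace my entire file with Chinese characters. Although, honestly, that probably involved me doing something even dumber on top of this problem.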
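
Another way to sidestep the problem, offered here as a suggestion rather than something from the thread, is to print the delimiter with %s instead of %c. With %s an empty string writes no bytes at all, whereas what %c does with an empty string may differ between awk implementations. A minimal sketch on one sample row (the function name rebuild_row is made up for illustration):

```shell
# Rebuild each row field by field, printing the delimiter with %s.
# An empty delimiter after the last field then produces no output bytes.
rebuild_row() {
    awk -F, '{
        for (i = 1; i <= NF; ++i) {
            delim = (i != NF) ? "," : ""
            printf "%s%s", $i, delim
        }
        printf "\n"
    }'
}

echo '1,7,3,2' | rebuild_row   # prints 1,7,3,2
```

Piping the output through od -c is an easy way to check whether any stray \0 bytes are left.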

score 4 · Accepted Answer

This works for one specifically named column, but it should be enough to point you in the right direction. It works in modern bash shells, including Cygwin:

paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)

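The operative feature is "process substitution".

The paste command joins files horizontally. The three pieces are split off from the original file with cut, and the second piece (the column to be randomized) is run through the shuf command to reorder its lines. Here is the output from running it a couple of times: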

$ cat test.dat
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9

$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,1,2
8,3,8,0
4,9,7,3
8,5,3,3
5,6,5,9

$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,8,2
8,3,1,0
4,9,3,3
8,5,7,3
5,6,5,9
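
The one-liner above hard-codes column 3. A minimal sketch that takes the column as a parameter is shown below. It is an assumption-laden variation, not part of the original answer: it lets awk write the shuffled values back instead of using three cut calls, so the first and last columns need no special handling, and it reads the shuffled column from stdin rather than through process substitution. The name shuffle_col is hypothetical.

```shell
# shuffle_col FILE COL: print FILE with column COL (1-based) fully shuffled
shuffle_col() {
    cut -d, -f"$2" "$1" | shuf |
    awk -F, -v OFS=, -v col="$2" '
        NR == FNR { shuffled[FNR] = $0; next }  # stdin: the shuffled column
        { $col = shuffled[FNR]; print }         # FILE: overwrite column col
    ' - "$1"
}
```

For the sample file this would be called as: shuffle_col test.dat 3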

score 1 · Accepted Answer

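Algorithm:

Create a vector of n pairs, each holding a line number (from 1 to the number of lines) and the corresponding value in that line (for the selected column), then sort it randomly;
Find out how many lines should be randomized: num_random = percentage * num_lines / 100;
Select the first num_random entries from your randomized vector;
You may sort the selected lines randomly, but they should already be in random order;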
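
As a quick check of the second step: integer arithmetic truncates, so 40% of 5 lines selects 2 lines, while 40% of 4 lines selects only 1. A small sketch using shell arithmetic (the function name num_random is borrowed from the formula above purely for illustration):

```shell
# num_random PERCENTAGE NUM_LINES: how many lines get randomized (truncated)
num_random() {
    echo $(( $1 * $2 / 100 ))
}

num_random 40 5   # prints 2
num_random 40 4   # prints 1
num_random 40 2   # prints 0, so nothing would be randomized
```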

Printing the output:

i = 0
for num_line, value in column; do
    if num_line not in random_vector:
        print value; # printing non-randomized value
    else:
        print random_vector[i]; # randomized entry
        i++;
done

Implementation:

#!/bin/bash

infile=$1
col=$2
n_lines=$(wc -l < "$infile")
prob=$(bc <<< "$3 * ${n_lines} / 100")

# Selected lines: "line_number,value" pairs, in random order
tmp=$(mktemp)
paste -d ',' <(seq 1 "$n_lines") <(cut -d ',' -f "$col" "$infile") \
    | sort -R | head -n "$prob" > "$tmp"

# Rewriting file
awk -v "col=$col" -F "," '
# First file: remember the selected line numbers and the shuffled value order
(NR == FNR) {id[$1]; value[++n] = $2; next}
(FNR == 1) {c = 1}
{
    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "";
        # %s, not %c: %c with an empty delim can write a NUL byte
        if (i != col) {printf "%s%s", $i, delim; continue;}
        if (FNR in id) {printf "%s%s", value[c], delim; c++;}
        else {printf "%s%s", $i, delim;}
    }
    printf "\n";
}
' "$tmp" "$infile"

rm "$tmp"

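If you want something closer to an in-place approach, you can use sponge (from moreutils) to pipe the output back into the input file.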
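
With sponge installed, that would look like ./script.sh infile 3 40 | sponge infile. Where it is not available, a temporary file gives much the same effect. The helper below is a hypothetical sketch (in_place is not a standard command): it runs a command, and only if that command succeeds does it overwrite the target file with the captured output.

```shell
# in_place FILE COMMAND [ARGS...]: run COMMAND, then overwrite FILE with its output
in_place() {
    target=$1; shift
    tmp_out=$(mktemp) || return 1
    # Only touch FILE once the command has finished successfully
    if "$@" > "$tmp_out"; then
        cat "$tmp_out" > "$target"
    fi
    rm -f "$tmp_out"
}
```

For the script above the call would be: in_place infile ./script.sh infile 3 40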

Execution:

To run it, just use:

$ ./script.sh <inpath> <column> <percentage>

For example:

$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9

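Conclusion:

This lets you select a column, randomly reorder a percentage of the entries in that column, and put the new column back in place of the original one in the file.

This script stands out because it proves not only that shell scripting is extremely entertaining, but also that there are cases where it definitely should not be used. (: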

score 0 · Accepted Answer

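I'd use a 2-pass approach: first count the lines and read the file into an array, then use awk's rand() function to generate random numbers identifying the lines you'll change, then use rand() again to decide which pairs of those lines get swapped, and then swap the array elements before printing. Something like this pseudo-code, a rough algorithm: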

awk -F, -v pct=40 -v col=3 '
NR == FNR {
    array[++totNumLines] = $0
    next
}

FNR == 1{
    pctNumLines = totNumLines * pct / 100

    srand()

    for (i=1; i<=(pctNumLines / 2); i++) {
        oldLineNr = rand() * some factor to produce a line number that's in the 1 to totNumLines range but is not already recorded as processed in the "swapped" array.
        newLineNr = ditto plus must not equal oldLineNr

        swap field $col between array[oldLineNr] and array[newLineNr]

        swapped[oldLineNr]
        swapped[newLineNr]
    }
    # No "next" here: line 1 must fall through to the print rule below.
}

{ print array[FNR] }

' "$file" "$file" > tmp &&
mv tmp "$file"
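
A runnable sketch of the rough algorithm above, under a few assumptions of my own: it reads the file once and does the swapping in an END block rather than making two passes, it picks the lines with a partial Fisher-Yates shuffle instead of retrying rand() until an unused line number comes up, and the names swap_pct and join as well as the optional seed argument are hypothetical.

```shell
# swap_pct FILE COLUMN PERCENT [SEED]: print FILE with about PERCENT% of the
# values in COLUMN swapped pairwise between randomly chosen lines
swap_pct() {
    awk -F, -v OFS=, -v col="$2" -v pct="$3" -v seed="${4:-}" '
    { line[NR] = $0 }                          # hold the whole file in memory
    END {
        if (seed != "") srand(seed); else srand()
        # Partial Fisher-Yates: idx[1..k] ends up as k distinct line numbers
        for (i = 1; i <= NR; i++) idx[i] = i
        k = int(NR * pct / 100)
        for (i = 1; i <= k; i++) {
            j = i + int(rand() * (NR - i + 1))
            t = idx[i]; idx[i] = idx[j]; idx[j] = t
        }
        # Swap the column between consecutive pairs of chosen lines
        for (i = 1; i + 1 <= k; i += 2) {
            a = idx[i]; b = idx[i + 1]
            na = split(line[a], fa, FS); nb = split(line[b], fb, FS)
            t = fa[col]; fa[col] = fb[col]; fb[col] = t
            line[a] = join(fa, na); line[b] = join(fb, nb)
        }
        for (i = 1; i <= NR; i++) print line[i]
    }
    # Glue fields f[1..n] back together with OFS
    function join(f, n,    s, m) {
        s = f[1]
        for (m = 2; m <= n; m++) s = s OFS f[m]
        return s
    }' "$1"
}
```

Called as swap_pct "$file" 3 40, this swaps one pair of lines in a five-line file. If the two chosen lines happen to hold the same value in that column, the output looks unchanged. An odd number of chosen lines leaves the last one untouched.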
