python - What is an efficient way to modify a large file

Question

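I have a huge file (~2000000 lines) in which I am trying to replace several different patterns while reading the file only once.

So I guess sed is not a good fit, since I have several different patterns. I tried using awk with if/else, but the file was not changed: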

#!/usr/bin/awk -f
{

    if($0 ~ /data for AAA/)
    {

        sub(/^[0-9]+$/, "bla_AAA", $2)

    }
    if($0 ~ /data for BBB/)
    {

        sub(/^[0-9]+$/, "bla_BBB", $2)

    }


}

The result I expect is for

address 01000 data for AAA
....
address 02000 data for BBB
....

to become

address bla_AAA data for AAA
....
address bla_BBB data for BBB
....

score 1 · Accepted Answer

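I don't see anything in your question indicating that your file is actually large, since 2000000 lines is nothing and every sample line in your question is short, so this is all you need: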

awk '
/data for AAA/ { $2 = "bla_AAA" }
/data for BBB/ { $2 = "bla_BBB" }
{ print }
' file > tmp && mv tmp file

GNU awk has the -i inplace option to do the same kind of "in-place" editing as sed, perl, etc. (i.e. it uses a tmp file internally).
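Since the question is tagged python, here is a minimal Python sketch of the same temp-file pattern: write the modified lines to a temporary file, then swap it into place. The names rewrite_in_place and relabel are hypothetical helpers made up for illustration, and the field handling assumes lines are space-separated as in the sample data; it is not a description of what awk does internally.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Rewrite `path` line by line via a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f_out, open(path) as f_in:
            for line in f_in:
                f_out.write(transform(line))
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial temp file.
        os.unlink(tmp_path)
        raise


def relabel(line):
    """Hypothetical transform: set the second field based on the tag in the line."""
    fields = line.rstrip('\n').split(' ')
    for tag in ('AAA', 'BBB'):
        if 'data for ' + tag in line:
            fields[1] = 'bla_' + tag
    return ' '.join(fields) + '\n'
```

Calling rewrite_in_place('file', relabel) would then update the file in one pass. Like the shell version, this needs enough free space for a second copy of the file while it runs.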

If you really don't have enough storage to create a copy of the input file, then you could use something like this (untested!):

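headLines=10000
beg=1
tmp=$(mktemp) || exit 1
while [ -s file ]; do
    # Process the first chunk of lines, then remove that chunk from the input.
    head -n "$headLines" file | awk 'above script' >> "$tmp" &&
    headBytes=$(head -n "$headLines" file | wc -c) &&
    dd if=file bs="$headBytes" skip=1 conv=notrunc of=file &&
    truncate -s "-$headBytes" file
    rslt=$?
    # Stop on the first failure instead of looping forever.
    (( rslt == 0 )) || break
done
(( rslt == 0 )) && mv "$tmp" file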

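That way you never use more storage than the size of the input file plus headLines lines (massage that number to suit). See https://stackoverflow.com/a/17331179/1745001 for information on what the truncate and the two lines before it are doing.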

score 0 · Accepted Answer

Something like this:

(read a line, do the text manipulation, write the modified data to the output file)

with open('in.txt') as f_in, open('out.txt', 'w') as f_out:
    for line in f_in:
        fields = line.rstrip('\n').split(' ')
        # Only rewrite lines shaped like "address 01000 data for AAA".
        if len(fields) >= 5 and fields[2:4] == ['data', 'for']:
            fields[1] = 'bla_{}'.format(fields[4])
        f_out.write(' '.join(fields) + '\n')
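If splitting on spaces is too loose, a single compiled regular expression can handle all the patterns in one pass over the file. The sketch below is one possible variant, assuming the lines look exactly like the samples in the question ("address", digits, "data for", then AAA or BBB); LINE_RE, relabel_line and relabel_file are hypothetical names, and the list of tags would need to be extended for real data.

```python
import re

# Matches "address <digits> data for <TAG>" and captures the text around the digits.
LINE_RE = re.compile(r'^(address )\d+( data for (AAA|BBB))$')


def relabel_line(line):
    """Replace the numeric address with bla_<TAG>; other lines pass through unchanged."""
    body = line.rstrip('\n')
    new_body = LINE_RE.sub(
        lambda m: m.group(1) + 'bla_' + m.group(3) + m.group(2), body)
    return new_body + '\n'


def relabel_file(src, dst):
    """Stream src to dst one line at a time, so memory use stays small."""
    with open(src) as f_in, open(dst, 'w') as f_out:
        for line in f_in:
            f_out.write(relabel_line(line))
```

For example, relabel_file('in.txt', 'out.txt') reads the input once and writes the rewritten lines to out.txt. Lines that do not match the pattern, such as the "...." lines, are copied as they are.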
