python - Split a sorted file on value change using Python

Question

I'm new to Python. My requirement (which would be simple if I had to use awk) is as follows.

The file shown below (test.txt) is tab-separated:

1 a b c
1 a d e
1 b d e
2 a b c
2 a d e
3 x y z

Desired output

File 1.txt should contain the following values

a b c
a d e
b d e

File 2.txt should contain the following values

a b c
a d e

File 3.txt should contain the following values

x y z

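The original file is sorted by the first column. I don't know the line numbers at which I have to split; the split has to happen wherever the value changes. With awk, I would write: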

awk -F"\t" 'BEGIN {OFS="\t";} {print $2","$3","$4 > $1}' test.txt

(Performance-wise, would Python be better?)

score 1 · Accepted Answer

awk is well suited to this and should be faster. Is speed really an issue? How large is your input?

$ awk '{print $2,$3,$4 > ("file"$1)}' OFS='\t' file

Demo:

$ ls
file

$ cat file
1 a b c
1 a d e
1 b d e
2 a b c
2 a d e
3 x y z

$ awk '{print $2,$3,$4 > ("file"$1)}' OFS='\t' file

$ ls
file  file1  file2  file3

$ cat file1
a b c
a d e
b d e

$ cat file2 
a b c
a d e

$ cat file3
x y z

score 0 · Accepted Answer

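If you have a very large file, awk will be opening and closing a file on every line to perform the append, won't it? If that is a problem, C++ has the speed and the container classes to handle an arbitrary number of open output files nicely, so that each file is opened and closed only once. This is tagged Python, though, which will be almost as fast, assuming I/O time dominates.

A Python version that avoids the extra open/close overhead: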

# iosplit.py

def iosplit(ifile, ifname="", prefix=""):
    ofiles = {}
    try:
        for iline in ifile:
            tokens = [s.strip() for s in iline.split('\t')]
            if tokens and tokens[0]:
                ofname = prefix + str(tokens[0]) + ".txt"
                if ofname in ofiles:
                    ofile = ofiles[ofname]
                else:
                    ofile = open(ofname, "w+")
                    ofiles[ofname] = ofile
                ofile.write( '\t'.join(tokens[1:]) + '\n')
    finally:
        for ofname in ofiles:
            ofiles[ofname].close()

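if __name__ == "__main__":
    import sys
    # Fall back to the defaults when an argument is missing.
    ifname = (sys.argv + ["test.txt"])[1]
    prefix = (sys.argv + ["", ""])[2]
    # Close the input file as well once the split is done.
    with open(ifname) as ifile:
        iosplit(ifile, ifname, prefix)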

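Command-line usage is python iosplit.py [input file] [prefix]

The input file defaults to test.txt. The prefix defaults to an empty string and is prepended to each output file name. The calling program supplies a file (or file-like object), so you can drive it with a StringIO object or even a list/tuple of strings.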
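As a rough illustration of driving the function from in-memory input, here is a minimal sketch. It restates iosplit in compact form (without the unused ifname parameter) so that it runs on its own; the read helper, the temporary output directory, and the sio_ and list_ prefixes are assumptions made for the example, not part of the answer.

```python
import io
import os
import tempfile


def iosplit(ifile, prefix=""):
    """Compact restatement of the answer's iosplit, for a standalone demo."""
    ofiles = {}
    try:
        for iline in ifile:
            tokens = [s.strip() for s in iline.split('\t')]
            if tokens and tokens[0]:
                ofname = prefix + tokens[0] + ".txt"
                # Open each output file only once and reuse the handle.
                if ofname not in ofiles:
                    ofiles[ofname] = open(ofname, "w")
                ofiles[ofname].write('\t'.join(tokens[1:]) + '\n')
    finally:
        for ofile in ofiles.values():
            ofile.close()


def read(path):
    """Hypothetical helper: return the whole content of a small file."""
    with open(path) as f:
        return f.read()


# Write the demo output into a scratch directory instead of the current one.
outdir = tempfile.mkdtemp()

# 1) A file-like object: iterating a StringIO yields lines, like a real file.
iosplit(io.StringIO("1\ta\tb\tc\n1\ta\td\te\n2\ta\tb\tc\n"),
        os.path.join(outdir, "sio_"))

# 2) A plain list of strings works too, since the function only iterates.
iosplit(["3\tx\ty\tz\n"], os.path.join(outdir, "list_"))

sio_1 = read(os.path.join(outdir, "sio_1.txt"))
sio_2 = read(os.path.join(outdir, "sio_2.txt"))
list_3 = read(os.path.join(outdir, "list_3.txt"))
print(repr(sio_1), repr(sio_2), repr(list_3))
```

Because the function only iterates over its first argument, anything that yields lines can serve as input, which also makes it easy to test without touching real input files.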

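Warning: this strips all whitespace immediately before or after the tabs in a line. Internal spaces are not touched. So "1\ta b \tc \t d" is converted to "a b\tc\td" when written to 1.txt.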
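The stripping behaviour can be checked in isolation. The sketch below mirrors the tokenizing step of iosplit; the names strip_fields and preserve_fields are hypothetical, and the second function is only one possible alternative (using str.partition) for cases where the whitespace must survive.

```python
def strip_fields(line):
    """Mirror iosplit's tokenizing: strip whitespace around every field."""
    tokens = [s.strip() for s in line.split('\t')]
    return '\t'.join(tokens[1:])


def preserve_fields(line):
    """Alternative sketch: cut off only the first column, keep the rest verbatim."""
    _key, _sep, rest = line.partition('\t')
    return rest


sample = "1\ta b \tc \t d\n"
converted = strip_fields(sample)
preserved = preserve_fields(sample)
print(repr(converted))
print(repr(preserved))
```

Note that strip_fields also drops the trailing newline, which is why iosplit appends '\n' itself, while preserve_fields leaves the remainder of the line exactly as it was read.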

score 0 · Accepted Answer

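My version:

# Append each line (minus its first column) to a file named after that column.
with open('test.txt', 'r') as in_file:
    for line in in_file:
        fields = line.split('\t')  # the input is tab-separated
        doc_name = fields[0]
        content = '\t'.join(fields[1:])  # still ends with the newline

        # Append mode: rerunning the script adds to existing output files.
        with open('file' + doc_name, 'a') as f:
            f.write(content)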

score 0 · Accepted Answer

Something like this should do what you want.

import itertools as it

with open('test.txt') as in_file:
    splitted_lines = (line.split(None, 1) for line in in_file)
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
        with open(num + '.txt', 'w') as out_file:
            out_file.writelines(line for _, line in group)

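The with statement allows resources to be used safely. In this case, it closes the files automatically.
The splitted_lines = (...) line creates an iterable that takes each line and yields a pair: the first field and the rest of the line.
The itertools.groupby function does most of the work. It iterates over the lines of the file and groups them by their first element.
(line for _, line in group) iterates over the "split lines". It simply drops the first element and writes only the rest of the line. (_ is an identifier like any other. I could have used x or first, but I often use _ to denote something you have to assign but do not use.)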
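To make the grouping step concrete, here is a small sketch that runs the same pipeline on an in-memory list instead of a file. The sample lines reproduce the question's data; the names lines, grouped, and unsorted_keys are assumptions for the illustration.

```python
import itertools as it

# Same content as the question's test.txt, with real tab separators.
lines = [
    "1\ta\tb\tc\n",
    "1\ta\td\te\n",
    "1\tb\td\te\n",
    "2\ta\tb\tc\n",
    "2\ta\td\te\n",
    "3\tx\ty\tz\n",
]

# Each line becomes a pair: [first field, rest of the line].
splitted_lines = (line.split(None, 1) for line in lines)

# Collect the "rest" parts per key instead of writing them to files.
grouped = {
    num: [rest for _, rest in group]
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0])
}
print(grouped)

# groupby only merges *consecutive* equal keys, so unsorted input
# produces the same key more than once.
unsorted_keys = [key for key, _ in it.groupby(["1", "2", "1"])]
print(unsorted_keys)
```

The second part shows why the approach relies on the input being sorted by the first column: with unsorted input, a key that reappears would reopen its output file in 'w' mode and overwrite the earlier lines.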

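We can probably simplify the code. For example, the outermost with is unlikely to be useful, since we only open the file in read mode and do not modify it. Removing it lets us drop one level of indentation: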

import itertools as it

splitted_lines = (line.split(None, 1) for line in open('test.txt'))
for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    with open(num + '.txt', 'w') as out_file:
        out_file.writelines(line for _, line in group)

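I ran a very simple benchmark to compare the Python solution with the awk solution. Performance is roughly the same, with Python slightly faster, on a file with 10 fields per line and 100 "line groups", each of a random size between 2 and 30 lines.

Timing of the Python code: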

In [22]: from random import randint
    ...: 
    ...: with open('test.txt', 'w') as f:
    ...:     for count in range(1, 101):
    ...:         num_nums = randint(2, 30)
    ...:         for time in range(num_nums):
    ...:             numbers = (str(randint(-1000, 1000)) for _ in range(10))
    ...:             f.write('{}\t{}\n'.format(count, '\t'.join(numbers)))
    ...:             

In [23]: %%timeit
    ...: splitted_lines = (line.split(None, 1) for line in open('test.txt'))
    ...: for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    ...:     with open(num + '.txt', 'w') as out_file:
    ...:         out_file.writelines(line for _, line in group)
    ...: 
10 loops, best of 3: 11.3 ms per loop

awk timing:

$ time awk '{print $2,$3,$4 > ("test"$1)}' OFS='\t' test.txt

real    0m0.014s
user    0m0.004s
sys     0m0.008s

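Note that 0.014s is about 14 ms.

In any case, the timings may vary with the OS load, and in practice the two are equally fast. Almost all of the time is spent reading and writing files, which both Python and awk do efficiently. I believe you would not see a huge speedup by using C.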
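For anyone who wants to repeat the measurement outside IPython, a standalone sketch along the same lines follows. It assumes a temporary working directory, a fixed count of 10 runs, and time.perf_counter as the timer; the function names make_input and split_file are hypothetical, and the numbers it prints will depend on the machine and its load.

```python
import itertools as it
import os
import tempfile
import time
from random import randint


def make_input(path, groups=100):
    """Write `groups` line groups of 2 to 30 lines, 10 random fields each."""
    with open(path, 'w') as f:
        for count in range(1, groups + 1):
            for _row in range(randint(2, 30)):
                numbers = (str(randint(-1000, 1000)) for _ in range(10))
                f.write('{}\t{}\n'.format(count, '\t'.join(numbers)))


def split_file(path, outdir):
    """The groupby-based split from the answer, writing into `outdir`."""
    with open(path) as in_file:
        splitted_lines = (line.split(None, 1) for line in in_file)
        for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
            with open(os.path.join(outdir, num + '.txt'), 'w') as out_file:
                out_file.writelines(line for _, line in group)


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'test.txt')
make_input(src)

runs = 10  # arbitrary choice for this sketch
start = time.perf_counter()
for _run in range(runs):
    split_file(src, workdir)
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print('mean over {} runs: {:.2f} ms'.format(runs, elapsed_ms))
```

This reports a mean over all runs rather than the best of three loops that %%timeit prints, so the two figures are not directly comparable.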
