python - What is the fastest way to find the lines unique to large file A, compared with large file B, using Python?

Question

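I have a txt file A with 300,000+ lines and a txt file B with 600,000+ lines. What I want to do is go through file A line by line, and if a line does not appear in file B, append it to file C.

The problem is that if I program it exactly as described above, it takes a really long time to finish. Is there a better way to do this?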

score 14 · Accepted Answer

This should be quite fast:

with open("a.txt") as a:
    with open("b.txt") as b:
        with open("c.txt", "w") as c:
            c.write("".join(set(a) - set(b)))

Note that this ignores the order of lines in both A and B. If you absolutely need to preserve the order from A, you can use this instead:

with open("a.txt") as a:
    with open("b.txt") as b:
        with open("c.txt", "w") as c:
            b_lines = set(b)
            c.write("".join(line for line in a if line not in b_lines))
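
One caveat with comparing raw lines: iterating over a file yields each line with its trailing newline, so a final line that lacks one will not match the same text elsewhere. Below is a minimal sketch of the order-preserving version that strips the newline before comparing. The function name `lines_only_in_a` and its path parameters are assumptions made for illustration, not part of the answer above.

```python
def lines_only_in_a(a_path, b_path, c_path):
    """Write every line of a_path that is absent from b_path to c_path, keeping A's order.

    Returns the number of lines written. Hypothetical helper for illustration.
    """
    with open(b_path) as b:
        # Strip the trailing newline so a last line without "\n" still compares equal.
        b_lines = {line.rstrip("\n") for line in b}

    kept = 0
    with open(a_path) as a, open(c_path, "w") as c:
        for line in a:
            text = line.rstrip("\n")
            if text not in b_lines:  # average O(1) set lookup
                c.write(text + "\n")
                kept += 1
    return kept
```

Calling `lines_only_in_a("a.txt", "b.txt", "c.txt")` would then play the same role as the snippet above, at the cost of one extra string operation per line.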

score 1 · Accepted Answer

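Can you hold B in memory? If so, read file B and build an index of all the lines it contains. Then read A line by line and check whether each line appears in your index.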

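with open("B") as f:
    B = set(f)  # index of every line in file B

with open("A") as f:
    for line in f:  # iterate lazily instead of loading all of A at once
        if line not in B:
            print(line, end="")  # each line already ends with a newline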

score 0 · Accepted Answer

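I don't know anything about Python, but: how about sorting file A into a specific order? Then you can go through file B line by line and do a binary search, which is more efficient.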
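
The sort-and-search idea can be sketched in Python with the standard `bisect` module. This is only one possible reading of the suggestion: the function name `unmatched_by_binary_search` and the `found` bookkeeping list are assumptions for illustration, and the result comes back in sorted order rather than in A's original order.

```python
from bisect import bisect_left, bisect_right


def unmatched_by_binary_search(a_lines, b_lines):
    """Return the lines of a_lines, in sorted order, that never occur in b_lines."""
    sorted_a = sorted(a_lines)       # O(n log n) sort of file A's lines
    found = [False] * len(sorted_a)  # found[i] becomes True once sorted_a[i] is seen in B

    for line in b_lines:
        lo = bisect_left(sorted_a, line)   # first index where line could sit
        hi = bisect_right(sorted_a, line)  # one past the last equal entry
        for i in range(lo, hi):            # empty range when line is not in A
            found[i] = True

    return [line for line, hit in zip(sorted_a, found) if not hit]
```

Each lookup costs O(log n), so the whole pass is roughly O((n + m) log n) for n lines in A and m lines in B. A set-based lookup, as in the other answers, avoids the sort and is usually simpler in Python.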

score 0 · Accepted Answer

Read all the lines in file B into a set:

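with open("b.txt") as file_b:
    blines = set(file_b)  # every line of B, for fast membership tests
with open("a.txt") as file_a, open("c.txt", "w") as file_c:
    for line in file_a:
        if line not in blines:
            file_c.write(line)  # append the line to file C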

600k+ lines isn't really that much data...
