python - .append() 耗时吗？

Question

这些天我一直在处理巨大的文本文件。有时我需要删除行。我的做法如下：

f=open('txt','r').readlines()
list=[]
for line in f:
    if blablablabla:
       list.append(line)

我知道对于大文件， .readlines() 是限速步骤，但是 .append() 步骤呢？在 readlines 之后追加是否会花费大量额外的时间？如果是这样，也许我应该想办法直接删除我不想要的行，而不是附加我想要的行。

谢谢

score 5 · Accepted Answer

readlines()如果您稍后要过滤它，为什么要在 using 中读取整个文件？只需遍历它即可保存您想要保留的行。您可以使用列表理解将其减少到几行：

with open('txt', 'r') as f:
    myList = [ line for line in f if blablablabla ]

score 2 · Accepted Answer

作为一般提示，请改为执行此操作，无需在迭代之前先读取完整文件......

with open('txt') as fd:
    for line in fd:
        if blablabla:
            my_list.append(line)

并且不要将列表称为“列表”...

score 1 · Accepted Answer

在这篇文章中，我试图解释列表的工作方式以及为什么 append 不是很昂贵。我还在底部发布了一个解决方案，您可以使用它来删除行。

Python 列表的结构就像一个节点网络：

>>> class ListItem:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next
        def __repr__(self):
            return "Item: %s"%self.value


>>> ListItem("a", ListItem("b", ListItem("c")))
Item: a
>>> mylist = ListItem("a", ListItem("b", ListItem("c")))
>>> mylist.next.next
Item: c

因此， append 基本上就是这样：

ListItem(mynewvalue, oldlistitem)

Append 没有太多开销，但insert()另一方面需要您重建整个列表，因此会花费更多时间。

>>> from timeit import timeit
>>> timeit('a=[]\nfor i in range(100): a.append(i)', number=1000)
0.03651859015577941
>>> timeit('a=[]\nfor i in range(100): a.insert(0, i)', number=1000)
0.047090002177625934
>>> timeit('a=[]\nfor i in range(100): a.append(i)', number=10000)
0.18015429656996673
>>> timeit('a=[]\nfor i in range(100): a.insert(0, i)', number=10000)
0.35550057300308424

如您所见，插入要慢得多。如果我是你，我会删除你不需要的行，马上写回去。

with open("large.txt", "r") as fin:
    with open("large.txt", "w") as f:
        for line in fin:
            if myfancyconditionismet:
                # write the line to the file again
                f.write(line + "\n")
            # otherwise it is gone

有我的解释和解决方案。

-Sunjay03

score 1 · Accepted Answer

您应该使用列表理解，而不是 Jeff 的回答。根据您需要如何处理数据，您可以使用生成器表达式来代替。

回答关于 append() 的问题

Python 列表在末尾预先分配了一些额外的空间。这意味着追加非常快 - 直到您用完预先分配的空间。每当列表扩展时，都会分配一个新的内存块，并将所有引用复制到它。随着列表的增长，额外预分配空间的大小也会增长。这样做是为了使 append 摊销 O(1)。即追加的平均时间是快速且恒定的

score 0 · Accepted Answer

也许你想把它全部拉到内存中，然后对其进行操作。也许一次在一条线上操作更有意义。从您的解释中不清楚哪个更好。

无论如何，无论您采用哪种方法，这里都有相当标准的代码：

# Pull one line into memory at a time
with open('txt','r') as f:
    lineiter = (line for line in f if blablablabla)
    for line in lineiter:
        # Do stuff

# Read the whole file into memory then work on it
with open('txt','r') as f:
    lineiter = (line for line in f if blablablabla)
    mylines = [line for line in lineiter]

如果您走前一条路线，我建议您阅读生成器。Dave Beazley 有一篇很棒的关于生成器的文章，名为“系统程序员的生成器技巧”。强烈推荐。

python - .append() 耗时吗？

5 回答 5

Related

Reference