python - Deleting lines in Python after I've written them

Question

OK, this is my existing code:

////////////// = []
for line in datafile:
    splitline = line.split()
    for item in splitline:
        if not item.endswith("JAX"):
            if item.startswith("STF") or item.startswith("BRACKER"):
                //////////.append( item )


for line in //////////
    print /////////////
    /////////// +=1
    for t in//////
        if t in line[:line.find(',')]:
            line = line.strip().split(',')
            ///////////////write(','.join(line[:3]) + '\n')
            break

/////////////.close()
/////////////close()
///////////.close()

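I'd like to optimize this further. The file is really big. Once lines have been matched and written to the small file, I'd like to delete them from the big file, to reduce the time it takes to search it. Any suggestions on how I should go about this?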

score 1 · Accepted Answer

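You can't delete lines from a text file in place: doing so would mean shifting all the data after the deleted line to fill the gap, which is very inefficient.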

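One approach is to write a temporary file containing all the lines you want to keep from bigfile.txt, and once processing is finished, delete bigfile.txt and rename the temporary file to replace it.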
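The temporary-file approach could be sketched roughly as follows. This is only an illustration under assumptions: the function name `split_matches`, its parameters, and the matching rule (a substring test on the text before the first comma, borrowed from the question's loop) are hypothetical, and `os.replace` is used here to combine the delete-and-rename step into one call.

```python
import os
import tempfile


def split_matches(big_path, small_path, prefixes):
    """Move matching lines out of big_path (hypothetical helper).

    A line matches when the text before its first comma contains any
    string in prefixes. The first three fields of each matching line are
    appended to small_path; every other line is kept in big_path.
    Returns the number of lines moved.
    """
    directory = os.path.dirname(os.path.abspath(big_path))
    # Create the temp file next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    moved = 0
    try:
        with os.fdopen(fd, "w") as keep, \
                open(big_path) as source, \
                open(small_path, "a") as small:
            for line in source:
                key = line[:line.find(",")]
                if any(p in key for p in prefixes):
                    fields = line.strip().split(",")
                    small.write(",".join(fields[:3]) + "\n")
                    moved += 1
                else:
                    keep.write(line)
        # Swap the filtered copy into place of the original file.
        os.replace(tmp_path, big_path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return moved
```

Each pass still reads the whole big file once, so this only pays off if the file is going to be searched again afterwards.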

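Alternatively, if bigfile.txt is small enough to fit in memory, you can read the whole file into a list, remove the lines from the list, and then write the list back to disk.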
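A minimal sketch of the in-memory variant, assuming the file really does fit in RAM. The helper name `remove_lines_in_memory` and the `should_remove` callback are made up for illustration and do not come from the question.

```python
def remove_lines_in_memory(path, should_remove):
    """Rewrite path without the lines for which should_remove(line) is true.

    Hypothetical helper: loads the entire file into a list, so it is only
    suitable for files that fit comfortably in memory.
    Returns the number of lines removed.
    """
    with open(path) as f:
        lines = f.readlines()  # the whole file is held in memory here
    kept = [line for line in lines if not should_remove(line)]
    with open(path, "w") as f:
        f.writelines(kept)
    return len(lines) - len(kept)
```

For example, `remove_lines_in_memory("bigfile.txt", lambda line: line.startswith("STF"))` would drop every line that begins with `STF`.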

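I'd also guess from your code that bigfile.txt is some kind of CSV file. If so, it would probably be better to convert it to a database file and query it with SQL. Python ships with a built-in SQLite module (sqlite3), and third-party libraries exist for most other databases.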
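As a rough sketch of the SQLite route using the standard `csv` and `sqlite3` modules: the table name `records`, the column names, and the three-column layout are assumptions made for illustration (the question only shows that the first three comma-separated fields are kept), and the lookup is an exact match on the first field rather than the substring test used in the question.

```python
import csv
import sqlite3


def load_csv(conn, csv_path):
    """Load the first three CSV columns into an indexed table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (code TEXT, col2 TEXT, col3 TEXT)")
    with open(csv_path, newline="") as f:
        # Skip rows with fewer than three fields; keep only the first three.
        rows = (row[:3] for row in csv.reader(f) if len(row) >= 3)
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    # The index lets exact-match lookups avoid scanning the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_code ON records (code)")
    conn.commit()


def find_by_codes(conn, codes):
    """Return rows whose first field exactly equals one of the given codes."""
    found = []
    for code in codes:
        found.extend(conn.execute(
            "SELECT code, col2, col3 FROM records WHERE code = ?", (code,)))
    return found
```

With a connection such as `sqlite3.connect("bigfile.db")`, the CSV is loaded once and later searches become indexed queries. A substring search would need something like `LIKE '%...%'`, which cannot use this index.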

score 0 · Accepted Answer

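As I said in a comment, it seems to me that the size of the "big file" shouldn't slow down the rate at which the count increases. When you iterate over a file like this, Python just reads it sequentially, one line at a time.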

The optimizations you can make at this point depend on how big matchedLines is, and on how the strings in matchedLines relate to the text you're searching.

If matchedLines is large, you can save time by doing the 'find' only once per line:

for line in completedataset:
    text = line[:line.find(',')]
    for t in matchedLines:
        if t in text:
            line = line.strip().split(',')
            smallerdataset.write(','.join(line[:3]) + '\n')
            break

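In my tests, the 'find' takes about 300 nanoseconds, so if matchedLines is a few million items long, repeating it for every item adds up to roughly an extra second.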
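To check the cost of the 'find' on your own machine, a small `timeit` sketch along these lines could be used. The sample line and the helper name `find_seconds` are made up, and the measured figure includes the overhead of calling the lambda, so treat it as an upper bound rather than an exact cost.

```python
import timeit

# Made-up sample line, standing in for one line of the real data file.
sample_line = "STF12345,some,other,fields\n"


def find_seconds(sample, repeats=1_000_000):
    """Average seconds per sample.find(',') call over `repeats` runs."""
    total = timeit.timeit(lambda: sample.find(","), number=repeats)
    return total / repeats


if __name__ == "__main__":
    per_call = find_seconds(sample_line)
    print("about %.0f ns per find" % (per_call * 1e9))
```

Multiplying the per-call time by `len(matchedLines)` gives a rough idea of what hoisting the call out of the inner loop saves per line.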

If you're looking for exact matches rather than substring matches, you can speed things up by using a set:

matchedLines = set(matchedLines)
for line in completedataset:
    target = line[:line.find(',')]
    ## One lookup and you're done!
    if target in matchedLines:
        line = line.strip().split(',')
        smallerdataset.write(','.join(line[:3]) + '\n')

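If the target text that doesn't match tends to look completely different from the text that does (for example, most targets are random strings and matchedLines is a bunch of names), and every entry in matchedLines is longer than some minimum length, you could try to get really clever by checking substrings. Suppose every entry in matchedLines is at least 5 characters long...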

def subkeys(s):
    ## e.g. if len(s) is 7, return s[0:5], s[1:6], s[2:7].
    return [s[i:i+5] for i in range(len(s) + 1 - 5)]

existing_subkeys = set()
for line in matchedLines:
    existing_subkeys.update(subkeys(line))

for line in completedataset:
    target = line[:line.find(',')]
    might_match = False
    for subkey in subkeys(target):
        if subkey in existing_subkeys:
            might_match = True
            break
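    if might_match:
        # Then we have to do the old slow way.
        for matchedLine in matchedLines:
            if matchedLine in target:
                # Do the split and write, as in the earlier loops.
                line = line.strip().split(',')
                smallerdataset.write(','.join(line[:3]) + '\n')
                break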

But it's easy to outsmart yourself trying to do things like this; whether it helps depends on what your data looks like.
