python - Removing parts of a large string given a list of indices

Question

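This looks like a simple problem: given a string (a huge one, read from a large file), how do I remove the part of the string between a pair of indices (in practice, a list of index pairs)?

For example, removeByIndex("Text aaa bbb", [(0,1), (5, 9)]) would return "ext bbb".

The index ranges do not overlap.

Since the content is fairly large (up to several hundred megabytes), the solution has to be efficient.

Summary: whichever way you do it, the solution seems to involve creating a new string by walking the list of index pairs manually and appending the parts that are not covered by the list.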

score 6 · Accepted Answer

Don't worry about performance until you have proven that it is a bottleneck.

s = s[:i] + s[j:]

If that is not fast enough, neither Python nor C will make it faster; you have to choose a better data structure.
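
The one-line slice above handles a single pair. A minimal sketch of how it could be applied to a whole list of pairs, assuming the pairs are end-exclusive and non-overlapping as in the question; the name `remove_by_index_naive` is made up for illustration. Each step copies the whole string, so this is only reasonable when the list of pairs is short.

```python
def remove_by_index_naive(s, pairs):
    """Remove each end-exclusive (start, end) range from s, one slice at a time."""
    # Go from the rightmost pair to the leftmost so that removing a range
    # never shifts the indices of the ranges still to be processed.
    for start, end in sorted(pairs, reverse=True):
        s = s[:start] + s[end:]
    return s


print(remove_by_index_naive("Text aaa bbb", [(0, 1), (5, 9)]))
```

Working from right to left is the design choice that matters here: going left to right would require adjusting every later index by the number of characters already removed.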

score 4 · Accepted Answer

Assuming your (start, end) pairs are end-exclusive, I would do it like this (I have embedded a scalable test case so that you can run some performance tests):

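N = 100000
# Test string: the lowercase alphabet repeated up to N characters.
s = ''.join([chr(c % 26 + ord('a')) for c in range(N)])

# End-exclusive (start, end) pairs: drop the first 3 letters of every 26.
l = [(26*i, 26*i + 3) for i in range(N // 26)]

# Sort by start index (cmp-style comparators no longer exist in Python 3).
l.sort(key=lambda pair: pair[0])

ns = []
i = 0
for (start, end) in l:
    ns.append(s[i:start])
    i = end
# Use i rather than end so that an empty list l still works.
ns.append(s[i:])
s = ''.join(ns)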

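With N = 100 000 000 (the order of magnitude of your string), this script runs in about 30 seconds. That is slow, but probably tolerable. Of course, the right data structure for this particular problem is a rope. So if you need to do a large number of runs, you should probably either move away from Python or use a suitable data structure within Python.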
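
The same slice-and-join idea can be wrapped in a function matching the question's example. This is a minimal sketch, assuming end-exclusive, non-overlapping pairs; the name `remove_by_index` and its signature are illustrative and not taken from any library.

```python
def remove_by_index(s, pairs):
    """Return s without the end-exclusive, non-overlapping (start, end) ranges."""
    kept = []
    pos = 0  # first index that has not yet been copied or skipped
    for start, end in sorted(pairs):
        kept.append(s[pos:start])  # text between the previous range and this one
        pos = end
    kept.append(s[pos:])  # tail after the last removed range
    return "".join(kept)


print(remove_by_index("Text aaa bbb", [(0, 1), (5, 9)]))
```

Collecting the kept slices in a list and joining once copies each surviving character a constant number of times, instead of rebuilding the whole string for every pair.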

score 1 · Accepted Answer

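def grouped(iterable, n):
    # Yield consecutive non-overlapping n-tuples: (s0, s1), (s2, s3), ...
    # (Python 3's built-in zip replaces itertools.izip.)
    return zip(*[iter(iterable)] * n)

big_str = "12345893483104921420948124"
indexes = [2, 4, 5, 7]  # flat list of boundaries: remove [2:4] and [5:7]
# if needed, indexes = sorted(indexes)
indexes.insert(0, 0)
indexes.append(len(big_str))
# The pairs are now the slices to keep: (0, 2), (4, 5), (7, len(big_str)).
# Join once instead of concatenating inside the loop.
sm_str = "".join(big_str[a:b] for a, b in grouped(indexes, 2))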

score 0 · Accepted Answer

How fast do you need it to be? Try:

In [9]: import string

In [10]: import random

In [11]: huge=''.join(random.choice(string.ascii_lowercase) for x in range(10000))

In [12]: len(huge)
Out[12]: 10000

In [13]: not_sohuge=huge[0:5000]+huge[6000:]

In [14]: len(not_sohuge)
Out[14]: 9000

Some timings:

$ python -m timeit -s 'import random; import string; huge="".join(random.choice(string.ascii_lowercase) for x in range(10000))'  'not_sohuge=huge[0:5000]+huge[6000:]'
100000 loops, best of 3: 2.96 usec per loop
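
The same measurement can be taken from inside a script with the standard `timeit` module. This is a rough sketch; the helper name `cut` and the run count are arbitrary choices for illustration, and the number it prints depends on the machine.

```python
import random
import string
import timeit

# 10000 random lowercase letters, as in the session above.
huge = "".join(random.choice(string.ascii_lowercase) for _ in range(10000))


def cut(s=huge):
    # Drop the characters at indices 5000..5999.
    return s[0:5000] + s[6000:]


runs = 10000
# timeit.timeit returns the total seconds for all runs; divide for per-call time.
per_call = timeit.timeit(cut, number=runs) / runs
print(f"{per_call * 1e6:.2f} usec per call")
```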
