
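I have a fixed-width file containing a few hundred thousand wonky values. I want to find the strings in old_values and replace them with the strings at the corresponding positions in new_values. I could loop and do this one at a time, but I'm almost certain there is a faster way that I'm not expert enough to know about.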

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')  # and many more
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')  # and many more
file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

Looping through each value and running .replace on every line seems slow. For example:

for x in range(len(old_values)):
  # str.replace returns a new string, so the result has to be assigned back
  line = line.replace(old_values[x], new_values[x])

Any tips for speeding this up?


2 Answers


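The code below steps through the data character by character and replaces a value as soon as it finds a mapping for it. It does assume, though, that every value that needs replacing is absolutely unique.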

def replacer(instring, mapping):

    item = ''

    for char in instring:
        item += char
        yield item[:-5]
        item = item[-5:]
        if item in mapping:
            yield mapping[item]
            item = ''

    yield item


old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

result = ''.join(replacer(file_snippet, value_map))
print(result)

On your sample data, this gives:

0000000000000001   -0000000000000000000020000200000000000000000003   -10000100000000000000500000000000000000000000
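
The same single left-to-right pass can also be sketched with a compiled regular expression instead of a hand-written scanner. This is a swapped-in technique, not the generator above: the names `pattern` and `regex_replace` are made up for illustration, and it assumes, as above, that the old values do not overlap one another.

```python
import re

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

# One alternation covering every old value; re.escape keeps characters
# such as '}' from being interpreted as regex syntax.
pattern = re.compile('|'.join(re.escape(old) for old in old_values))


def regex_replace(line):
    # Hypothetical helper: scan the line once and swap each match
    # for its mapped value.
    return pattern.sub(lambda match: value_map[match.group(0)], line)


file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000'
print(regex_replace(file_snippet))
```

Whether this beats the generator or a plain .replace loop depends on how many values are in the mapping, so it is worth timing on real lines.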

If the data lends itself to it, a faster approach is to split the data into 5-character chunks:

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

result = []
for chunk in [ file_snippet[i:i+5] for i in range(0, len(file_snippet), 5) ]:
    if chunk in value_map:
        result.append(value_map[chunk])
    else:
        result.append(chunk)

result = ''.join(result)
print(result)

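On your sample data this results in no replacements, because the values do not start on 5-character boundaries. If you drop one leading zero, you get: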

000000000000001   -0000000000000000000020000200000000000000000003   -10000100000000000000500000000000000000000000

This shows the same replacements as above.
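
For a file of several gigabytes, the chunk lookup can be applied one line at a time so that the whole file never has to sit in memory. The following is only a sketch under assumptions: `replace_chunks` and `convert` are hypothetical names, every field is assumed to start on a 5-character boundary counted from the start of the line, and in-memory `io.StringIO` objects stand in for the real input and output files.

```python
import io

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))


def replace_chunks(line, mapping, width=5):
    # Look up each fixed-width field; fields without a mapping pass through unchanged.
    return ''.join(
        mapping.get(line[i:i + width], line[i:i + width])
        for i in range(0, len(line), width)
    )


def convert(source, target, mapping, width=5):
    # Iterating a file object yields one line at a time, so memory use is
    # bounded by the longest line rather than by the size of the file.
    for line in source:
        body = line.rstrip('\r\n')
        ending = line[len(body):]  # keep the original line ending intact
        target.write(replace_chunks(body, mapping, width) + ending)


# In-memory stand-ins for the real files (an assumption for this example).
source = io.StringIO('000010000}00002\n0000J00003\n')
target = io.StringIO()
convert(source, target, value_map)
print(target.getvalue())
```

With real data, `source` and `target` would be file objects returned by `open()`.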

answered 2013-09-07T19:42:04.300

Building a replacement mapping (a dict) makes things a bit faster:

import timeit

input_string = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000'
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
mapping = dict(zip(old_values,new_values))


def test_replace_tuples(input_string, old_values, new_values):
    # Index into both tuples in parallel
    for x in range(len(old_values)):
        input_string = input_string.replace(old_values[x], new_values[x])
    return input_string


def test_replace_mapping(input_string, mapping):
    # Iterate over the (old, new) pairs of the dict directly
    for k, v in mapping.items():
        input_string = input_string.replace(k, v)
    return input_string


print(timeit.Timer('test_replace_tuples(input_string, old_values, new_values)',
                   'from __main__ import test_replace_tuples, input_string, old_values, new_values').timeit(10000))

print(timeit.Timer('test_replace_mapping(input_string, mapping)',
                   'from __main__ import test_replace_mapping, input_string, mapping').timeit(10000))

Prints:

0.0547060966492
0.048122882843

Note that the results may differ for different inputs, so test on your real data.
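
Since the real lines are over 7K characters long, one way to get closer to realistic timings is to repeat the sample until it reaches that length. This is a rough sketch only: `long_line` is synthetic data built by repetition, `replace_with_mapping` is an illustrative name, and the timing it prints will vary by machine.

```python
import timeit

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
mapping = dict(zip(old_values, new_values))

snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000'

# Synthetic stand-in for a real line: repeat the sample past 7000 characters.
repeats = 7000 // len(snippet) + 1
long_line = snippet * repeats


def replace_with_mapping(line, mapping):
    # One full pass over the line per (old, new) pair.
    for old, new in mapping.items():
        line = line.replace(old, new)
    return line


seconds = timeit.timeit(lambda: replace_with_mapping(long_line, mapping), number=1000)
print('%.4f s for 1000 passes over a %d-char line' % (seconds, len(long_line)))
```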

answered 2013-09-07T19:44:47.623