python - How to extract the valid string from the start up to the first non-base64 character?

Question

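I have base64-encoded strings, but sometimes there is trailing garbage at the end, and that garbage always starts with a character that is not valid base64. How can I extract the valid string from the start up to the first non-base64 character?

For example: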

data = "(there  is more valid content)gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk=----------:jhawrewre:--\r\n"

The valid part is everything without "----------:jhawrewre:--\r\n":

valid = "(there  is more valid content)gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk="

score 1 · Accepted Answer

You can use a regular expression to remove the invalid part:

import re

invalid_tail = re.compile(r'[^a-zA-Z0-9+/=\n\r].*$')

def remove_tail(base64_value):
    return invalid_tail.sub('', base64_value)

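The negated class [^a-zA-Z0-9+/=\n\r] matches any character that is not part of the Base64 alphabet, the trailing = padding, or a newline or carriage return (the last two are allowed so that line-wrapped encoded values are kept). The .*$ that follows then consumes the rest of the tail.

Demo: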

>>> example = 'The quick brown fox jumps over the lazy dog!'.encode('base64')
>>> remove_tail(example + '*This is a tail').decode('base64')
'The quick brown fox jumps over the lazy dog!'

Or, using the decodable part of your sample:

>>> data = "3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk=----------:jhawrewre:--\r\n"
>>> remove_tail(data).decode('base64')
'\xdc\x04\x04\x14h|4X(\xa8<\x18\xb0\x04\x00,5E=YmaY\r\x19y]\t\x11Tl\xd4d'
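
The demos above rely on the Python 2 'base64' string codec. For Python 3, the same idea could be sketched with the base64 module as follows. This is only an illustration under assumptions, not part of the original answer: the names VALID_PREFIX and strip_tail are hypothetical, and the sketch matches the leading run of valid characters instead of substituting the tail away, so garbage that itself contains line breaks is still cut off at the first invalid character.

```python
import base64
import re

# Leading run of characters allowed in Base64 text: the alphabet,
# '=' padding, and CR/LF line breaks (hypothetical helper pattern).
VALID_PREFIX = re.compile(r'[A-Za-z0-9+/=\r\n]*')

def strip_tail(value):
    """Return the part of value before the first non-Base64 character."""
    # The pattern can match the empty string, so match() never returns None.
    return VALID_PREFIX.match(value).group()

original = b'The quick brown fox jumps over the lazy dog!'
encoded = base64.b64encode(original).decode('ascii')

# Append some garbage, strip it again, and decode what is left.
cleaned = strip_tail(encoded + '*This is a tail\r\n')
print(base64.b64decode(cleaned))
```

Because everything from the first invalid character onward is dropped, whitespace that belongs to the garbage is dropped with it.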

This solution easily beats the itertools.takewhile() option on speed:

>>> import timeit
>>> text = "gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk=----------:jhawrewre:--\r\n"
>>> timeit.timeit('test(text)', 'from __main__ import with_takewhile as test, text')
11.785380125045776
>>> timeit.timeit('test(text)', 'from __main__ import with_re as test, text')
1.480334997177124

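For this simple example, the regular expression is roughly 8 times faster (about 11.8 s versus 1.5 s); for longer text, the difference will be larger still.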
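
The timing session imports with_takewhile and with_re without showing them. A plausible reconstruction, written for Python 3, is sketched below. The two function bodies are assumptions based on the code in both answers, and the iteration count is chosen arbitrarily to keep the run short, so absolute numbers will differ from those quoted.

```python
import re
import timeit
from itertools import takewhile
from string import ascii_letters, digits

text = "gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk=----------:jhawrewre:--\r\n"

# Assumed definitions: the answer does not show these two functions.
valid_chars = ascii_letters + digits + '+/='
invalid_tail = re.compile(r'[^a-zA-Z0-9+/=\n\r].*$')

def with_takewhile(s):
    # Keep characters until the first one outside valid_chars.
    return "".join(takewhile(lambda c: c in valid_chars, s))

def with_re(s):
    # Delete the first invalid character and the rest of the tail.
    return invalid_tail.sub('', s)

if __name__ == '__main__':
    for func in (with_takewhile, with_re):
        seconds = timeit.timeit(lambda: func(text), number=10000)
        print(func.__name__, seconds)
```

Note that the two functions are not byte-for-byte equivalent on this input: the regular expression leaves the final "\n" in place, because $ also matches just before a trailing newline.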

score 1 · Accepted Answer

You can use itertools.takewhile:

Make an iterator that returns elements from the iterable as long as the predicate is true.

Demo:

>>> from itertools import takewhile
>>> from string import letters,digits
>>> valid_chars = letters + digits + '+/='
>>> text = "gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk=----------:jhawrewre:--\r\n"
>>> "".join(takewhile(lambda x:x in valid_chars, text))
'gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk='
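
string.letters exists only in Python 2. A Python 3 variant of this demo might look like the sketch below, which assumes the ASCII-only alphabet is what was intended. The names VALID_CHARS and valid_prefix are hypothetical, and the frozenset is a small design choice that makes each membership test constant-time.

```python
from itertools import takewhile
from string import ascii_letters, digits

# Base64 alphabet plus '=' padding, as in the demo above.
VALID_CHARS = frozenset(ascii_letters + digits + '+/=')

def valid_prefix(text):
    """Return the characters of text before the first one not in VALID_CHARS."""
    return "".join(takewhile(VALID_CHARS.__contains__, text))

text = "gw3AQEFGh8NFgoqDwYsAQALDVFPVltYVkNGXldCRFUbNRk=----------:jhawrewre:--\r\n"
print(valid_prefix(text))
```

Unlike the regular expression in the other answer, this character set does not include "\r" or "\n", so a line-wrapped Base64 value would be cut at its first line break.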
