python - How to replace a specific pattern in Python

Question

I want to replace every integer greater than 2147483647 that is followed by ^^<int> with its first 3 digits. For example, my original data is:

<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question"  <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.

I want to replace the original data with the following:

 <stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
 "Ask a Question"  <at> "255"^^<int> <stack_overflow> .
 <basic> "language" "89028899" <html>.

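My approach is to scan the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how to check whether the next part of the string is ^^<int>.

What I want to do: for numbers greater than 2147483647, e.g. 25500000000, replace them with the first 3 digits of the number. Since my data is 1 TB in size, a faster solution would be much appreciated.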

score 3 · Accepted Answer

Use the re module to construct a regular expression:

import re

regex = r"""
(")              # Opening quote of the literal, captured in group #1
(\d{10,})        # At least 10 digits (2147483647 has 10) into group #2
("\^\^<int>)     # Closing quote, two escaped circumflexes and the <int> tag
                 # into group #3
"""

COMPILED_REGEX = re.compile(regex, re.VERBOSE)

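Next, define a callback function to handle the replacement (the compiled pattern's sub method accepts a function, which is called once per match and returns the replacement text):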

def replace_callback(matches):
    matched_text = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs strings, so pass number_text rather than the int
        return matched_text.replace(number_text, number_text[:3], 1)
    else:
        return matched_text

Then find and replace:

fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
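
As a quick check of why both the 10-digit pattern and the numeric comparison are needed, here is a minimal self-contained sketch run on the question's sample data. The names INT_LITERAL, INT32_MAX and shorten are hypothetical, the pattern is a compact lookahead variant of the idea above, and the last two sample lines are made up to show the boundary case (2147483647 itself has 10 digits but must stay untouched).

```python
import re

# Hypothetical names for this sketch; a compact variant of the approach above.
INT_LITERAL = re.compile(r'"(\d{10,})"(?=\^\^<int>)')
INT32_MAX = 2147483647

def shorten(match):
    digits = match.group(1)
    # \d{10,} also matches 10-digit values at or below the limit, so compare.
    if int(digits) > INT32_MAX:
        return '"' + digits[:3] + '"'
    return match.group(0)

sample = (
    '<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".\n'
    '"Ask a Question"  <at> "25500000000"^^<int> <stack_overflow> .\n'
    '<basic> "language" "89028899" <html>.\n'
    '<a> <b> "2147483647"^^<int> .\n'   # made-up boundary line: not replaced
    '<a> <b> "2147483648"^^<int> .\n'   # made-up boundary line: replaced
)
result = INT_LITERAL.sub(shorten, sample)
print(result)
```

Only the literals followed by ^^<int> whose value exceeds the limit are shortened; everything else passes through unchanged.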

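If you have terabytes of data, you probably don't want to do this in memory. Instead, open the file and iterate over it, replacing the data line by line and writing it out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):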

# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
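
One possible speed-up, sketched here under assumptions rather than measured: read the file in binary mode so lines are never decoded to str, and run a cheap substring test before the regex so lines without ^^<int> skip it entirely. The names MARKER, INT_LITERAL, shorten and process_stream are hypothetical, and whether this is actually faster on your data would need benchmarking.

```python
import io
import re

# Hypothetical names; bytes patterns avoid decoding every line to str.
MARKER = b'^^<int>'
INT_LITERAL = re.compile(rb'"(\d{10,})"(?=\^\^<int>)')
INT32_MAX = 2147483647

def shorten(match):
    digits = match.group(1)
    if int(digits) > INT32_MAX:
        return b'"' + digits[:3] + b'"'
    return match.group(0)

def process_stream(src, dst):
    """Copy src to dst line by line, shortening oversized ^^<int> literals."""
    for line in src:
        # Cheap substring test first: lines without the marker skip the regex.
        if MARKER in line:
            line = INT_LITERAL.sub(shorten, line)
        dst.write(line)

# Usage with in-memory buffers; real files would be opened with "rb" / "wb".
src = io.BytesIO(
    b'"Ask a Question"  <at> "25500000000"^^<int> <stack_overflow> .\n'
    b'<basic> "language" "89028899" <html>.\n'
)
dst = io.BytesIO()
process_stream(src, dst)
print(dst.getvalue().decode())
```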

score 1 · Accepted Answer

If every line in your text file looks like your example, you can do this:

In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'

In [2079]: re.findall(r'\d+"\^\^', line)
Out[2079]: ['25500000000"^^']

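import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            # found looks like 25500000000"^^; the last 3 characters are "^^
            if int(found[:-3]) > 2147483647:
                # keep the trailing "^^ so the literal stays well-formed
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)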

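Because of the inner for loop, this may turn out to be an inefficient solution. However, I can't think of a better regex at the moment, so this should at least get you started.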
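
One way to avoid the inner loop, as a minimal sketch: keep the same pattern but split it into two groups and let re.sub call a function for each match, so each line is scanned once and only the matched text is rewritten. The names PATTERN, fix_line and shorten are hypothetical.

```python
import re

# Same pattern as in the answer, with the digits and the trailing "^^ in
# separate groups so the callback can rebuild the match.
PATTERN = re.compile(r'(\d+)("\^\^)')

def fix_line(line):
    def shorten(match):
        digits, suffix = match.groups()
        if int(digits) > 2147483647:
            return digits[:3] + suffix
        return match.group(0)
    return PATTERN.sub(shorten, line)

print(fix_line('"Ask a Question"  <at> "25500000000"^^<int> <stack_overflow> .'))
```

Unlike str.replace on the whole line, the callback only touches the text that was actually matched, so an identical digit run elsewhere on the line is left alone.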
