php - Replacing text in a buffered string

Question

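I'm working on a program that modifies the output of mysqldump as it is being generated. For this I currently have code that reads mysqldump's output in chunks of a fixed number of bytes. I need to be able to both regex match and regex replace on this text as it is read in (running a regex over the final text output is not possible, as the final file is many GB in size). I'm coding in PHP, though I believe the problem (and its solution) should be language-agnostic.

The pseudocode for what I have now looks like this: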

$previous_chunk = "";
while (!feof($reader)) {
    $chunk = fread($reader, 8192); // read a few thousand bytes from the stream
    $double_chunk = $previous_chunk . $chunk;
    // run the regular expressions on $double_chunk (to catch matches that span the chunk boundary)
    fwrite($output_file, $chunk);
    $previous_chunk = $chunk;
}

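This runs aground on two problems. The first is that every chunk is evaluated by the regexes twice, so a match that lies entirely within one chunk (rather than spanning a chunk boundary) triggers twice, even though the matching text appears only once. The second is that this still doesn't let me do replacements on the matches. The regex replaces text in $double_chunk, but I only write $chunk to the output file, and $chunk is unaffected by the replacement.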
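The first problem is easy to reproduce in isolation. The snippet below is only an illustration: the helper name count_matches_double_chunk and the sample string are made up for this example, and str_split stands in for the chunked reads from the stream.

```php
<?php
// Illustration only: scan overlapping double chunks the same way the loop
// above does and count how often the pattern is reported.
function count_matches_double_chunk(string $text, string $pattern, int $chunkLength): int
{
    $count = 0;
    $previous = '';
    foreach (str_split($text, $chunkLength) as $chunk) {
        // Each chunk is scanned once as the right half and once as the left half.
        $count += preg_match_all($pattern, $previous . $chunk);
        $previous = $chunk;
    }
    return $count;
}

$text = 'aaFOOaaaaaaa'; // contains FOO exactly once
echo count_matches_double_chunk($text, '/FOO/', 6), "\n"; // prints 2
```

The single FOO sits entirely inside the first chunk, yet it is reported on both passes that include that chunk.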

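One thought I had is that I don't expect any of my regexes to need to span multiple lines (delimited by \n characters), so I could add a second buffer to my program, run the regexes only on completed lines, and then write to the destination file line by line instead of chunk by chunk. Unfortunately, because of the nature of mysqldump output, there are some very long lines (some are actually hundreds of megabytes), so I don't think this is a viable option.

How can I read this file with a reasonable memory footprint (say, a few tens of MB) and modify it in-stream using regular expressions?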

score 0 · Accepted Answer

$chunk = fread($reader, $chunk_length); // read $chunk_length bytes (fewer only if EOF is reached)
while (!feof($reader)) {
    $previous_chunk = $chunk;
    $chunk = fread($reader, $chunk_length); // read $chunk_length bytes (fewer only if EOF is reached)

    $double_chunk = $previous_chunk . $chunk;
    // run the regular expressions on $double_chunk (to catch matches that span the chunk boundary)
    $previous_chunk = substr($double_chunk, 0, $chunk_length);
    $chunk = substr($double_chunk, $chunk_length);
    fwrite($output_file, $previous_chunk);
}

// run the regular expressions on $chunk to process the last chunk (or the first and only one)
fwrite($output_file, $chunk);

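Problems 1 and 2 are both solved by performing the regex replacement and then assigning the resulting string pieces back to $previous_chunk and $chunk, assuming that whatever you use as the replacement string doesn't re-trigger a match. This changes the write to use $previous_chunk, so that $chunk can still be changed on the next pass, when there is a chance to catch a cross-chunk match.

Also, and importantly, the above assumes the replacement has the same length as the string being replaced. If it doesn't, the chunk sizes change dynamically after replacement, and the solution above is too naive to handle that. If the replacement strings differ in length, you have to track the changing chunk boundaries somehow.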
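One possible way to cope with replacements of a different length, sketched here as a variation and not as part of the answer above, is to stop splitting at a fixed offset and instead hold back a short tail of the processed buffer. The function stream_replace and its parameters are hypothetical names for this example. It assumes that no match is longer than $maxMatchLen bytes, that the pattern never matches a truncated prefix of a longer match (a greedy \d+ cut off at a chunk boundary would break this), and that the replacement text cannot itself form a new match.

```php
<?php
// Hypothetical helper: copies $in to $out, applying preg_replace chunk by chunk.
// Assumptions (see text): matches are at most $maxMatchLen bytes, a cut-off
// match never matches in shortened form, replacements do not re-trigger.
function stream_replace($in, $out, string $pattern, string $replacement,
                        int $chunkLength, int $maxMatchLen): void
{
    $carry = ''; // already-processed tail held back from the previous pass
    while (!feof($in)) {
        $data = fread($in, $chunkLength);
        if ($data === false) {
            break; // read error: stop and flush what we have
        }
        $buffer = preg_replace($pattern, $replacement, $carry . $data);
        if ($buffer === null) {
            throw new RuntimeException('preg_replace failed: ' . preg_last_error());
        }
        // A match cut off by the chunk boundary must start within the last
        // ($maxMatchLen - 1) bytes, so only that tail is held back.
        $keep = min(strlen($buffer), max(0, $maxMatchLen - 1));
        $emitLen = strlen($buffer) - $keep;
        fwrite($out, substr($buffer, 0, $emitLen));
        $carry = (string) substr($buffer, $emitLen);
    }
    fwrite($out, $carry); // no more input can extend the tail
}

// Usage on in-memory streams; "foo" is split across 3-byte chunks.
$in = fopen('php://memory', 'w+');
fwrite($in, 'xxfooxxfooxx');
rewind($in);
$out = fopen('php://memory', 'w+');
stream_replace($in, $out, '/foo/', 'barbaz', 3, 3);
rewind($out);
echo stream_get_contents($out), "\n"; // prints xxbarbazxxbarbazxx
```

Because the split point is computed from the length of the buffer after replacement, the output no longer depends on the replacement being the same length as the match. Memory use stays at roughly one chunk plus the tail, at the cost of rescanning up to $maxMatchLen - 1 bytes per pass.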
