html - How do I split a file into multiple files based on a RegEx pattern?

Question

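I want to split a file into multiple files based on a specific regular expression pattern. I have provided a reproducible example below. If there is a simpler solution, I would welcome that too!

I have a directory containing the following files: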

page1.html page2.html page3.html

Suppose my page1.html looks like this:

<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want to split page1.html into:

page1_0.html

<strong>Hello world</strong>

page1_1.html

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

page1_2.html

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want code that identifies lines matching the following pattern:

[0 to 10 characters in the beginning] , Page (1 [0 to 10 characters here]). </p>

I currently have the following code:

for filename in *.html; do gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" "$filename" /'Page (1'/ '{*}'; done

But this creates a page1_3.html containing the following text:

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

But when I run this:

for filename in *.html; do gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" "$filename" /'^.{0,10}, Page \(1.{0,10}\).\<\/p\>'/ '{*}'; done

it only outputs the file page1_0.html.

What is wrong with my regex? Is there another way to achieve what I am trying to do?

score 1 · Accepted Answer

You can do this with a short Perl script.

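# chunker.pl
use 5.022;            # the <<>> operator used below needs Perl 5.22 or later
use strict;
use diagnostics;
use B "perlstring";   # perlstring() quotes a line for the debug output

our $i = 0;                    # index of the current output chunk
our $fmt = "page1_%d.html";    # template for the output file names
our $fn = sprintf $fmt, $i;

# Everything before the first matching line goes into page1_0.html.
open our $fh, ">", $fn or die $!;
print "opened $fn\n";
while (<<>>) {
  printf "read line $.: %s\n", perlstring $_;
  # A matching line starts a new chunk, so switch files before writing it.
  if (m{^.{0,10}?, Page \(1 [^)]{0,10}?\)\.</p>}) {
    print "break matched line $.\n";
    $fn = sprintf $fmt, ++$i;
    open $fh, ">", $fn or die $!;
    print "opened $fn\n";
  }
  print $fh $_;
}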

Output:

$ perl chunker.pl page1.html

opened page1_0.html
read line 1: "<strong>Hello world</strong>\n"
read line 2: "\n"
read line 3: "<p>ABC, Page (1 whatever).</p>\n"
break matched line 3
opened page1_1.html
read line 4: "<p>Some text</p>\n"
read line 5: "\n"
read line 6: "<p>DEF, Page (1 ummm what).</p>\n"
break matched line 6
opened page1_2.html
read line 7: "<p>Some text</p>\n"
read line 8: "\n"
read line 9: "<p>THE<em><strong><span class=\"underline\">GHI</span></strong></em>JK <em><strong><span class=\"underline\">the</span></strong></em>LMNOP<em><strong><span class=\"underline\">Q</span></strong></em>RS.<p> ABC, Page (1).</p>\n"
read line 10: "\n"
read line 11: "\n"



$ for f in page1_*.html; do echo "$f:"; cat $f; echo; done;
page1_0.html:
<strong>Hello world</strong>


page1_1.html:
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>


page1_2.html:
<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I think the problem with your regex is that you need non-greedy matching.

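.{0,10}?      zero to ten characters, as few as possible
, Page \(1    the literal text ", Page (1 " (including the trailing space)
[^)]{0,10}?   zero to ten characters other than a right parenthesis, as few as possible
\)\.</p>      then the closing ")." followed by "</p>"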
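
Before splitting anything, the pattern can be tried against the sample to see which lines it selects. The following is only a sketch under a few assumptions: a grep that supports -E is available, the sample is copied into a hypothetical temporary file, and the pattern is rewritten in ERE syntax with greedy quantifiers because POSIX ERE has no lazy ones. A lazy quantifier changes how much text each part consumes, not whether the line as a whole matches, so the same lines should be selected.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch file holding a copy of the question's sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
EOF

# ERE form of the answer's pattern: \( and \) are literal parentheses,
# {0,10} is a bounded repetition, and the quantifiers are greedy.
pattern='^.{0,10}, Page \(1 [^)]{0,10}\)\.</p>'

# Collect the numbers of the lines that would start a new chunk.
break_lines=$(grep -nE "$pattern" "$sample" | cut -d: -f1 | paste -sd' ' -)
echo "break lines: $break_lines"
```

On the sample this reports lines 3 and 6, which agrees with the "break matched" lines in the script output above. The last line is not selected because far more than ten characters precede its ", Page".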

HTH

score 1 · Accepted Answer

^.{0,10}, Page \(1.{0,10}\).\<\/p\>

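What is wrong with my regex?

It is not a POSIX BRE, which is the syntax csplit expects. In a BRE the interval braces must be escaped as \{ and \}, unescaped parentheses are literal, and \( and \) delimit a group instead. Try ^.\{0,10\}, Page (1.\{0,10\}).<\/p>.

The / is written as \/ only because it appears inside the /REGEXP/[OFFSET] argument of the csplit utility. You probably also want to change the last . to \. so that it matches your literal dot character.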
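
Putting both corrections together, the loop from the question might look like the sketch below. It assumes GNU csplit is on the PATH as csplit (the question invokes it as gcsplit, so substitute that name where it applies), and it builds a throwaway copy of the sample page1.html in a temporary directory, which is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU csplit is available as `csplit`
# (the question calls it `gcsplit`; substitute as needed).
set -eu

# Hypothetical scratch directory with a copy of the question's sample file.
workdir=$(mktemp -d)
cd "$workdir"
cat > page1.html <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
EOF

# BRE syntax: \{0,10\} is the interval, ( and ) are literal,
# \. is a literal dot, and \/ keeps the slash inside the /REGEXP/ argument.
pattern='^.\{0,10\}, Page (1.\{0,10\})\.<\/p>'

for filename in *.html; do
  # -z drops empty pieces; '{*}' repeats the pattern until the input ends.
  csplit -z -f "${filename%.*}_" --suffix-format='%d.html' \
    "$filename" "/$pattern/" '{*}'
done
```

With the sample input this should leave page1_0.html, page1_1.html and page1_2.html, which is the split requested in the question. The long final paragraph stays in page1_2.html because its ", Page" is preceded by more than ten characters.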
