
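I want to split a file into multiple files based on a specific regex pattern. I have provided a reproducible example below. If there is a simpler solution, I am open to that as well!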

I have a directory containing the following files:

page1.html page2.html page3.html

Suppose my page1.html looks like this:

<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want to split page1.html into:

page1_0.html

<strong>Hello world</strong>

page1_1.html

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

page1_2.html

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want code that identifies lines with the following pattern:

[0 to 10 characters in the beginning] , Page (1 [0 to 10 characters here]). </p>

I currently have the following code:

for filename in *.html; gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" $filename /'Page (1'/ '{*}'

But this also creates a page1_3.html containing the following text:

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

But when I run this:

for filename in *.html; gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" $filename /'^.{0,10}, Page \(1.{0,10}\).\<\/p\>'/ '{*}'

it only outputs the file page1_0.html.

What is wrong with my regex? Is there another way to achieve what I am trying to do?


2 Answers


You can do this with a short Perl script.

#chunker.pl
use 5.022;
use strict;
use diagnostics;
use B "perlstring";

our $i = 0;
our $fmt = "page1_%d.html";
our $fn = sprintf $fmt, $i;

open our $fh, ">", $fn or die $!;
print "opened $fn\n";
while (<<>>) {
  printf "read line $.: %s\n", perlstring $_;
  if (m{^.{0,10}?, Page \(1 [^)]{0,10}?\)\.</p>}) {
    print "break matched line $.\n";
    $fn = sprintf $fmt, ++$i;
    open $fh, ">", $fn or die $!;
    print "opened $fn\n";
  }
  print $fh $_;
}

It prints:

$ perl chunker.pl page1.html

opened page1_0.html
read line 1: "<strong>Hello world</strong>\n"
read line 2: "\n"
read line 3: "<p>ABC, Page (1 whatever).</p>\n"
break matched line 3
opened page1_1.html
read line 4: "<p>Some text</p>\n"
read line 5: "\n"
read line 6: "<p>DEF, Page (1 ummm what).</p>\n"
break matched line 6
opened page1_2.html
read line 7: "<p>Some text</p>\n"
read line 8: "\n"
read line 9: "<p>THE<em><strong><span class=\"underline\">GHI</span></strong></em>JK <em><strong><span class=\"underline\">the</span></strong></em>LMNOP<em><strong><span class=\"underline\">Q</span></strong></em>RS.<p> ABC, Page (1).</p>\n"
read line 10: "\n"
read line 11: "\n"



$ for f in page1_*.html; do echo "$f:"; cat $f; echo; done;
page1_0.html:
<strong>Hello world</strong>


page1_1.html:
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>


page1_2.html:
<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>


I think the problem with your regex is that you need non-greedy matching.

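.{0,10}?       zero to ten of any character, as few as possible
, Page \(1     the literal text ", Page (1" (followed by a space in the script)
[^)]{0,10}?    zero to ten characters other than a closing parenthesis, as few as possible
\)\.</p>       then the closing ").</p>"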
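
To see which lines of the sample this pattern selects, here is a small sketch using grep -E on a temporary copy of page1.html. ERE has no lazy quantifiers, so the "?" suffixes are dropped; for a plain does-this-line-match check, greediness does not change which lines are selected. The temporary file and the variable names are only for this illustration.

```shell
# Recreate the sample page1.html from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
EOF

# Same pattern as the Perl script, minus the lazy "?" (not available in ERE).
pattern='^.{0,10}, Page \(1 [^)]{0,10}\)\.</p>'

# -n prefixes each matching line with its line number.
matches=$(grep -nE "$pattern" "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Only lines 3 and 6 are reported. Line 9 is skipped because its ", Page (1" sits far more than ten characters from the start of the line, and its "(1)" has no space after the 1.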

HTH

answered 2020-12-16T04:24:11.930

^.{0,10}, Page \(1.{0,10}\).\<\/p\>

What is wrong with my regex?

It is not POSIX BRE. Try ^.\{0,10\}, Page (1.\{0,10\}).<\/p>

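The / is written as \/ only because / is the delimiter in the /REGEXP/[offset] argument of the csplit tool. You may also want to change the last . to \. so that it matches your literal dot character.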
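
Putting that together, here is a sketch of the loop with the BRE pattern, run on a throwaway copy of the sample. It assumes GNU csplit (installed as gcsplit by Homebrew on macOS, plain csplit on most Linux systems). The temporary directory, the CSPLIT variable and the -s (quiet) flag are additions for this illustration, and the final dot is escaped as suggested above.

```shell
# Use gcsplit when present (Homebrew on macOS); otherwise assume csplit is GNU.
if command -v gcsplit >/dev/null 2>&1; then CSPLIT=gcsplit; else CSPLIT=csplit; fi

# Work on a copy of the sample page in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/page1.html" <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
EOF

# In BRE, intervals are written \{m,n\} and unescaped ( ) are literal parentheses.
# The glob is expanded once, before the loop starts, so the newly created
# pieces are not fed back into csplit.
for filename in "$workdir"/*.html; do
  "$CSPLIT" -s -z -f "${filename%.*}_" --suffix-format="%d.html" "$filename" \
    '/^.\{0,10\}, Page (1.\{0,10\})\.<\/p>/' '{*}'
done

ls "$workdir"
# Remove "$workdir" once the result has been inspected.
```

With this pattern the sample splits into page1_0.html, page1_1.html and page1_2.html, and no page1_3.html is created, because the long last line no longer counts as a break.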

answered 2020-12-16T14:04:57.690