html - Regular expression to strip everything except a DIV

Question

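I'm using jEdit, and I have a bunch of badly coded HTML files. I want to extract the main content from them without the surrounding HTML.

I need everything between <div class="main-text"> and the next </div>.

There must be a regex way to do this; jEdit lets me find and replace with regular expressions.

I'm not proficient with regular expressions and it would take me a long time to work this out. Can anyone help quickly?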

score 1 · Accepted Answer

Taking your question literally, you can replace:

/.*<div class="main-text">(.*?)<\/div>.*/

with \1 (or $1, depending on what your editor uses).
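As a rough sketch of the same substitution outside an editor, here is a Python version. The sample html string and the variable names are made up for illustration, and re.DOTALL is an assumption on my part: in Python, "." stops at line breaks without it, so the pattern would only work when the whole file sits on one line.

```python
import re

# Hypothetical sample document; not taken from the question.
html = """<html><body>
<div class="header">menu</div>
<div class="main-text">
The content I want.
</div>
<div class="footer">junk</div>
</body></html>"""

# Same pattern as above; re.DOTALL lets "." match newlines as well.
pattern = re.compile(r'.*<div class="main-text">(.*?)</div>.*', re.DOTALL)

# Replace the whole document with the first capture group.
main_text = pattern.sub(r"\1", html, count=1)
print(main_text.strip())
```

The leading and trailing .* swallow everything around the div, so the replacement leaves only the captured group.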

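However, this will come back to bite you: what if your "main-text" element contains another <div> element? If you are sure that won't happen, you're fine. Otherwise, you're in trouble. It may be easier to replace /.*<div class="main-text">/ with an empty string, then find the end by hand and delete everything after it.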
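To make the nested <div> problem concrete, here is a small Python sketch (the input string is invented for illustration). The lazy (.*?) stops at the first </div> it sees, which belongs to the inner element, so the capture is cut short.

```python
import re

# Hypothetical input where the main text contains a nested <div>.
nested = '<div class="main-text">intro <div class="note">aside</div> conclusion</div>'

match = re.search(r'<div class="main-text">(.*?)</div>', nested, re.DOTALL)
captured = match.group(1)

# The capture ends at the inner </div>, so " conclusion" is lost.
print(captured)
```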

For that matter, this task may be easiest to do by hand, so that you don't have to double-check the result after running the code.

score 0 · Accepted Answer

This regex should solve your problem: /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi

Here is an example in Perl:

my $str = '<div class="main-text"> and the next </div>';
$str =~ /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi;
print $1;

The example is in Perl, but the regex itself can be used in other languages.

Here is an explanation of the regex:

/       -start of the regex
   <\s*    -matches < optionally followed by whitespace
      div     -matches "div"
         \s+     -matches one or more whitespace characters after the <div
         class="main-text"    -matches class="main-text" (so <div class="main-text" to here)
         [^>]*       -matches any characters except >, because the div may have more attributes
         >          -matches >, so <div class="main-text"> until now
      (.*?)        -lazily matches everything up to the next </div> and saves it in $1 (add the s flag if the content spans several lines)
   <\/div>        -matches </div>, so now we have <div class="main-text">( and the next )</div> until now
/gi       -end of the regex; g matches all occurrences, i makes the regex case insensitive
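Since the pattern is not tied to Perl, the same match can be sketched in Python. Two things here are assumptions rather than part of the answer: the sample string, and the re.DOTALL flag, which I added so that "." also matches newlines when the div spans several lines. re.findall plays the role of /g, and re.IGNORECASE the role of /i.

```python
import re

# Hypothetical input with two matching divs, one of them multi-line and upper-case.
page = '<div class="main-text">first</div>\n<DIV class="main-text" id="x">second\nline</DIV>'

pattern = re.compile(r'<\s*div\s+class="main-text"[^>]*>(.*?)</div>',
                     re.IGNORECASE | re.DOTALL)

# findall returns the contents of group 1 for every match.
contents = pattern.findall(page)
print(contents)
```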

score 0 · Accepted Answer

This regex captures the text between the HTML tags:

<(?<tag>div).*?>(?<text>.*)</\k<tag>>

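Breakdown:

<(?<tag>div).*?> : the opening tag, starting with div; this group is named "tag"
(?<text>.*) : captures the text between the tags
</\k<tag>> : the closing div tag, a backreference to the group named "tag"

In the end, the match gives you two groups, "tag" and "text"; the content you want is in "text".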
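For reference, here is a hedged Python sketch of the named-group idea. Python's re module spells named groups (?P<name>...) and backreferences (?P=name), so the pattern below is adapted from the (?<tag>...) and \k<tag> syntax above; the sample string is made up. Note that the "text" group is greedy, so with several divs in the input it runs up to the last closing tag.

```python
import re

# Hypothetical input; not from the original post.
snippet = '<div class="main-text">Hello, world</div>'

# (?P<tag>...) names a group; (?P=tag) refers back to whatever it matched.
pattern = re.compile(r'<(?P<tag>div).*?>(?P<text>.*)</(?P=tag)>', re.DOTALL)

m = pattern.search(snippet)
print(m.group("tag"))
print(m.group("text"))
```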
