regex - Regular expression: find one pattern followed by another pattern, with a gap in between

Question

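I have a file containing hundreds of SQL INSERT statements. I only want to identify the statements that begin with an HTML paragraph tag (<p>) but have no closing paragraph tag (</p>).

I'm trying something along these lines: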

<p>[^\n]*(?!</p>) <-- a <p> followed by any number of characters until \n and then </p>

This doesn't work. Here is some sample data:

INSERT INTO `help` VALUES 
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),

Ideally, I would append a closing tag wherever one is missing, e.g. in insert statement #2.

score 1 · Accepted Answer

If you use this:

(\(\d+,\d+,'<p>.*?)(</p>)?('\),)

you will get backreferences to the following parts:

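(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes. <-- i.e. the preamble and the body text, including the opening P tag
</p> <-- the optional closing P tag, i.e. it may not match, as in row 2
'), <-- the closing quote and bracket, plus the trailing comma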
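To see what each group captures, here is a minimal Python sketch run against the sample rows from the question. The helper name `split_row` is made up for illustration, and Python is only one possible platform; the pattern itself is the one given above.

```python
import re

# Sample rows copied from the question.
rows = [
    "(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),",
    "(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),",
    "(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),",
]

# Group 1: preamble and body, group 2: optional </p>, group 3: closing quote, bracket and comma.
PATTERN = re.compile(r"(\(\d+,\d+,'<p>.*?)(</p>)?('\),)")


def split_row(row):
    """Return the three captured groups for a row (hypothetical helper)."""
    match = PATTERN.search(row)
    return match.groups() if match else None


for row in rows:
    body, closing_tag, tail = split_row(row)
    # closing_tag is None when the row has no </p> before the closing quote.
    print(repr(closing_tag), repr(tail))
```

In Python an optional group that did not take part in the match is reported as `None`, which is how row 2 can be told apart from the others.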

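You can then replace it with:

$1</p>$3 (e.g. using .NET-style backreferences).

That is, rebuild the string from the backreferences and insert an explicit closing P tag, whether or not one was found.

Without knowing your platform, I can't give the exact regex replacement syntax for it.

In .NET it would be: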

string input = @"INSERT INTO `help` VALUES 
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),";

Regex r = new Regex(@"(\(\d+,\d+,'<p>.*?)(</p>)?('\),)");
string output = r.Replace(input, "$1</p>$3");

Console.Write(output);

This produces the following output:

INSERT INTO `help` VALUES
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. </p>'),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),
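If your platform happens to be Python, the same replacement might look like the sketch below. The function name `close_paragraphs` is an assumption for illustration; `\g<1>` and `\g<3>` are Python's spelling of the `$1` and `$3` backreferences. It also assumes, like the .NET version, that the sequence `'),` never appears inside the text of a row.

```python
import re

sql = """INSERT INTO `help` VALUES
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),"""

# Same pattern as the .NET example; "." does not match newlines by default,
# so each match stays within a single row.
PATTERN = re.compile(r"(\(\d+,\d+,'<p>.*?)(</p>)?('\),)")


def close_paragraphs(text):
    """Rebuild every row with an explicit </p>, whether or not one was matched."""
    return PATTERN.sub(r"\g<1></p>\g<3>", text)


fixed = close_paragraphs(sql)
print(fixed)
```

Because the optional group is dropped and the closing tag is always written back, running the replacement a second time leaves the text unchanged.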

score 1 · Accepted Answer

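If you can be sure that the closing </p> is always followed by a quote ', then the following works in Perl (I don't have Notepad++):

/<p> [^\n]* (?<! <\/p> )  (?=') /gx

(/x allows whitespace in the pattern, for clarity.) This is a negative lookbehind for </p>, placed right before a lookahead for the quote.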
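As a rough cross-check, the same lookbehind/lookahead idea can be sketched in Python, whose `re` module accepts this lookbehind because `</p>` has a fixed width. The names `find_unclosed` and `append_closing` are hypothetical, and the sketch assumes the row text contains no other single quotes (an escaped quote inside the text could produce a false match).

```python
import re

sql = """INSERT INTO `help` VALUES
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),"""

# <p>, then the rest of the line up to a point that is NOT preceded by </p>
# but IS followed by a quote.
UNCLOSED = re.compile(r"<p>[^\n]*(?<!</p>)(?=')")


def find_unclosed(text):
    """Return the paragraph bodies that lack a closing </p>."""
    return UNCLOSED.findall(text)


def append_closing(text):
    """Append </p> to each unclosed paragraph, leaving closed ones untouched."""
    return UNCLOSED.sub(lambda m: m.group(0) + "</p>", text)


print(find_unclosed(sql))
print(append_closing(sql))
```

Unlike the capture-group approach, this one only matches the rows that need fixing, so it can also be used purely to identify them.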
