html - .NET regex for HTML headings

Question

I am trying to extract all the data from the heading tags in a Word document that was converted to HTML (by Word itself).

I have the following regex:

<(?<Class>h[5|6|7|8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+</span>(?<Text>.*?)(?:</h[5|6|7|8]>)?

My source text looks like this:

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>The Scheme (planning scheme) has been
prepared in accordance with the <i>asdf </i>(the Act)
as a framework for managing development in a way that advances the purpose of
the Act.</h5>

<h5>(2)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>In seeking to achieve this purpose, the planning scheme sets out
the future development in the
planning scheme area over the next 20 years.</h5>

<h5>(3)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>While the planning scheme has been prepared with a 20 year horizon, it
will be reviewed periodically in accordance with the Act to ensure that it
responds appropriately to the changes of the community at Local, Regional and State
levels.</h5>

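The regex seems to work, but it captures everything from the first h5 through to the last one, or to any other h6|7|8.

I don't want to do anything complicated with the data here, just a simple extraction, so I would like to stick with a regex rather than use an HTML parser. In my example it is fair to say the headings are well formed, i.e. an hX is always closed by an hX rather than an hY, and there are no headings nested inside headings or anything funky like that.

I thought adding ? to the end of the (?:) would make it non-greedy, so that it would match only the first instance rather than as much as possible. Am I missing something about how greediness works here?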

Edit:

The regex

<(?<Class>h[5-8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>(?<Text>.*?)(?:</h[5-8]>)

also seems to match

<h6>&nbsp;</h6>

<h6>&nbsp;</h6>

<h6>&nbsp;</h6>

<h6>&nbsp;</h6>

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>Short Title -The planning scheme policy may be cited as PSP No 2. –
Engineering Standards – Road and Drainage Infrastructure.</h5>

So it includes the whole text, whereas I would like it to ignore the h6 elements that contain only nbsp, since they have no span.

score 2 · Accepted Answer

There is a greedy .+ in the middle of your regex (right before </span>) that is causing the problem. Change it to .+? and your regex should work.
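A minimal sketch of the greedy versus lazy difference, ported to Python purely for illustration. Python writes named groups as (?P<Name>...) rather than .NET's (?<Name>...), and re.DOTALL is assumed here to stand in for RegexOptions.Singleline, since the reported behaviour implies that . was matching newlines. The shortened sample headings are made up for the demo.

```python
import re

# Python port of the pattern; HEAD and TAIL are split out only so the
# two variants differ in exactly one token (.+ versus .+?).
HEAD = r'''<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+'''
TAIL = r'''</span>(?P<Text>.*?)</h[5-8]>'''

greedy = re.compile(HEAD + r".+" + TAIL, re.DOTALL)
lazy = re.compile(HEAD + r".+?" + TAIL, re.DOTALL)

# Hypothetical, shortened input in the same shape as the Word output.
SPAN = "<span style='font:7.0pt \"Times New Roman\"'>"
sample = (
    "<h5>(1)" + SPAN + "&nbsp;&nbsp;\n</span>First heading.</h5>\n\n"
    "<h5>(2)" + SPAN + "&nbsp;&nbsp;\n</span>Second heading.</h5>\n"
)

greedy_texts = [m.group("Text") for m in greedy.finditer(sample)]
lazy_texts = [m.group("Text") for m in lazy.finditer(sample)]

# The greedy .+ runs to the last </span>, so both headings collapse
# into a single match; the lazy .+? stops at the first </span>.
print(greedy_texts)
print(lazy_texts)
```

With the greedy form the two headings come back as one match whose Text is taken from the last heading; with the lazy form each heading is matched separately.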

Note that your character class should be [5678] instead of [5|6|7|8] (the OR between characters is implied), and it can even be shortened to [5-8].
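To see why the pipes are unwanted, here is a small Python check (illustrative only; character classes behave the same way in .NET on this point). Inside square brackets | is an ordinary character, so the original class also accepts a literal pipe.

```python
import re

# "|" inside [...] is a literal pipe, not alternation.
loose = re.compile(r"h[5|6|7|8]")
tight = re.compile(r"h[5-8]")

pipe_accepted_by_loose = loose.fullmatch("h|") is not None
pipe_accepted_by_tight = tight.fullmatch("h|") is not None
digits_accepted = all(tight.fullmatch("h" + d) for d in "5678")

print(pipe_accepted_by_loose, pipe_accepted_by_tight, digits_accepted)
```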

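You should also remove the trailing ? from the end: (?:</h[5-8]>)? should be (?:</h[5-8]>). Without this change your match will end early, because the lazy (?<Text>.*?) is free to match nothing when the closing tag is optional.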
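The early ending can be reproduced with a stripped-down pattern. This is a sketch in Python with a made-up one-line heading, not the full regex from the question: when the closing tag is optional, the lazy group is satisfied with zero characters and the match stops right after the opening tag.

```python
import re

# Same shape as the tail of the question's regex, reduced to the essentials.
optional_close = re.compile(r"<h5>(?P<Text>.*?)(?:</h5>)?", re.DOTALL)
required_close = re.compile(r"<h5>(?P<Text>.*?)(?:</h5>)", re.DOTALL)

sample = "<h5>Title</h5>"  # hypothetical input
early = optional_close.search(sample)
full = required_close.search(sample)

# Optional close: the match is just the opening tag and Text is empty.
print(repr(early.group(0)), repr(early.group("Text")))
# Required close: Text has to stretch up to the closing tag.
print(repr(full.group(0)), repr(full.group("Text")))
```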

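Edit: The reason the current regex matches the text you put in your edit is that the .*? in the ListIdentifier group will match a </hX> if it has not yet seen the span and nbsp. You should be able to fix this by changing the .*? to [^<]*, which will not match any less-than sign, so the span is required to be there.

Result: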

<(?<Class>h[5-8])>(?<ListIdentifier>[^<]*)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
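As a rough check of the ListIdentifier change, the following Python sketch compares the .*? and [^<]*versions on a shortened, made-up input with an empty h6 in front of a real heading. The same assumptions apply as for any port of this regex: (?P<Name>...) replaces .NET's (?<Name>...), and re.DOTALL is assumed to play the role of RegexOptions.Singleline.

```python
import re

SPAN_RE = r'''<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>'''
CLOSE_RE = r"(?P<Text>.*?)</h[5-8]>"

# Only the ListIdentifier group differs between the two patterns.
before = re.compile(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)" + SPAN_RE + CLOSE_RE, re.DOTALL
)
after = re.compile(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>[^<]*)" + SPAN_RE + CLOSE_RE, re.DOTALL
)

# Hypothetical input: a spacer heading followed by a numbered heading.
sample = (
    "<h6>&nbsp;</h6>\n\n"
    "<h5>(1)<span style='font:7.0pt \"Times New Roman\"'>&nbsp;&nbsp;\n"
    "</span>Short Title</h5>\n"
)

m_before = before.search(sample)
m_after = after.search(sample)

# .*? starts at the h6 and swallows its closing tag on the way to the span.
print(m_before.group("Class"), repr(m_before.group("ListIdentifier")))
# [^<]* cannot cross a "<", so the h6 is skipped and the h5 is matched.
print(m_after.group("Class"), repr(m_after.group("ListIdentifier")))
```

Because [^<]* also matches newlines on its own, the Singleline-style flag is only needed here for the remaining . tokens.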
