I'm trying to extract all the data from the heading tags in a Word document that was converted to HTML (by Word itself).
<(?<Class>h[5|6|7|8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?: )+.+</span>(?<Text>.*?)(?:</h[5|6|7|8]>)?
<h5>(1)<span style='font:7.0pt "Times New Roman"'>
</span>The Scheme (planning scheme) has been
prepared in accordance with the <i>asdf </i>(the Act)
as a framework for managing development in a way that advances the purpose of
the Act.</h5>
<h5>(2)<span style='font:7.0pt "Times New Roman"'>
</span>In seeking to achieve this purpose, the planning scheme sets out
the future development in the
planning scheme area over the next 20 years.</h5>
<h5>(3)<span style='font:7.0pt "Times New Roman"'>
</span>While the planning scheme has been prepared with a 20 year horizon, it
will be reviewed periodically in accordance with the Act to ensure that it
responds appropriately to the changes of the community at Local, Regional and State
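The regex seems to work, but it captures everything from the first h5 through to the last one, or to any other h6|7|8.
I don't want to do anything complicated with the data here, just a simple extraction, so I'd like to stick with a regex rather than use an HTML parser. In my sample it is fair to say the headings are well formed, i.e. an hX is always closed by an hX and never by an hY, and there are no headings nested inside headings or anything funky like that.
I thought that adding ? to the end of the (?:) group would make it non-greedy, so that it would match only the first instance rather than as much as possible. Am I missing something about how greediness works?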
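As a hedged illustration of the greediness question, here is a minimal Python sketch. It is not the original setup: the sample HTML is shortened and hypothetical, the group name `Id` is made up, and Python spells named groups `(?P<Name>...)` where the pattern above uses `(?<Name>...)`. It assumes the dot is allowed to match newlines (`re.DOTALL`). A `?` placed after a group makes that group optional; it does not make an earlier `.+` lazy.

```python
import re

# Shortened, hypothetical stand-in for the Word-generated HTML.
html = "<h5>(1)<span>x</span>first</h5>\n<h5>(2)<span>y</span>second</h5>"

# Greedy `.+` runs to the LAST </span>. The trailing `?` only makes the
# closing tag optional, so the lazy Text group is free to match nothing.
greedy = re.compile(
    r"<h5>(?P<Id>.*?)<span>.+</span>(?P<Text>.*?)(?:</h5>)?", re.DOTALL
)

# Lazy `.+?` stops at the FIRST </span>, and the closing tag is required,
# which forces the lazy Text group to expand up to it.
lazy = re.compile(
    r"<h5>(?P<Id>.*?)<span>.+?</span>(?P<Text>.*?)</h5>", re.DOTALL
)

greedy_matches = [(m.group(0), m.group("Text")) for m in greedy.finditer(html)]
lazy_matches = [(m.group("Id"), m.group("Text")) for m in lazy.finditer(html)]

print(greedy_matches)  # one match spanning both headings, empty Text
print(lazy_matches)    # one (Id, Text) pair per heading
```

In this sketch the laziness has to go on the quantifier that over-consumes (`.+?`), and the closing tag has to be mandatory, otherwise a lazy group at the very end of a pattern matches the empty string.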
<(?<Class>h[5-8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?: )+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
<h6> </h6>
<h6> </h6>
<h6> </h6>
<h6> </h6>
<h5>(1)<span style='font:7.0pt "Times New Roman"'>
</span>Short Title -The planning scheme policy may be cited as PSP No 2. –
Engineering Standards – Road and Drainage Infrastructure.</h5>
So the match includes the whole text, whereas I want it to skip the h6 elements that contain only nbsp, because they have no span.
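One possible reading of this behaviour, sketched in Python under stated assumptions: a lazy `.*?` is only "shortest" relative to where the match starts, and the engine still starts at the leftmost `<h6>`, so the group stretches across the closing tags until a span finally appears. The sample below is shortened, the span pattern is simplified to `<span[^>]*>.*?</span>`, and the `[^<]*` restriction plus the `(?P=Class)` backreference are suggestions for illustration, not part of the original pattern. Python writes named groups as `(?P<Name>...)`.

```python
import re

# Shortened version of the sample above: empty h6 headings, then a real h5.
html = (
    "<h6> </h6>\n"
    "<h6> </h6>\n"
    "<h5>(1)<span style='font:7.0pt \"Times New Roman\"'>\n"
    "</span>Short Title</h5>"
)

# Loose: ListIdentifier may contain anything, including other tags.
loose = re.compile(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)<span[^>]*>.*?</span>"
    r"(?P<Text>.*?)</h[5-8]>",
    re.DOTALL,
)

# Strict: ListIdentifier cannot cross a '<', so a heading without a span
# fails to match at all; the closing tag must repeat the opening one.
strict = re.compile(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>[^<]*)<span[^>]*>.*?</span>"
    r"(?P<Text>.*?)</(?P=Class)>",
    re.DOTALL,
)

loose_match = loose.search(html)
strict_match = strict.search(html)

print(loose_match.group("Class"))            # starts at the first empty h6
print(strict_match.group("Class"))           # starts at the h5
print(strict_match.group("ListIdentifier"))
print(strict_match.group("Text"))
```

The backreference relies on the statement above that an hX is always closed by the same hX. If the list identifier can itself contain markup, `[^<]*` would be too strict and a different restriction would be needed.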