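I'm trying to extract all the data from the heading tags in a Word document that was converted to HTML (by Word itself).
I have the following regular expression: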
<(?<Class>h[5|6|7|8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?: )+.+</span>(?<Text>.*?)(?:</h[5|6|7|8]>)?
My source text looks like this:
<h5>(1)<span style='font:7.0pt "Times New Roman"'>
</span>The Scheme (planning scheme) has been
prepared in accordance with the <i>asdf </i>(the Act)
as a framework for managing development in a way that advances the purpose of
the Act.</h5>
<h5>(2)<span style='font:7.0pt "Times New Roman"'>
</span>In seeking to achieve this purpose, the planning scheme sets out
the future development in the
planning scheme area over the next 20 years.</h5>
<h5>(3)<span style='font:7.0pt "Times New Roman"'>
</span>While the planning scheme has been prepared with a 20 year horizon, it
will be reviewed periodically in accordance with the Act to ensure that it
responds appropriately to the changes of the community at Local, Regional and State
levels.</h5>
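The regex seems to work, but it captures everything from the first h5 through to the last one, or to any other h6/h7/h8.
I don't want to do anything complicated with the data here, just a simple extraction, so I'd like to stick with a regex rather than an HTML parser. In my example it's fair to say the headings are well formed, i.e. an hX is always closed by an hX rather than an hY, and there are no headings nested inside headings or anything funky like that.
I thought adding ? to the end of the (?:) would make it non-greedy, so that it would match only the first instance rather than as much as possible. Am I missing something about how greediness works here?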
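For reference, a minimal Python sketch of the two behaviours in question. It rests on some assumptions: Python's `re` spells named groups `(?P<name>...)` rather than the `(?<name>...)` used above, `re.DOTALL` stands in for whatever single-line option the original engine is run with, and the HTML is a shortened, made-up stand-in for the Word output. It illustrates that a lazy quantifier does not change where a match starts, and that a `?` placed after a group makes the group optional rather than lazy.

```python
import re

# Shortened, hypothetical stand-in for the Word output:
# a heading without a span, followed by a heading with one.
html = "<h6>x</h6>\n<h5>(1)<span>s</span>Body text.</h5>"

# 1) Lazy quantifiers still begin at the leftmost possible start.
#    The match starts at <h6>, and ListIdentifier grows (lazily) until a
#    <span> is found, crossing </h6> and <h5> on the way.
lazy = re.compile(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)<span>.+?</span>"
    r"(?P<Text>.*?)</h[5-8]>",
    re.DOTALL,
)
m = lazy.search(html)
spanning_class = m.group("Class")
spanning_identifier = m.group("ListIdentifier")

# 2) A trailing "?" on the closing-tag group makes that group optional.
#    The lazy Text group is then satisfied with zero characters and the
#    closing tag is simply skipped.
optional_close = re.compile(r"<h5>(?P<Text>.*?)(?:</h5>)?", re.DOTALL)
m2 = optional_close.search("<h5>Body</h5>")
optional_match = m2.group(0)
optional_text = m2.group("Text")

print(spanning_class, repr(spanning_identifier))
print(repr(optional_match), repr(optional_text))
```

In this sketch the first pattern reports the class as h6 even though the span belongs to the h5, and the second pattern matches only the opening tag with an empty Text group.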
Edit:
The regex
<(?<Class>h[5-8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?: )+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
also seems to match
<h6> </h6>
<h6> </h6>
<h6> </h6>
<h6> </h6>
<h5>(1)<span style='font:7.0pt "Times New Roman"'>
</span>Short Title -The planning scheme policy may be cited as PSP No 2. –
Engineering Standards – Road and Drainage Infrastructure.</h5>
So it includes the whole text, whereas I want it to skip the h6 elements that contain only nbsp, since they have no span.
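One possible way to get that behaviour, sketched in Python under stated assumptions, is to stop the ListIdentifier group from running past a closing heading tag by putting a negative lookahead in front of each character, and to require the closing tag to repeat the opening one with a backreference. The `(?P<name>...)` syntax, the `re.DOTALL` flag, the `&nbsp;` spacer text, and the shortened sample are all assumptions made for illustration. This is not a general HTML solution.

```python
import re

# Literal opening span tag as it appears in the Word output, escaped for regex use.
SPAN_OPEN = re.escape("<span style='font:7.0pt \"Times New Roman\"'>")

heading = re.compile(
    r"<(?P<Class>h[5-8])>"
    # Consume characters only while we are NOT at a closing heading tag,
    # so this group can never run from one heading into the next.
    r"(?P<ListIdentifier>(?:(?!</h[5-8]>).)*?)"
    + SPAN_OPEN
    + r".*?</span>"
    r"(?P<Text>.*?)"
    # Backreference: the closing tag must match the opening one (hX ... hX).
    r"</(?P=Class)>",
    re.DOTALL,
)

# Hypothetical sample: spacer-only h6 headings followed by a real h5.
html = (
    "<h6>&nbsp;</h6>\n"
    "<h6>&nbsp;</h6>\n"
    "<h5>(1)<span style='font:7.0pt \"Times New Roman\"'>&nbsp;&nbsp;\n"
    "</span>Short Title -The planning scheme policy may be cited as PSP No 2.</h5>\n"
)

results = [
    (m.group("Class"), m.group("ListIdentifier"), m.group("Text"))
    for m in heading.finditer(html)
]
for row in results:
    print(row)
```

With this sample, the h6 headings without a span are skipped because the lookahead blocks the attempt to reach the span from inside them, and only the h5 is reported.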