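Soapbox
We could craft a regex to match your specific case, but this is HTML parsing, and your use case implies there could be any number of tags in there. You would be better off using the DOM or a product like HTML Agility Pack (free).
However
If you just want to pull out the inner text and are not interested in retaining any of the tag data, you could use this regex and replace all matches with an empty string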
(<[^>]*>)
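As a minimal PowerShell sketch of that tag-stripping approach (the `$html` sample string is made up for illustration and is not from the question), replacing every match with an empty string could look like this:

```powershell
# Hypothetical sample input, not taken from the original question
$html = '<p class=demo>Some <b>bold</b> text.</p>'

# Replace every tag with an empty string, leaving only the inner text
$text = $html -replace '<[^>]*>', ''
$text
```

Note that this drops every tag indiscriminately, so text from outside the paragraph is kept as well.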
Retaining the sentence, including subtags
((?:<p(?:\s[^>]*)?>).*?</p>)
- retains the paragraph tags and the entire sentence, but not any data outside the paragraph
(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)
- retains just the paragraph's inner text, including all subtags, and stores the sentence in group 1
(<p(?:\s[^>]*)?>)(.*?)(</p>)
- captures the opening and closing paragraph tags and the inner text, including any subtags
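If the goal is to extract the sentences rather than replace them, one possible sketch is to read group 1 from each match. The `$html` sample and the `$sentences` variable are hypothetical names used only for illustration:

```powershell
# Hypothetical sample input with two paragraphs
$html = '<div>skip</div><p class=a>First <a href="#">link</a>.</p><p>Second.</p>'

# The group-1 pattern from the list above
$pattern = '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)'

# [regex]::Matches returns every match; Groups[1] holds the paragraph's inner text
$sentences = [regex]::Matches($html, $pattern) | ForEach-Object { $_.Groups[1].Value }
$sentences
```

By default `.` does not match newlines in .NET regexes, so a paragraph that spans several lines would need the `(?s)` inline option in front of the pattern.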
Granted, these are PowerShell examples, but the regex and the replace function should be similar in other languages
$string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'
Write-Host "replace p tags with a new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'
Write-Host
Write-Host "insert p tag's inner text into a new span tag and return the entire thing including the p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'
Yields
replace p tags with a new span tag
<img> not this stuff either</img><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span><a> other stuff</a>
insert p tag's inner text into a new span tag and return the entire thing including the p tags
<img> not this stuff either</img><p class=SuperCoolStuff><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>