javascript - Split HTML content into sentences, but keep child tags intact

Question

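I'm using the code below to split all the text inside paragraph tags into sentences. It works fine, with a few exceptions. However, tags inside the paragraph get chewed up and spit out. Example: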

<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>

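So how can I ignore the tags, so that I can just parse the sentences, put span tags around them, and keep the <a> and other child tags in place? Or is it somehow smarter to walk the DOM and do it that way?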

// Split text on page into clickable sentences
$('p').each(function() {
    var sentences = $(this)
        .text()
        .replace(/(((?![.!?]['"]?\s).)*[.!?]['"]?)(\s|$)/g, 
                 '<span class="sentence">$1</span>$3');
    $(this).html(sentences);
});

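I'm using this in a Chrome extension content script, which means the JavaScript is injected into any page it touches and parses the <p> tags on the fly. Therefore, it needs to be JavaScript.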

score 0 · Accepted Answer

Soapbox

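We could craft a regex to match your specific case, but since this is HTML parsing and your use case implies that any number of tags could be in there, you are better off using the DOM or a product such as HTML Agility (free).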
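To illustrate why tag-aware handling tends to be more robust than a single regex, here is a minimal JavaScript sketch (JavaScript because the question targets a content script). It is not part of the original answer: the function name wrapSentences is hypothetical, and the sketch assumes reasonably well-formed markup in which no attribute value or comment contains a ">" character. It tokenizes the paragraph's inner HTML into tags and text, tracks nesting depth, and only splits sentences in text that sits outside any child element, so a sentence that contains a link stays in one piece.

```javascript
// Hypothetical helper: wrap each sentence of a paragraph's inner HTML in a
// <span class="sentence">, leaving child tags untouched.
// Assumption: well-formed markup, no ">" inside attribute values or comments.
function wrapSentences(innerHtml) {
  var VOID_TAGS = /^<(?:img|br|hr|input|wbr)\b/i; // elements with no closing tag
  var tokens = innerHtml.match(/<[^>]*>|[^<]+/g) || []; // tags and text runs
  var out = '';
  var current = ''; // markup collected for the sentence in progress
  var depth = 0;    // how many child elements are currently open

  // Emit the collected sentence, keeping leading whitespace outside the span.
  function flush() {
    var lead = current.match(/^\s*/)[0];
    var body = current.slice(lead.length);
    out += lead;
    if (body !== '') {
      out += '<span class="sentence">' + body + '</span>';
    }
    current = '';
  }

  tokens.forEach(function (token) {
    if (token.charAt(0) === '<') {
      if (/^<\//.test(token)) {
        depth--; // closing tag
      } else if (/^<[a-z]/i.test(token) && !/\/>$/.test(token) && !VOID_TAGS.test(token)) {
        depth++; // opening tag that expects a closing tag
      }
      current += token;
      return;
    }
    if (depth > 0) {
      current += token; // text inside a child element is never split
      return;
    }
    // Top-level text: end a sentence after . ! or ? (plus an optional quote).
    var boundary = /[.!?]['"]?(?=\s|$)/g;
    var last = 0;
    var match;
    while ((match = boundary.exec(token)) !== null) {
      var end = match.index + match[0].length;
      current += token.slice(last, end);
      flush();
      last = end;
    }
    current += token.slice(last);
  });

  flush(); // emit whatever is left without closing punctuation
  return out;
}

console.log(wrapSentences('This is a sample of a <a href="#">link</a> getting chewed up. And a second one!'));
```

In a content script this could be applied per paragraph with something like p.innerHTML = wrapSentences(p.innerHTML). A sentence boundary that falls inside a child element is deliberately ignored here, which avoids producing mis-nested tags at the cost of occasionally wrapping two sentences together.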

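However

If you simply want to extract the inner text and are not interested in keeping any of the tag data, you can use this regex and replace all matches with an empty string: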

(<[^>]*>)
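In JavaScript, that replacement might look like the sketch below. The function name stripTags is hypothetical, and the pattern assumes that no attribute value contains a ">" character; the g flag is needed so that every tag is removed rather than only the first.

```javascript
// Hypothetical helper: remove every tag and keep only the text between them.
// Assumption: no ">" appears inside an attribute value.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

console.log(stripTags('<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>'));
```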


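Retain sentences, including child tags

((?:<p(?:\s[^>]*)?>).*?</p>) - retains the paragraph tags and the entire sentence, but no data outside the paragraph
(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>) - retains only the paragraph's inner text, including all child tags, and stores the sentence in group 1
(<p(?:\s[^>]*)?>)(.*?)(</p>) - captures the opening and closing paragraph tags and the inner text, including any child tags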
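To make the capture groups concrete, the following sketch runs the three patterns through JavaScript's RegExp.prototype.exec on a sample paragraph. The variable names are illustrative only, and the forward slash in the closing tag is escaped because JavaScript regex literals require it.

```javascript
// Sample input (same paragraph as in the examples of this answer).
var html = '<p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p>';

// Pattern 1: group 1 holds the paragraph tags plus everything inside them.
var whole = /((?:<p(?:\s[^>]*)?>).*?<\/p>)/.exec(html);

// Pattern 2: group 1 holds only the inner content, child tags included.
var inner = /(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/.exec(html);

// Pattern 3: groups 1, 2 and 3 hold the opening tag, the inner content,
// and the closing tag.
var parts = /(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/.exec(html);

console.log(whole[1]);
console.log(inner[1]);
console.log(parts[1], '|', parts[2], '|', parts[3]);
```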

Granted, these are PowerShell examples, but the regex and the replace function should be similar in JavaScript.

$string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'

Write-Host "replace p tags with a new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'

Write-Host
Write-Host "insert p tag's inner text into a span new span tag and return the entire thing including the p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'

Yields

replace p tags with a new span tag
<img> not this stuff either</img><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span><a> other stuff</a>

insert p tag's inner text into a span new span tag and return the entire thing including the p tags
<img> not this stuff either</img><p class=SuperCoolStuff><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>
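Since the question needs JavaScript, here is a rough port of the two PowerShell replacements, offered as a sketch and not as part of the original answer. The function names are hypothetical. PowerShell's -replace operator replaces every match and is case-insensitive by default, so the JavaScript patterns use the g and i flags to behave the same way.

```javascript
// Hypothetical port of the first PowerShell example:
// replace each <p>...</p> with a span around its inner content.
function paragraphToSpan(html) {
  return html.replace(/(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/gi,
                      '<span class=sentence>$1</span>');
}

// Hypothetical port of the second PowerShell example:
// keep the <p> tags and wrap the inner content in a span.
function wrapParagraphBody(html) {
  return html.replace(/(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/gi,
                      '$1<span class=sentence>$2</span>$3');
}

var string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>';

console.log(paragraphToSpan(string));
console.log(wrapParagraphBody(string));
```

As in the PowerShell version, the dot does not match line breaks, so a paragraph that spans several lines would need [\s\S]*? in place of .*? for the inner group.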
