ruby - How do I slice a string into chunks of fewer than 500 characters without cutting inside < > brackets?

Question

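I want to split a string into chunks of fewer than 500 characters and put them into an array. That part is easy to solve.

My problem is that the string contains HTML code, and the splits should only happen outside the < > brackets. Does anyone know how to do that?

This is what I have so far.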

while article.length > 0 do
  textarr << article[0, 499]
  article[0, 499] = ""
end

Can someone tell me how to check that a split does not cut into the HTML code? Thanks.

score 3 · Accepted Answer

textarr = article.scan(/.{1,500}(?![^<>]*>)/m)

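This splits your string into chunks of up to 500 characters (as many as possible), shrinking a chunk where necessary so that the next angle bracket after it is not a closing one.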
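To make the backtracking visible, here is a minimal sketch of the same regex with a much smaller limit. The helper name split_outside_tags, its limit parameter, and the sample string are made up for illustration; only the pattern itself comes from the answer above.

```ruby
# Hypothetical wrapper around the scan shown above, with the chunk size
# made configurable so the behaviour is easy to see on a short string.
def split_outside_tags(html, limit = 500)
  # .{1,limit}     takes as many characters as possible, up to the limit
  # (?![^<>]*>)    rejects an end position whose next bracket is ">",
  #                which would mean the chunk ends inside a tag
  html.scan(/.{1,#{limit}}(?![^<>]*>)/m)
end

sample = 'Hello, <b>world</b>!'
chunks = split_outside_tags(sample, 8)
p chunks
# => ["Hello, ", "<b>world", "</b>!"]

# Cheap sanity check: nothing was lost if the chunks rebuild the input.
p chunks.join == sample
# => true
```

The first chunk is only 7 characters long: taking 8 would end right after the <, so the regex backs off by one. Keeping the chunks.join == sample check around seems worthwhile, because scan skips any characters it cannot match, and a tag longer than the limit leaves no valid split point inside it.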

score 1 · Accepted Answer

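Assuming you have well-formed HTML (no unencoded < outside of tags), you can detect that a chunk ends inside a tag by looking for an unmatched <. You can do that with a regular expression: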

"Lorem <b>ipsum</b" =~ /<[^>]*\Z/
# => 14

"Lorem <b>ipsum</b>" =~ /<[^>]*\Z/
# => nil

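To change your split so that it does not cut through tags, you can use this regex to get variable-length chunks (note that =~ returns the index at which the match occurs, or nil if there is no match):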

def chunk_length(chunk)
  chunk =~ /<[^>]*\Z/ || chunk.length
end

textarr = []
start = 0
while start < article.length
  length = chunk_length(article[start, 499])
  # probably should check for length == 0 here in case you get a really long tag!
  textarr << article[start, length]
  start += length
end

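The length == 0 check may be needed if a tag is very long; suppose you have something pathological like

<div class="lots of classes" style="some: 'raw css';" data-attribute="more stuff" ...

which on its own might exceed 500 characters. You would then reach a point where article[start, 499] begins with < but does not contain the closing >, so =~ returns 0 (because it matches at the very start of the string), start never advances, and you are stuck in an infinite loop.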
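One possible way to handle that case is sketched below. It is not part of the original answer: the method name split_html, the max parameter, and the fallback policy (emit the whole oversized tag as a single chunk that exceeds the limit) are assumptions, and a different policy such as raising an error may suit you better.

```ruby
# Same helper as above: index of an unclosed "<" at the end of the chunk,
# or the full chunk length if the chunk does not end inside a tag.
def chunk_length(chunk)
  chunk =~ /<[^>]*\Z/ || chunk.length
end

# Hypothetical wrapper around the loop above, with a guard for length == 0.
def split_html(article, max = 499)
  chunks = []
  start = 0
  while start < article.length
    length = chunk_length(article[start, max])
    if length.zero?
      # The window starts with a tag that does not close within `max`
      # characters. Assumed policy: keep the tag intact as one oversized
      # chunk; if no ">" exists at all, take the rest of the string.
      close = article.index(">", start)
      length = close ? close - start + 1 : article.length - start
    end
    chunks << article[start, length]
    start += length
  end
  chunks
end

sample = 'ab<span class="long-name">cd</span>ef'
p split_html(sample, 10)
# => ["ab", "<span class=\"long-name\">", "cd</span>e", "f"]
```

With the guard in place, start always advances by at least one character, so the loop terminates even on malformed input such as a trailing unclosed <.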
