ruby-on-rails - Regex to remove only the opening and closing HTML tags from a string?

Question

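For example, I would like to remove <div><p> and </p></div> from the string below. The regex should be able to remove an arbitrary number of tags from the beginning and end of the string.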

<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>

I have been tinkering with rubular.com, but without success. Thanks!

score 1 · Accepted Answer

 def remove_html_end_tags(html_str)
   html_str.match(/\<(.+)\>(?!\W*\<)(.+)\<\/\1\>/m)[2]
 end

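I did not notice the problem Alan Moore points out below, where \<(.+)> consumes multiple opening tags, which is odd, because I agree that it is incorrect. It should be changed to \<([^>\<]+)> or something similar to remove the ambiguity.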
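To make the difference concrete, here is a small sketch comparing the two opening-tag patterns in isolation. The one-line sample string is made up for illustration. On the question's own sample, the full expression appears to still find the right match through backtracking on the backreference, which may be why the problem went unnoticed.

```ruby
sample = "<div><p>x</p></div>"

# Greedy .+ runs to the last ">" on the line, so the captured "tag name"
# swallows every tag that follows.
greedy = sample.match(/<(.+)>/)[1]

# Excluding angle brackets stops the capture at the end of the first tag.
strict = sample.match(/<([^><]+)>/)[1]

puts greedy  # div><p>x</p></div
puts strict  # div
```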

 def remove_html_end_tags(html_str)
    html_str.match(/\<([^\>\<]+)\>(?!\W*?\<)(.+)\<\/\1\>/m)[2]
 end

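The idea is to capture everything between the opening and closing of the first tag encountered that is not immediately followed by another tag, even if there is whitespace in between.

Because I was not sure how to say (with a positive lookahead) "give me the first tag whose closing angle bracket is followed by at least one word character before the next opening angle bracket", I wrote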

\>(?!\W*\<)

which finds a closing angle bracket that is not followed only by non-word characters up to the next opening angle bracket.

Once you have identified the tag with that property, find its closing partner and return what lies between them.
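As a rough sketch of the idea above, the lookahead version can be run against the question's sample string. The method name `strip_outer_tags` and the fall-back of returning the input unchanged when nothing matches are assumptions added for illustration; the answer's method calls `[2]` directly on the match result, which raises if there is no match.

```ruby
sample = "<div><p>text to <span class=\"test\">test</span> the selection on.\n" \
         "Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"

# Hypothetical wrapper: same pattern as the answer, plus a guard for no match.
def strip_outer_tags(html_str)
  m = html_str.match(/<([^><]+)>(?!\W*?<)(.+)<\/\1>/m)
  m ? m[2] : html_str
end

puts strip_outer_tags(sample)
```

The `/m` flag matters here: in Ruby it lets `.` match newlines, so the capture can span both lines of the sample.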

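Here is another approach. Scan forward for tags and remove the first n. It will break on nested tags of the same type, but I would not use this approach for any real work.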

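def remove_first_n_html_tags(html_str, skip_count=0)
  matches = []
  # Collect every opening tag (name plus attributes) in document order.
  tags = html_str.scan(/\<([\w\s\_\-\d\"\'\=]+)\>/).flatten
  tags.each do |tag|
    # The closing tag uses only the tag name, without attributes.
    close_tag = "\/%s" % tag.split(/\s+/).first
    match_str = "<#{tag}>(.+)<#{close_tag}>"
    match = html_str.match(/#{match_str}/m)
    matches << match if match
  end
  # MatchData for the tag at index skip_count (nil if out of range).
  matches[skip_count]
end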

score 0 · Accepted Answer

This still involves some programming:

str = '<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>'

while (m = /\A<.+?>/.match(str)) && str.end_with?('</' + m[0][1..-1])
  str = str[m[0].size..-(m[0].size + 2)]
end

Cthulhu, are you out there?

score -1 · Accepted Answer

I am going to go ahead and answer my own question. Below is the programmatic route:

The input string goes into the first loop as an array in order to remove the front tags. The resulting string is looped through in reverse order in order to remove the end tags. The string is then reversed in order to put it in the correct order.

def remove_html_end_tags(html_str)

  str_no_start_tag = ''
  str_no_start_and_end_tag = ''

  a = html_str.split("")

  i = 0
  is_text = false
  while i <= (a.length - 1)
    if (a[i] == '<') && !is_text
      while (a[i] != '>')
        i += 1
      end
      i += 1
    else
      is_text = true
      str_no_start_tag << a[i]
      i += 1
    end
  end

  a = str_no_start_tag.split("")

  i = a.length - 1
  is_text = false
  while i >= 0
    if (a[i] == '>') && !is_text
      while (a[i] != '<')
        i -= 1
      end
      i -= 1
    else
      is_text = true
      str_no_start_and_end_tag << a[i]
      i -= 1
    end
  end

  str_no_start_and_end_tag.reverse!

end
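The same two passes can probably be written more compactly with two anchored substitutions, one at each end of the string. This is a sketch rather than a drop-in replacement: the method name `strip_edge_tags` is made up, it also skips whitespace between the edge tags (the loops above do not), and like the loops it does not check that the removed opening and closing tags correspond to each other.

```ruby
# Hypothetical helper: remove the run of tags at the start and at the end.
def strip_edge_tags(html_str)
  html_str
    .sub(/\A(?:\s*<[^>]+>)+/, "")   # leading run of tags, anchored at the start
    .sub(/(?:<[^>]+>\s*)+\z/, "")   # trailing run of tags, anchored at the end
end

puts strip_edge_tags("<div><p>a <b>b</b> c</p></div>")
```

Tags in the middle of the text are left alone because neither pattern can match without touching its anchor.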

score -1 · Accepted Answer

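(?:\<div.*?\>\<p.*?\>)|(?:\<\/p\>\<\/div\>) is the expression you need. But it does not cover every case. If you are trying to parse any possible combination of tags, you may want to look at other ways of parsing.

For example, this expression does not allow any whitespace between the div and p tags. If you want to allow that, add \s* between the \>\< parts of the tags, like so: (?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>)

As written, the expression expects the div and p tags to be lowercase. You may therefore want a way to accept upper- or lowercase for each letter, so that Div or dIV is found as well.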
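One way to handle mixed case is the `i` (case-insensitive) flag on the Ruby regex literal, which avoids spelling out both cases for every letter. A minimal sketch using the whitespace-tolerant expression from above; the input string is invented for the example.

```ruby
# Same expression as above, with /i so that DIV, Div, dIV, P, ... all match.
pattern = /(?:<div.*?>\s*<p.*?>)|(?:<\/p>\s*<\/div>)/i

mixed = "<DIV> <P>hello</P>\n</Div>"
result = mixed.gsub(pattern, "")

puts result  # hello
```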

Use gskinner's RegEx tool to test and learn regular expressions.

So your final Ruby code should look something like this:

# Ruby sample for showing the use of regular expressions

str = "<div><p>text to <span class=\"test\">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"

puts 'Before Regular Expression: "', str, '"'

str.gsub!(/(?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>)/, "")

puts 'After Regular Expression', str

system("pause")

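Edit: Replaced div*? and p*? with div.*? and p.*?, as suggested in the comments. Edit: This answer does not handle an arbitrary set of tags, only the two listed in the first line of the question.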
