html - Problem parsing with Ruby

Question

I have a small problem parsing a website with Nokogiri in Ruby.

This is what the site looks like:

<div id="post_message_111112" class="postcontent">

        Hee is text 1 
     here is another
      </div>
<div id="post_message_111111" class="postcontent">

            Here is text 2
    </div>

Here is my code to parse it:

 doc = Nokogiri::HTML(open(myNewLink))
 myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()

ii=0

 while ii!=myPost.length
     puts "#{ii}  #{myPost[ii].to_s().strip}"
   ii+=1
 end

My problem is that when it is displayed, the new line after "Hee is text 1" makes the result of to_a come out wrong:

myPost[0] = hee is text 1
myPost[1] = here is another
myPost[2] = here is text 2

I want each div to be its own message, like:

myPost[0] = hee is text 1 here is another
myPost[1] = here is text 2

How would I go about solving this? Thanks.

Update

I tried:

 myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()

myPost.each_with_index do |post, index|
  puts "#{index}  #{post.to_s().gsub(/\n/, ' ').strip}"
end

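I used post.to_s().gsub because it complained that gsub is not a method of post. But I still have the same problem. I know I am doing something wrong; it is just wrecking my head.

Update 2

I forgot to say that the new line is a <br />, and even with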

   doc.search('br').each do |n|
  n.replace('')
end

or

doc.search('br').remove

the problem still persists.

score 0 · Accepted Answer

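If you look at the myPost array, you will see that each div is actually its own message. The first one just happens to contain a newline character, \n. To replace it with a space, use #gsub(/\n/, ' '). So your loop looks like this: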

myPost.each_with_index do |post, index|
    puts "#{index}  #{post.to_s.gsub(/\n/, ' ').strip}"
end

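Edit:

To my limited knowledge of it, xpath can only find nodes. The child nodes are <br />, so you either have multiple text nodes in between them, or you have the div tags included in your search. There is surely a way to join the text between <br /> nodes, but I don't know it. Until you find it, here is something that works:

Replace your xpath match with "//div[@class='postcontent']"

Adjust the loop to remove the div tags: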

myPost.each_with_index do |post, index|
     post = post.to_s
     post.gsub!(/\n/, ' ')
     post.gsub!(/^<div[^>]*>/, '') # delete opening div tag
     post.gsub!(%r|</\s*div[^>]*>|, '') # delete closing div tag
     puts "#{index}  #{post.strip}"
end

score 0 · Accepted Answer

Here, let me clean that up for you:

doc.search('div.postcontent').each_with_index do |div, i|
  puts "#{i} #{div.text.gsub(/\s+/, ' ').strip}"
end
# 0 Hee is text 1 here is another
# 1 Here is text 2
