ruby - Searching for tags while parsing Wordpress XML with Nokogiri

Question

I have an XML file exported from a Wordpress blog that consists of quotes:

<item>
  <title>Brothers Karamazov</title>
  <content:encoded><![CDATA["I think that if the Devil doesn't exist and, consequently, man has created him, he has created him in his own image and likeness."]]></content:encoded>
  <category domain="post_tag" nicename="dostoyevsky"><![CDATA[Dostoyevsky]]></category>
  <category domain="post_tag" nicename="humanity"><![CDATA[humanity]]></category>
  <category domain="category" nicename="quotes"><![CDATA[quotes]]></category>
  <category domain="post_tag" nicename="the-devil"><![CDATA[the Devil]]></category>
</item>

I want to extract the title, author, content, and tags. Here is my code so far:

require "rubygems"
require "nokogiri"

doc = Nokogiri::XML(File.open("/Users/charliekim/Downloads/quotesfromtheunderground.wordpress.2013-04-14.xml"))

doc.css("item").each do |item|
  title   = item.at_css("title").text
  tag     = item.at_xpath("category").text
  content = item.at_xpath("content:encoded").text

  #each post will later be pushed to an array, but I'm not worried about that yet, so for now....
  puts "#{title} #{tag}"
end

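I'm struggling to get all of the tags from each item. I'm getting results like Brothers Karamazov Dostoyevsky, that is, only the first tag. I'm not worried about the formatting, since this is just a test to see whether it picks things up correctly. Does anyone know how I can do this?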

I'd also like to treat capitalized tags as the Author, so if you know how to do that, it would help as well, although I haven't tried it yet.

EDIT: I changed the code to:

doc.css("item").each do |item|
  title   = item.at_css("title").text
  content = item.at_xpath("content:encoded").text
  tag     = item.at_xpath("category").each do |category|
        category
  end

  puts "#{title}: #{tag}"
end

which returns:

Brothers Karamazov: [#<Nokogiri::XML::Attr:0x80878518 name="domain" value="post_tag">,     #<Nokogiri::XML::Attr:0x80878504 name="nicename" value="dostoyevsky">]

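This seems a bit more manageable. It messes up my plan to get the Author from the capitalized tag, but that's no big deal. How can I pull out just the second value?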

score 2 · Accepted Answer

You're using at_xpath and expecting it to return multiple results, but the at_ methods return only the first match.

You want something like this:

tags = item.xpath("category").map(&:text)

This returns an array containing the text of every category element in the item.

As for identifying the author, you can use a regular expression to select the items that start with a capital letter:

author = tags.select{|w| w =~ /^[A-Z]/}

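That selects every tag that starts with a capital letter and leaves the tags array itself unchanged. If you want to separate the author from the rest of the tags, you can use partition: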

author, tags = item.xpath("category").map(&:text).partition{|w| w =~ /^[A-Z]/}

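Note that in the examples above, author is an array and will contain all matching items (that is, every capitalized tag if there is more than one).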
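As a small illustration of the partition step on its own, the sketch below starts from a plain array holding the tag texts of the sample item, so it runs without Nokogiri. The names authors, other_tags, and author are illustrative, and \A is used in place of ^ so that only the very start of the string is tested; that is a stylistic choice, not something the answer requires.

```ruby
# Tag texts as item.xpath("category").map(&:text) would return them
# for the sample item in the question.
tags = ["Dostoyevsky", "humanity", "quotes", "the Devil"]

# partition returns two arrays: elements for which the block is truthy,
# followed by the elements for which it is not.
authors, other_tags = tags.partition { |tag| tag.match?(/\A[A-Z]/) }

# Take a single author when one is expected; first returns nil
# if no tag starts with a capital letter.
author = authors.first

puts author
puts other_tags.inspect
```

"the Devil" stays with the ordinary tags because only its second word is capitalized.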
