
I am trying to parse a New York Times RSS feed with Ruby's built-in RSS parser.

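require 'rss'
require 'open-uri'

nyt_url = 'http://www.nytimes.com/services/xml/rss/nyt/World.xml'
# URI.open is required on Ruby 3.0+; Rubies older than 2.5 use plain open(nyt_url)
URI.open(nyt_url) do |rss|
  @nyt_feed = RSS::Parser.parse(rss)
end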

And in the view file:

<h2>New York Times Feed</h2>
<% @nyt_feed.items.each do |item| %>
  <p>
    <%= link_to item.title, item.link %>
    <%= item.description %>
  </p>
<% end %>

But what I get out for the description looks something like this:

    Since air assaults by the Assad government picked up two weeks ago, 
knocking rebels in the south on their heels, Syrians have been arriving
at refuge camps in Jordan at a rate of about 2,000 a night.<img width='1' height='1' 
src='http://rss.nytimes.com/c/34625/f/642565/s/22f90a36/mf.gif' border='0'/><br/><br/><a 
href="http://da.feedsportal.com/r/139263791500/u/0/f/642565/c/34625/s/22f90a36/a2.htm"><img 
src="http://da.feedsportal.com/r/139263791500/u/0/f/642565/c/34625/s/22f90a36/a2.img" 
border="0"/></a><img width="1" height="1" 
src="http://pi.feedsportal.com/r/139263791500/u/0/f/642565/c/34625/s/22f90a36/a2t.img" 
border="0"/>

I have a similar situation with the Washington Post feed. How do I get the images to actually display, or at least get just the description text? Do I have to handle this with regular expressions, or is there a method on the parser object that I should be using?


1 Answer


Parsing XML or RSS (or HTML) with regular expressions alone is not a good idea, because it is hard to anticipate every possible way the tags can be nested.

Normally you would use an XML gem or library (such as libxml, Nokogiri, or Ox) to parse your RSS or XML data, but when the XML feed is very large this can use a lot of memory.

Try Ox or Nokogiri and see whether it works better for you than regular expressions.
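As a rough sketch of the parser-based approach: the description in feeds like this is itself an HTML fragment, so it can be handed to a parser instead of a regular expression. REXML, which ships with Ruby, stands in for Ox or Nokogiri here only to avoid extra dependencies. It accepts well-formed markup only, so a forgiving HTML parser such as Nokogiri is probably the safer choice for real feeds. The helper names `collect_text` and `description_text` and the sample string are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: walk an element and join the text of every
# text node beneath it, in document order.
def collect_text(element)
  element.children.map do |node|
    case node
    when REXML::Text    then node.value          # unescaped text content
    when REXML::Element then collect_text(node)  # recurse into child tags
    else ''                                      # comments etc. are skipped
    end
  end.join
end

# Hypothetical helper: return only the text of an HTML fragment,
# dropping tags such as the tracking <img> and <a> elements.
# Assumes the fragment is well-formed XML once wrapped in a root element.
def description_text(fragment)
  doc = REXML::Document.new("<root>#{fragment}</root>")
  collect_text(doc.root).strip
end

# Made-up sample shaped like the description shown in the question.
sample = "Syrians have been arriving at a rate of about 2,000 a night." \
         "<img width='1' height='1' src='http://example.com/mf.gif' border='0'/>" \
         "<br/><br/><a href=\"http://example.com/a2.htm\">" \
         "<img src=\"http://example.com/a2.img\" border=\"0\"/></a>"

puts description_text(sample)
```

In a view this would be called as `description_text(item.description)`, which leaves only the summary sentence and discards the tracking markup.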

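If your feed is very large and contains many articles, you could try cutting out the individual items/articles with a regular expression and then parsing the contents of each one separately with Ox or Nokogiri (this also works well when done in Resque jobs for parallel processing).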
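The cut-then-parse idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the miniature feed, the `item_pattern` regular expression, and the `parse_item` helper are all made up, and REXML from Ruby's standard distribution stands in for Ox or Nokogiri. Note that a chunk cut out this way loses any namespace declarations made on the feed's root element, so prefixed elements (for example `dc:creator`) may need extra handling.

```ruby
require 'rexml/document'

# Made-up miniature feed standing in for a very large one.
feed = <<~XML
  <rss version="2.0"><channel>
    <title>Example</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title><link>http://example.com/2</link></item>
  </channel></rss>
XML

# Cut out each <item>...</item> chunk without parsing the whole document.
# The non-greedy match assumes <item> elements are never nested.
item_pattern = %r{<item\b.*?</item>}m

# Hypothetical helper: parse one chunk on its own and pull out two fields.
def parse_item(chunk)
  item = REXML::Document.new(chunk).root
  { title: item.elements['title']&.text, link: item.elements['link']&.text }
end

# Each chunk is an independent string, so it is parsed in isolation.
items = feed.scan(item_pattern).map { |chunk| parse_item(chunk) }
items.each { |i| puts "#{i[:title]}: #{i[:link]}" }
```

Because each chunk is a plain string, it could equally be passed to a background job and parsed there instead of inside the `map` call.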

answered 2013-04-16T22:03:30.660