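I'm trying to write a parser (it parses streets and houses by zip code) with eventmachine and em-synchrony. The problem is that the website I want to parse has a nested structure: each zip code has many pages of streets, and those pages are paginated. So the algorithm is simple:
- for each zip code
- visit the zip code index page
- parse the index page
- parse the pagination
- for each paginated page, parse that page
- visit the zip code index page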
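To make the steps above concrete, here is a minimal synchronous sketch of the traversal in plain Ruby, with no EventMachine involved. `FAKE_SITE`, `fetch_page` and `crawl` are hypothetical names made up for illustration, and the in-memory hash only stands in for the real HTTP request and Nokogiri parsing.

```ruby
# Hypothetical stand-in for "fetch a page and parse it": for a zip code and an
# optional page number, return the parsed rows and any further page numbers.
# The nil key represents the index page (no ?page= parameter).
FAKE_SITE = {
  "10001" => { nil => { rows: ["Main St 1"], pages: [2, 3] },
               2   => { rows: ["Main St 2"], pages: [] },
               3   => { rows: ["Oak Ave 5"], pages: [] } },
  "10002" => { nil => { rows: ["Elm Rd 9"], pages: [] } }
}

def fetch_page(zipcode, page = nil)
  FAKE_SITE.fetch(zipcode).fetch(page)
end

# For each zip code: visit the index page, parse it, read its pagination,
# then visit and parse every paginated page.
def crawl(zipcodes)
  zipcodes.each_with_object({}) do |zipcode, result|
    index = fetch_page(zipcode)
    rows = index[:rows].dup
    index[:pages].each do |page|
      rows.concat(fetch_page(zipcode, page)[:rows])
    end
    result[zipcode] = rows
  end
end

p crawl(%w[10001 10002])
```

The inner loop over `index[:pages]` is the part that has to become concurrent HTTP calls in the real parser.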
Here is an example of such a parser (it works):
require "nokogiri"
require "em-synchrony"
require "em-synchrony/em-http"

def url page = nil
  url = "http://gistflow.com/all"
  url << "?page=#{page}" if page
  url
end

EM.synchrony do
  concurrency = 2
  # here [1] is array of index pages, for this template let it be just [1]
  results = EM::Synchrony::Iterator.new([1], concurrency).map do |index, iter|
    index_page = EM::HttpRequest.new(url).aget
    index_page.callback do
      # here we make some parsing and find out whether index page
      # has pagination. The worst case is that it has pagination
      pages = [2,3,4,5]
      unless pages.empty?
        # here we need to parse all pages
        # with urls like url(page)
        # how can I do it more efficiently?
      end
      iter.return "SUCC #{index}"
    end
    index_page.errback do
      iter.return "ERR #{index}"
    end
  end
  p results
  EM.stop
end
So the tricky part is inside this block:
unless pages.empty?
  # here we need to parse all pages
  # with urls like url(page)
  # how can I do it more efficiently?
end
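How can I make nested EM HTTP calls inside a synchrony iterator loop?
I've tried different approaches, but every time I either get an error such as "can't yield from root fiber" or the errback block gets called.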
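For context, here is a minimal plain-Ruby sketch of where the root fiber error seems to come from. It uses only core `Fiber` and no EventMachine, and the method names are made up for illustration. My assumption is that em-synchrony's blocking-style calls pause by calling `Fiber.yield`, and that the `callback` block of an `aget` request runs from the reactor loop, outside the fiber that `EM.synchrony` created, so a blocking-style call made there has no fiber to suspend.

```ruby
# Fiber.yield suspends the current fiber. Outside of any Fiber.new block there
# is no enclosing fiber to suspend, so Ruby raises FiberError.
def yield_outside_fiber
  Fiber.yield
  :no_error
rescue FiberError => e
  e.class
end

# Inside a fiber the same call is fine: it hands control back to the caller
# of #resume, and the next #resume continues right after the yield.
def yield_inside_fiber
  fiber = Fiber.new do
    Fiber.yield :paused
    :finished
  end
  [fiber.resume, fiber.resume]
end

p yield_outside_fiber
p yield_inside_fiber
```

If that assumption holds, the nested requests would have to be issued from code that still runs inside a fiber rather than from inside the `callback` block.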