ruby - Ruby Anemone spider adding a tag to each URL visited

Question

I have a crawl set up:

require 'anemone'

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

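However, I want the spider to add a Google Analytics anti-tracking tag to every URL it visits, without necessarily clicking the links.

I could run the spider once, store all of the URLs, and then use WATIR to run through them adding the tag, but I want to avoid this because it is slow and because I like the skip_links_like and page depth features.

How could I implement this?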

score 3 · Accepted Answer

You want to add something to the URL before it is loaded, right? You can use focus_crawl for that.

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
    anemone.focus_crawl do |page|
        page.links.map do |url|
            # url will be a URI (probably URI::HTTP) so adjust
            # url.query as needed here and then return url from
            # the block.
            url
        end
    end
    anemone.on_every_page do |page|
        puts page.url
    end
end

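The focus_crawl method is intended for filtering the list of URLs:

"Specify a block which will select which links to follow on each page. The block should return an Array of URI objects."

But you can also use it as a general-purpose URL filter.

For example, if you wanted to add atm_source=SiteCon&atm_medium=Mycampaign to all of the links, then your page.links.map would look something like this: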

page.links.map do |uri|
    # Grab the query string, break it into components, throw out
    # any existing atm_source or atm_medium components. The to_s
    # does nothing if there is a query string but turns a nil into
    # an empty string to avoid some conditional logic.
    q = uri.query.to_s.split('&').reject { |x| x =~ /^atm_(source|medium)=/ }

    # Add the atm_source and atm_medium that you want.
    q << 'atm_source=SiteCon' << 'atm_medium=Mycampaign'

    # Rebuild the query string 
    uri.query = q.join('&')

    # And return the updated URI from the block
    uri
end

If your atm_source or atm_medium values contain characters that are not URL-safe, URI-encode them.
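
One way to handle that encoding is to let Ruby's standard uri library rebuild the query string instead of joining it by hand. The following is only a minimal sketch and not part of Anemone: with_tracking_params is a hypothetical helper name, and the parameter values are made up to show the escaping. Two caveats: URI.decode_www_form can raise ArgumentError on a query with malformed percent-encoding, and a bare key such as "b" is written back as "b=".

```ruby
require 'uri'

# Hypothetical helper (not part of Anemone): returns a copy of the URI with
# the given tracking parameters set, replacing any existing values.
def with_tracking_params(uri, params)
  tagged = uri.dup

  # decode_www_form turns "a=1&b=2" into [["a", "1"], ["b", "2"]];
  # to_s turns a nil query into an empty string, which decodes to [].
  pairs = URI.decode_www_form(tagged.query.to_s)

  # Throw out any existing copies of the parameters we are about to add.
  pairs.reject! { |key, _| params.key?(key) }

  # encode_www_form percent-encodes keys and values where needed.
  tagged.query = URI.encode_www_form(pairs + params.to_a)
  tagged
end

uri = URI('http://www.website.co.uk/page?x=1&atm_source=old')
puts with_tracking_params(uri, 'atm_source' => 'Site Con',
                               'atm_medium' => 'My&campaign')
# prints http://www.website.co.uk/page?x=1&atm_source=Site+Con&atm_medium=My%26campaign
```

Inside the focus_crawl block, this would replace the body of page.links.map, for example page.links.map { |uri| with_tracking_params(uri, 'atm_source' => 'SiteCon', 'atm_medium' => 'Mycampaign') }.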
