ruby - Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Question

Suppose I am trying to crawl a website and want to skip any page whose URL ends like this:

http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117

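I am currently building the crawler in Ruby with the Anemone gem. I am using the skip_links_like method, but my pattern never seems to match. I am trying to keep it as generic as possible, so that it does not depend on subpage but only on the trailing =2105925 (digits).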

I have tried /=\d+$/ and /\?.*\d+$/, but neither seems to work.

This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I cannot get it to work with digits instead of extensions.

Also, testing the pattern =\d+$ at http://regexpal.com/ successfully matches http://misc.com/test/index.php?page=news&subpage=20060118

Edit:

Here is my full code. I wonder whether anyone can see exactly what the problem is.

require 'anemone'
...
Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone|
  anemone.skip_links_like /\?.*\d+$/
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end

My output looks like this:

...
Now checking: http://MISC.com/about_us/index.php?page=press_and_news&subpage=20110711
Successfully checked
...

score 3 · Accepted Answer

Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end
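
Note that, as I understand the option, :skip_query_strings skips every link that carries a query string, which is broader than skipping only URLs that end in digits. If that is too aggressive, a narrower filter could be sketched as below. The helper names numeric_tail? and links_to_follow and the example.com URLs are hypothetical, made up for illustration; only Ruby's standard URI library is used.

```ruby
require 'uri'

# Hypothetical pattern: the full URL string ends in "=<digits>".
NUMERIC_TAIL = /=\d+\z/

# Hypothetical helper: true when the whole URL (query string included)
# ends in an equals sign followed by digits.
def numeric_tail?(uri)
  uri.to_s.match?(NUMERIC_TAIL)
end

# Hypothetical helper: drop the links we do not want to crawl.
def links_to_follow(links)
  links.reject { |link| numeric_tail?(link) }
end

# Illustrative URLs only, not taken from the real site.
links = [
  URI('http://example.com/about_us/index.php?page=press_and_news'),
  URI('http://example.com/about_us/index.php?page=press_and_news&subpage=20110711')
]

puts links_to_follow(links).map(&:to_s)
```

Inside the crawl block this could probably be wired in with Anemone's focus_crawl, for example anemone.focus_crawl { |page| links_to_follow(page.links) }, in place of the skip_links_like call. I have not tested this against the site in the question.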

score 2 · Accepted Answer

Actually, /\?.*\d+$/ does work:

~> irb
> all systems are go wirble/hirb/ap/show <
ruby-1.9.2-p180 :001 > "http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117".match /\?.*\d+$/
 => #<MatchData "?page=press_and_news&subpage=20060117">
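
One possible explanation for why the pattern matches in irb but not in the crawler, offered as an assumption about Anemone's internals rather than something confirmed in this thread: skip_links_like may test each pattern against the link's path only, and the path of a URI does not include the query string. The sketch below uses only Ruby's standard URI library to show the difference.

```ruby
require 'uri'

url = URI('http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117')
pattern = /\?.*\d+$/

# The full URL string contains "?..." followed by digits, so it matches.
full_match = !(url.to_s =~ pattern).nil?

# The path stops before the "?", so the same pattern cannot match it.
path_match = !(url.path =~ pattern).nil?

puts "path:  #{url.path}"
puts "query: #{url.query}"
puts "matches full URL?  #{full_match}"
puts "matches path only? #{path_match}"
```

If that assumption holds, no pattern that refers to the query string can ever match in skip_links_like, which would be consistent with the output shown in the question.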
