linkedin - Crawling Twitter and LinkedIn with Nutch

Question

I have been trying to crawl Twitter and LinkedIn data with Nutch 0.9.

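However, when I try to crawl Twitter, the regex filter does not seem to work. My regex-filter file contains +^https://([a-z0-9]*.) twitter.com/a, and all I want is to crawl only the URLs that follow that pattern. Instead, I end up with URLs such as https://twitter.com/document.
As for LinkedIn, every crawl attempt ends in a timeout. LinkedIn's robots.txt says you need to email them to get your crawler whitelisted, but they never reply.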

Thanks for your help!

score 0 · Accepted Answer

As far as I know, Nutch does not support crawling Twitter and LinkedIn data. To collect Twitter data you should use the Twitter API; see http://twitter4j.org/en/. For LinkedIn data, you could have a look at https://github.com/pondering/scrapy-linkedin.

Hope this helps

score 0 · Accepted Answer

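If you want to crawl only URLs that match this specific pattern, you should also include the following line after your + rule (the first rule that matches a URL decides whether it is kept):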

-.*

This rule excludes all other URLs! Also, if you want to crawl Twitter or LinkedIn, you can use a dedicated client such as twit4j or linkedin-j!
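The effect of that extra line can be sketched outside Nutch. The snippet below is a minimal Python simulation of a first-match-wins URL filter, not Nutch's own code. It assumes the asker's pattern was meant to be `+^https://([a-z0-9]*\.)*twitter.com/a` (a guess at the intended rule) and that the filter file otherwise ended with an accept-everything rule such as `+.`. The helper names `load_rules` and `accepts` are hypothetical.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides.

    A URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Assumed reconstruction of the asker's rule, followed by two possible endings.
PATTERN = r"+^https://([a-z0-9]*\.)*twitter.com/a"
with_accept_all = load_rules(PATTERN + "\n+.")   # assumed original file ending
with_reject_all = load_rules(PATTERN + "\n-.*")  # ending suggested in this answer

for url in ("https://twitter.com/about", "https://twitter.com/document"):
    print(url, accepts(with_accept_all, url), accepts(with_reject_all, url))
```

Under these assumptions, https://twitter.com/document does not match the + rule, so it falls through to the last line: a trailing `+.` lets it in, while a trailing `-.*` drops it. If the real file differs from the assumed one, the same check can be rerun with its actual rules.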
