I want to disable the robots.txt check in Nutch and crawl everything from a website.
By "disable" I mean skipping the robots.txt check before fetching or parsing any site. Is this possible?
2 Answers
Although this question is old, I personally feel it is still worth answering.
Yes, it is possible to disable the robots.txt handling (but you will need to modify and build the Nutch source code).
Note: Nutch does not provide any configuration option to disable fetching robots.txt before the actual URLs are fetched. What you describe can easily amount to URL/domain abuse, since you want to access a site regardless of what it declares about its resources via robots.txt.
How can it be done? If you have a genuine custom use case that requires skipping robots.txt, you can do the following.
Most of the protocol plugins in Nutch (protocol-(http|httpclient|selenium|okhttp)) use the HttpRobotRulesParser class to fetch and parse the contents of robots.txt.
In HttpRobotRulesParser, this is the method that fetches, parses, and returns the rules object:
public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {

  if (LOG.isTraceEnabled() && isWhiteListed(url)) {
    LOG.trace("Ignoring robots.txt (host is whitelisted) for URL: {}", url);
  }

  String cacheKey = getCacheKey(url);
  BaseRobotRules robotRules = CACHE.get(cacheKey);

  if (robotRules != null) {
    return robotRules; // cached rule
  } else if (LOG.isTraceEnabled()) {
    LOG.trace("cache miss " + url);
  }

  boolean cacheRule = true;
  URL redir = null;

  if (isWhiteListed(url)) {
    // check in advance whether a host is whitelisted
    // (we do not need to fetch robots.txt)
    robotRules = EMPTY_RULES;
    LOG.info("Whitelisted host found for: {}", url);
    LOG.info("Ignoring robots.txt for all URLs from whitelisted host: {}",
        url.getHost());
  } else {
    try {
      URL robotsUrl = new URL(url, "/robots.txt");
      Response response = ((HttpBase) http).getResponse(robotsUrl,
          new CrawlDatum(), false);
      if (robotsTxtContent != null) {
        addRobotsContent(robotsTxtContent, robotsUrl, response);
      }
      // try one level of redirection ?
      if (response.getCode() == 301 || response.getCode() == 302) {
        String redirection = response.getHeader("Location");
        if (redirection == null) {
          // some versions of MS IIS are known to mangle this header
          redirection = response.getHeader("location");
        }
        if (redirection != null) {
          if (!redirection.startsWith("http")) {
            // RFC says it should be absolute, but apparently it isn't
            redir = new URL(url, redirection);
          } else {
            redir = new URL(redirection);
          }
          response = ((HttpBase) http).getResponse(redir, new CrawlDatum(), false);
          if (robotsTxtContent != null) {
            addRobotsContent(robotsTxtContent, redir, response);
          }
        }
      }
      if (response.getCode() == 200) // found rules: parse them
        robotRules = parseRules(url.toString(), response.getContent(),
            response.getHeader("Content-Type"), agentNames);
      else if ((response.getCode() == 403) && (!allowForbidden))
        robotRules = FORBID_ALL_RULES; // use forbid all
      else if (response.getCode() >= 500) {
        // cacheRule = false; // try again later to fetch robots.txt
        robotRules = EMPTY_RULES;
      } else
        robotRules = EMPTY_RULES; // use default rules
    } catch (Throwable t) {
      if (LOG.isInfoEnabled()) {
        LOG.info("Couldn't get robots.txt for " + url + ": " + t.toString());
      }
      // cacheRule = false; // try again later to fetch robots.txt
      robotRules = EMPTY_RULES;
    }
  }

  if (cacheRule) {
    CACHE.put(cacheKey, robotRules); // cache rules for host
    if (redir != null && !redir.getHost().equalsIgnoreCase(url.getHost())
        && "/robots.txt".equals(redir.getFile())) {
      // cache also for the redirected host
      // if the URL path is /robots.txt
      CACHE.put(getCacheKey(redir), robotRules);
    }
  }

  return robotRules;
}
You can simply replace it with the following method:
@Override
public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {
  return EMPTY_RULES; // always return empty rules to skip robots.txt access
}
You are simply mocking this behaviour by returning EMPTY_RULES.
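For intuition, here is a minimal standalone sketch (assuming crawler-commons, the library Nutch delegates robots parsing to, is on the classpath, and that EMPTY_RULES is, as in recent Nutch sources, a SimpleRobotRules built in ALLOW_ALL mode) showing why returning it amounts to skipping robots.txt entirely: such a rule set reports every URL as allowed.
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class EmptyRulesDemo {
  public static void main(String[] args) {
    // An ALLOW_ALL rule set, equivalent to what EMPTY_RULES is assumed to be.
    BaseRobotRules allowAll = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);

    // Every URL is reported as allowed, so nothing is ever blocked by robots.txt.
    System.out.println(allowAll.isAllowed("https://example.com/some/private/page")); // true
    System.out.println(allowAll.isAllowed("https://example.com/anything/else"));     // true
  }
}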
A small tip: it is always recommended to read robots.txt and only access the resources it allows.
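For completeness, a rough fragment of what honouring those rules looks like in client code (robotRulesParser, http, url, fetchPage and LOG are placeholders here; inside Nutch the equivalent check is done by the fetcher/protocol layer):
// Hypothetical illustration: consult the parsed rules before fetching a page.
BaseRobotRules rules = robotRulesParser.getRobotRulesSet(http, url, null);
if (rules.isAllowed(url.toString())) {
  fetchPage(url);                              // placeholder for the actual fetch
} else {
  LOG.info("Blocked by robots.txt: {}", url);  // skip disallowed URLs
}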
Answered 2020-09-14T18:24:04.477
As far as I understand, we cannot disable robots.txt in Nutch.
Answered 2013-03-01T21:28:25.510