I want to disable the robots.txt check in Nutch and crawl everything from a website.
By "disable" I mean skipping the robots.txt check before fetching or parsing any site. Is this possible?
2 Answers
Although this question is old, I personally feel it is still worth answering.
Yes, it is possible to disable the robots.txt handling (but you will need to modify and build the Nutch source code).
Note: Nutch does not provide any configuration option to disable fetching robots.txt before the actual URLs are fetched. What you describe can easily amount to URL/domain abuse, since you want to access a site regardless of what it declares about its resources via robots.txt.
How can it be done? If you have a genuine custom use case that requires skipping robots.txt, you can do the following.
Most of the protocol plugins in Nutch (protocol-(http|httpclient|selenium|okhttp)) use the HttpRobotRulesParser class to fetch and parse the contents of robots.txt.
In HttpRobotRulesParser, this is the method that fetches, parses, and returns the rules object:
public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {

  if (LOG.isTraceEnabled() && isWhiteListed(url)) {
    LOG.trace("Ignoring robots.txt (host is whitelisted) for URL: {}", url);
  }

  String cacheKey = getCacheKey(url);
  BaseRobotRules robotRules = CACHE.get(cacheKey);

  if (robotRules != null) {
    return robotRules; // cached rule
  } else if (LOG.isTraceEnabled()) {
    LOG.trace("cache miss " + url);
  }

  boolean cacheRule = true;
  URL redir = null;

  if (isWhiteListed(url)) {
    // check in advance whether a host is whitelisted
    // (we do not need to fetch robots.txt)
    robotRules = EMPTY_RULES;
    LOG.info("Whitelisted host found for: {}", url);
    LOG.info("Ignoring robots.txt for all URLs from whitelisted host: {}",
        url.getHost());
  } else {
    try {
      URL robotsUrl = new URL(url, "/robots.txt");
      Response response = ((HttpBase) http).getResponse(robotsUrl,
          new CrawlDatum(), false);
      if (robotsTxtContent != null) {
        addRobotsContent(robotsTxtContent, robotsUrl, response);
      }
      // try one level of redirection ?
      if (response.getCode() == 301 || response.getCode() == 302) {
        String redirection = response.getHeader("Location");
        if (redirection == null) {
          // some versions of MS IIS are known to mangle this header
          redirection = response.getHeader("location");
        }
        if (redirection != null) {
          if (!redirection.startsWith("http")) {
            // RFC says it should be absolute, but apparently it isn't
            redir = new URL(url, redirection);
          } else {
            redir = new URL(redirection);
          }
          response = ((HttpBase) http).getResponse(redir, new CrawlDatum(), false);
          if (robotsTxtContent != null) {
            addRobotsContent(robotsTxtContent, redir, response);
          }
        }
      }
      if (response.getCode() == 200) // found rules: parse them
        robotRules = parseRules(url.toString(), response.getContent(),
            response.getHeader("Content-Type"), agentNames);
      else if ((response.getCode() == 403) && (!allowForbidden))
        robotRules = FORBID_ALL_RULES; // use forbid all
      else if (response.getCode() >= 500) {
        // cacheRule = false; // try again later to fetch robots.txt
        robotRules = EMPTY_RULES;
      } else
        robotRules = EMPTY_RULES; // use default rules
    } catch (Throwable t) {
      if (LOG.isInfoEnabled()) {
        LOG.info("Couldn't get robots.txt for " + url + ": " + t.toString());
      }
      // cacheRule = false; // try again later to fetch robots.txt
      robotRules = EMPTY_RULES;
    }
  }

  if (cacheRule) {
    CACHE.put(cacheKey, robotRules); // cache rules for host
    if (redir != null && !redir.getHost().equalsIgnoreCase(url.getHost())
        && "/robots.txt".equals(redir.getFile())) {
      // cache also for the redirected host
      // if the URL path is /robots.txt
      CACHE.put(getCacheKey(redir), robotRules);
    }
  }

  return robotRules;
}
You can simply replace it with the following method:
@Override
public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {
  return EMPTY_RULES; // always return empty rules to skip robots.txt access
}
You are simply mocking this behaviour by returning EMPTY_RULES.
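For intuition, here is a minimal standalone sketch (assuming crawler-commons, the library Nutch delegates robots parsing to, is on the classpath, and that EMPTY_RULES is, as in recent Nutch sources, a SimpleRobotRules built in ALLOW_ALL mode) showing why returning it amounts to skipping robots.txt entirely: such a rule set reports every URL as allowed.
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class EmptyRulesDemo {
  public static void main(String[] args) {
    // An ALLOW_ALL rule set, equivalent to what EMPTY_RULES is assumed to be.
    BaseRobotRules allowAll = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);

    // Every URL is reported as allowed, so nothing is ever blocked by robots.txt.
    System.out.println(allowAll.isAllowed("https://example.com/some/private/page")); // true
    System.out.println(allowAll.isAllowed("https://example.com/anything/else"));     // true
  }
}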
A small tip: it is always recommended to read robots.txt and only access the resources it allows.
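For completeness, a rough fragment of what honouring those rules looks like in client code (robotRulesParser, http, url, fetchPage and LOG are placeholders here; inside Nutch the equivalent check is done by the fetcher/protocol layer):
// Hypothetical illustration: consult the parsed rules before fetching a page.
BaseRobotRules rules = robotRulesParser.getRobotRulesSet(http, url, null);
if (rules.isAllowed(url.toString())) {
  fetchPage(url);                              // placeholder for the actual fetch
} else {
  LOG.info("Blocked by robots.txt: {}", url);  // skip disallowed URLs
}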
Answered 2020-09-14T18:24:04.477
As far as I understand, we cannot disable robots.txt in Nutch.
Answered 2013-03-01T21:28:25.510