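The web crawler Apache Nutch comes with built-in support for NTLM. I am trying to use version 1.7 to crawl a website (Windows SharePoint) with NTLM authentication. I have set up Nutch according to https://wiki.apache.org/nutch/HttpAuthenticationSchemes, which in particular means that I have the credentials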

<credentials username="rickert" password="mypassword">
  <authscope host="server-to-be-crawled.com" port="80" realm="CORP" scheme="NTLM"/>
</credentials>

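configured. When I look at the log files, I can see that Nutch tries to access the seed URL and goes through the "normal" NTLM cycle: it gets a 401 error on the first GET, extracts the NTLM challenge, and sends the NTLM authentication in the next GET (over a keep-alive connection). However, the second GET does not succeed either.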
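One way to follow this 401 / challenge / response cycle in the logs is to look at the NTLM tokens carried in the `WWW-Authenticate` and `Authorization` headers. The following is a minimal sketch, assuming the header values are copied out of the log by hand. The class `NtlmHeaderInspector` and its method `messageType` are hypothetical helpers, not part of Nutch or HttpClient. The sketch relies only on the fact that every NTLM message starts with the `NTLMSSP` signature followed by a little-endian message type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical debugging helper; not part of Nutch or HttpClient. */
public class NtlmHeaderInspector {

    /** Every NTLM message starts with this 8-byte signature. */
    private static final byte[] SIGNATURE = "NTLMSSP\0".getBytes(StandardCharsets.US_ASCII);

    /**
     * Classifies the value of a WWW-Authenticate or Authorization header.
     *
     * @return 0 for a bare "NTLM" value (the server's initial offer),
     *         1, 2 or 3 for a value carrying a Type 1/2/3 message,
     *         -1 if the value is not a recognisable NTLM header
     */
    public static int messageType(String headerValue) {
        if (headerValue == null) {
            return -1;
        }
        String value = headerValue.trim();
        if (value.equalsIgnoreCase("NTLM")) {
            return 0;
        }
        if (!value.regionMatches(true, 0, "NTLM ", 0, 5)) {
            return -1;
        }
        byte[] msg;
        try {
            msg = Base64.getDecoder().decode(value.substring(5).trim());
        } catch (IllegalArgumentException e) {
            return -1; // token is not valid base64
        }
        if (msg.length < 12) {
            return -1;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (msg[i] != SIGNATURE[i]) {
                return -1;
            }
        }
        // The message type is a little-endian 32-bit integer at offset 8.
        int type = (msg[8] & 0xff) | (msg[9] & 0xff) << 8
                | (msg[10] & 0xff) << 16 | (msg[11] & 0xff) << 24;
        return (type >= 1 && type <= 3) ? type : -1;
    }

    public static void main(String[] args) {
        System.out.println(messageType("NTLM"));                  // prints 0
        System.out.println(messageType("NTLM TlRMTVNTUAABAAAA")); // prints 1
    }
}
```

In a complete handshake the values should appear in the order 0 (server offer), 1 (client request), 2 (server challenge), 3 (client response). A further 401 after the Type 3 message suggests that the server rejected the response itself rather than the handshake being cut short.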

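That is when I suspected some basic problem with my credentials or with my particular setup: I am running Nutch in a Debian guest VirtualBox on a Windows host. To my surprise, though, both wget and curl were able to retrieve the document from within the Debian guest using my credentials. Interestingly, both command-line tools need only a username and a password to work. The full NTLM specification, on the other hand, also requires a host. According to the specification, the host is the one the request originates from, which I interpret as the host running the http-agent, i.e. one in the Windows domain associated with the username. My assumption is that both tools simply leave these details blank.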

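This is where the Nutch configuration comes in: the host is supposedly provided as http.agent.host in the configuration file. The domain is supposed to be configured as the realm of the credentials, but the documentation says that this is a convention and not really necessary. In any case, it does not matter whether I set a realm or not; the result is the same. Looking at the log files again, I can see messages saying that authentication is resolved using <any_realm>@server-to-be-crawled.com, no matter which realm I use.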

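My gut feeling is that there is an error in the mapping of the Nutch configuration values onto the NTLM parameters required by the httpclient Java class that performs the GET. I am at a loss. Can anybody give me some hints on how to debug this further? Does anybody have a concrete configuration that works for a SharePoint server? Thanks!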

1 Answer

This is an old thread, but it seems to be a common problem and I finally found a solution.

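In my case, the problem was that the content source I was trying to crawl was hosted on a fairly recent IIS server. Checking the headers showed that it was using NTLM, but after reading that Apache Commons HttpClient v3.x supports only NTLMv1 and not NTLMv2, I went looking for a way to add that support to nutch v1.15 without upgrading to the newer HttpComponents version of HttpClient.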

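The clue is in the documentation for the newer HC version of HttpClient. Using that JCIFS-based approach, I managed to modify the nutch protocol-httpclient Http class so that it authenticates with my new JCIFS-based NTLM scheme. The steps to do this: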

  1. Create a new JCIFS-based NTLMScheme
  2. In Http.configureClient, register the new scheme
  3. Add JCIFS to the nutch protocol-httpclient plugin classpath

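With that done, I was able to crawl NTLMv2-protected websites.

By adding a lot of extra logging, I could see the details of the authentication handshake, which showed that it was indeed using NTLMv2.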
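One way to check this from a logged Type 3 message is to look at the length of the NT response it carries: an NTLMv1 NT response is always 24 bytes, whereas an NTLMv2 response is longer. The following is a minimal sketch, assuming the base64 token after "NTLM " has been copied from the trace log. `NtlmType3Inspector` and its methods are hypothetical helpers, not part of Nutch, HttpClient or JCIFS, and `syntheticType3` only builds header-only test data for the demo.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical debugging helper; not part of Nutch, HttpClient or JCIFS. */
public class NtlmType3Inspector {

    /** Reads an unsigned little-endian 16-bit value. */
    private static int u16(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }

    /**
     * Returns the length in bytes of the NT response in a base64-encoded
     * Type 3 message, or -1 if the input is not a Type 3 message.
     */
    public static int ntResponseLength(String base64Type3) {
        byte[] msg;
        try {
            msg = Base64.getDecoder().decode(base64Type3.trim());
        } catch (IllegalArgumentException e) {
            return -1; // not valid base64
        }
        // Layout: 8-byte signature, 4-byte type, LM security buffer at
        // offset 12, NT security buffer at offset 20 (length comes first).
        if (msg.length < 28) {
            return -1;
        }
        String signature = new String(msg, 0, 8, StandardCharsets.US_ASCII);
        boolean isType3 = msg[8] == 3 && msg[9] == 0 && msg[10] == 0 && msg[11] == 0;
        if (!signature.equals("NTLMSSP\0") || !isType3) {
            return -1;
        }
        return u16(msg, 20);
    }

    /** Turns the NT response length into a human-readable hint. */
    public static String describe(String base64Type3) {
        int len = ntResponseLength(base64Type3);
        if (len < 0) {
            return "not a Type 3 message";
        }
        if (len == 0) {
            return "no NT response present";
        }
        if (len == 24) {
            return "24-byte NT response (NTLMv1 or NTLM2 session response)";
        }
        return "NT response of " + len + " bytes (consistent with NTLMv2)";
    }

    /** Builds a synthetic, header-only Type 3 message for the demo below. */
    public static String syntheticType3(int ntLength) {
        byte[] msg = new byte[64];
        byte[] sig = "NTLMSSP\0".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(sig, 0, msg, 0, sig.length);
        msg[8] = 3; // message type 3, little-endian
        msg[20] = (byte) (ntLength & 0xff);
        msg[21] = (byte) ((ntLength >> 8) & 0xff);
        return Base64.getEncoder().encodeToString(msg);
    }

    public static void main(String[] args) {
        System.out.println(describe(syntheticType3(24)));
        System.out.println(describe(syntheticType3(142)));
    }
}
```

The length check is only a heuristic: it tells NTLMv2 apart from the 24-byte response forms, but it does not validate the response itself.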

The change in Http.configureClient looks like this:

  /** Configures the HTTP client */
  private void configureClient() {
    LOG.info("Setting new NTLM scheme: " + JcifsNtlmScheme.class.getName());
    AuthPolicy.registerAuthScheme(AuthPolicy.NTLM, JcifsNtlmScheme.class);
    ...
  }

The new NTLM scheme implementation looks like this (it needs some tidying up).


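import java.io.IOException;

import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.NTCredentials;
import org.apache.commons.httpclient.auth.AuthChallengeParser;
import org.apache.commons.httpclient.auth.AuthScheme;
import org.apache.commons.httpclient.auth.AuthenticationException;
import org.apache.commons.httpclient.auth.InvalidCredentialsException;
import org.apache.commons.httpclient.auth.MalformedChallengeException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JcifsNtlmScheme implements AuthScheme {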

    public static final Logger LOG = LoggerFactory.getLogger(JcifsNtlmScheme.class);

    /** NTLM challenge string. */
    private String ntlmchallenge = null;

    private static final int UNINITIATED = 0;
    private static final int INITIATED = 1;
    private static final int TYPE1_MSG_GENERATED = 2;
    private static final int TYPE2_MSG_RECEIVED = 3;
    private static final int TYPE3_MSG_GENERATED = 4;
    private static final int FAILED = Integer.MAX_VALUE;

    /** Authentication process state */
    private int state;

    public JcifsNtlmScheme() throws AuthenticationException {
        // Check if JCIFS is present. If not present, do not proceed.
        try {
            Class.forName("jcifs.ntlmssp.NtlmMessage", false, this.getClass().getClassLoader());
            LOG.trace("jcifs.ntlmssp.NtlmMessage is present");
        } catch (ClassNotFoundException e) {
            throw new AuthenticationException("Unable to proceed as JCIFS library is not found.");
        }
    }

    public String authenticate(Credentials credentials, HttpMethod method) throws AuthenticationException {
        LOG.trace("authenticate called. State: " + this.state);
        if (this.state == UNINITIATED) {
            throw new IllegalStateException("NTLM authentication process has not been initiated");
        }

        NTCredentials ntcredentials = null;
        try {
            ntcredentials = (NTCredentials) credentials;
        } catch (ClassCastException e) {
            throw new InvalidCredentialsException(
                    "Credentials cannot be used for NTLM authentication: " + credentials.getClass().getName());
        }

        NTLM ntlm = new NTLM();
        String charset = method.getParams().getCredentialCharset();
        LOG.trace("Setting credential charset to: " + charset);
        ntlm.setCredentialCharset(charset);

        String response = null;
        if (this.state == INITIATED || this.state == FAILED) {
            LOG.trace("Generating TYPE1 message");
            response = ntlm.generateType1Msg(ntcredentials.getHost(), ntcredentials.getDomain());
            this.state = TYPE1_MSG_GENERATED;
        } else {
            LOG.trace("Generating TYPE3 message");
            response = ntlm.generateType3Msg(ntcredentials.getUserName(), ntcredentials.getPassword(),
                    ntcredentials.getHost(), ntcredentials.getDomain(), this.ntlmchallenge);
            this.state = TYPE3_MSG_GENERATED;
        }

        String result = "NTLM " + response;
        return result;

    }

    public String authenticate(Credentials credentials, String method, String uri) throws AuthenticationException {
        throw new RuntimeException("Not implemented as it is deprecated anyway in Httpclient 3.x");
    }

    public String getID() {
        throw new RuntimeException("Not implemented as it is deprecated anyway in Httpclient 3.x");
    }

    /**
     * Returns the authentication parameter with the given name, if available.
     *
     * There are no valid parameters for NTLM authentication so this method always
     * returns null.
     *
     * @param name The name of the parameter to be returned
     *
     * @return the parameter with the given name
     */
    public String getParameter(String name) {
        if (name == null) {
            throw new IllegalArgumentException("Parameter name may not be null");
        }
        return null;
    }

    /**
     * The concept of an authentication realm is not supported by the NTLM
     * authentication scheme. Always returns null.
     *
     * @return null
     */
    public String getRealm() {
        return null;
    }

    /**
     * Returns textual designation of the NTLM authentication scheme.
     *
     * @return ntlm
     */
    public String getSchemeName() {
        return "ntlm";
    }

    /**
     * Tests if the NTLM authentication process has been completed.
     *
     * @return true if NTLM authorization has been processed,
     *         false otherwise.
     *
     * @since 3.0
     */
    public boolean isComplete() {
        boolean result = this.state == TYPE3_MSG_GENERATED || this.state == FAILED;
        LOG.trace("isComplete? " + result);
        return result;
    }

    /**
     * Returns true. NTLM authentication scheme is connection based.
     *
     * @return true.
     *
     * @since 3.0
     */
    public boolean isConnectionBased() {
        return true;
    }

    /**
     * Processes the NTLM challenge.
     *
     * @param challenge the challenge string
     *
     * @throws MalformedChallengeException is thrown if the authentication challenge
     *                                     is malformed
     *
     * @since 3.0
     */
    public void processChallenge(final String challenge) throws MalformedChallengeException {
        String s = AuthChallengeParser.extractScheme(challenge);
        LOG.trace("processChallenge called. challenge: " + challenge + " scheme: " + s);
        if (!s.equalsIgnoreCase(getSchemeName())) {
            LOG.trace("Invalid scheme name in challenge. Should be: " + getSchemeName());
            throw new MalformedChallengeException("Invalid NTLM challenge: " + challenge);
        }
        int i = challenge.indexOf(' ');
        if (i != -1) {
            LOG.trace("processChallenge: TYPE2 message received");
            s = challenge.substring(i, challenge.length());
            this.ntlmchallenge = s.trim();
            this.state = TYPE2_MSG_RECEIVED;
        } else {
            this.ntlmchallenge = "";
            if (this.state == UNINITIATED) {
                this.state = INITIATED;
                LOG.trace("State was UNINITIATED, switching to INITIATED");
            } else {
                LOG.trace("State is FAILED");
                this.state = FAILED;
            }
        }
    }

    private class NTLM {
        /** Character encoding */
        public static final String DEFAULT_CHARSET = "ASCII";

        /**
         * The charset was used by 3.x's NTLM to encode the username and password.
         * Apparently, this is not needed when passing username and password from
         * NTCredentials to the JCIFS library.
         */
        private String credentialCharset = DEFAULT_CHARSET;

        void setCredentialCharset(String credentialCharset) {
            this.credentialCharset = credentialCharset;
        }

        private String generateType1Msg(String host, String domain) {
            jcifs.ntlmssp.Type1Message t1m = new jcifs.ntlmssp.Type1Message(
                    jcifs.ntlmssp.Type1Message.getDefaultFlags(), domain, host);
            String result = jcifs.util.Base64.encode(t1m.toByteArray());
            LOG.trace("generateType1Msg: " + result);
            return result;
        }

        private String generateType3Msg(String username, String password, String host, String domain,
                String challenge) {
            jcifs.ntlmssp.Type2Message t2m;
            try {
                t2m = new jcifs.ntlmssp.Type2Message(jcifs.util.Base64.decode(challenge));
            } catch (IOException e) {
                throw new RuntimeException("Invalid Type2 message", e);
            }

            jcifs.ntlmssp.Type3Message t3m = new jcifs.ntlmssp.Type3Message(t2m, password, domain, username, host,
                    0);
            String result = jcifs.util.Base64.encode(t3m.toByteArray());
            LOG.trace("generateType3Msg username: [" + username + "] host: [" + host + "] domain: [" + domain
                    + "] response: [" + result + "]");
            return result;
        }
    }
}
answered 2019-07-23T18:35:37.943