Before laying my question bare, some context is needed. I'm trying to issue HTTP GET and POST requests to a website, with the following caveats:

  • Redirects are expected
  • Cookies are required
  • Requests must pass through a SOCKS proxy (v4a)

Up until now, I've been using twisted.web.client.Agent and its subclasses (e.g. BrowserLikeRedirectAgent), but unfortunately SOCKS proxies don't seem to be supported yet (and ProxyAgent is a no-go because that class is for HTTP proxies).

I stumbled upon twisted-socks, which seems to let me do what I want, but I noticed that it uses HTTPClientFactory instead of Agent. Hence my question: what is the difference between HTTPClientFactory and Agent, and when should I use each one?

Below is some example code using twisted-socks. I have two additional questions:

  1. How can I use cookies in this example? I tried passing a dict and a cookielib.CookieJar instance to HTTPClientFactory's cookies kwarg, but this raises an error (something about a string being expected... how on earth do I send cookies as a string?)

  2. Can this code be refactored to use Agent? This would be ideal, as I already have a reasonably large codebase that is written with Agent in mind.

```

import sys
from urlparse import urlparse
from twisted.internet import reactor, endpoints
from socksclient import SOCKSv4ClientProtocol, SOCKSWrapper
from twisted.web import client

class mything:
    def __init__(self):
        self.npages = 0
        self.timestamps = {}

    def wrappercb(self, proxy):
        print "connected to proxy", proxy

    def clientcb(self, content):
        print "ok, got: %s" % content[:120]
        print "timestamps " + repr(self.timestamps)
        self.npages -= 1
        if self.npages == 0:
            reactor.stop()

    def sockswrapper(self, proxy, url):
        dest = urlparse(url)
        assert dest.port is not None, 'Must specify port number.'
        endpoint = endpoints.TCP4ClientEndpoint(reactor, dest.hostname, dest.port)
        return SOCKSWrapper(reactor, proxy[1], proxy[2], endpoint, self.timestamps)

def main():
    thing = mything()

    # Mandatory first argument is a URL to fetch over Tor (or whatever
    # SOCKS proxy that is running on localhost:9050).
    url = sys.argv[1]
    proxy = (None, 'localhost', 9050, True, None, None)

    f = client.HTTPClientFactory(url)
    f.deferred.addCallback(thing.clientcb)
    sw = thing.sockswrapper(proxy, url)
    d = sw.connect(f)
    d.addCallback(thing.wrappercb)
    thing.npages += 1

    reactor.run()

if '__main__' == __name__:
    main()

```

1 Answer

I don't think you would normally use HTTPClientFactory directly: it is quite low-level and does nothing more than perform a single HTTP request.

If you just want to fire off a request, there are helper functions (twisted.web.client.getPage and downloadPage) that build the factory for you and handle both HTTP and HTTPS.

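Agent gives you a higher level of abstraction: it keeps a pool of connections, chooses between HTTP and HTTPS based on the URL, handles proxies, and so on. So yes, it is what you usually want to use.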

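The two don't seem to share much code: getPage sits on top of HTTPClientFactory (and its protocol, HTTPPageGetter), which is the older implementation, whereas Agent is the public API of the newer code in twisted.web._newclient (HTTP11ClientProtocol and its factory, HTTP11ClientFactory). So there are two parallel stacks, presumably for historical reasons and backward compatibility.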

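In any case, twisted-socks doesn't mix well with Agent out of the box, because its API is broken. SOCKSWrapper declares that it implements the IStreamClientEndpoint interface, but that interface requires the .connect method to return a Deferred that fires with an IProtocol provider (see the documentation), whereas SOCKSWrapper returns a Deferred that fires with an address (this is the line that does it). It looks like you can easily change that line to: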

```
self.handshakeDone.callback(self.transport.protocol)
```

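Once you make that change to SOCKSWrapper, you should be able to use Agent. Here is an example (it uses inlineCallbacks and the new react, but you could just as well use plain .addCallback-style Deferreds and reactor.run()):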

```
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
from twisted.web.client import ProxyAgent, readBody

from socksclient import SOCKSWrapper

@react
@inlineCallbacks
def main(reactor):
    # Endpoint for the final destination; SOCKSWrapper reaches it via the proxy.
    target = TCP4ClientEndpoint(reactor, 'example.com', 80)
    proxy = SOCKSWrapper(reactor, 'localhost', 9050, target)
    # ProxyAgent is used here because it accepts an arbitrary endpoint.
    agent = ProxyAgent(proxy)
    # Bytes literals are valid on both Python 2 and Python 3.
    response = yield agent.request(b'GET', b'http://example.com/')
    print((yield readBody(response)))
```

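There is also the txsocksx library, which seems nicer to use (and is pip-installable!). The API is nearly the same, but the roles of the arguments are swapped: you pass the target host and port plus an endpoint for the proxy, rather than the proxy host and port plus an endpoint for the target: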

```
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
from twisted.web.client import ProxyAgent, readBody

from txsocksx.client import SOCKS5ClientEndpoint

@react
@inlineCallbacks
def main(reactor):
    # Endpoint for the SOCKS proxy itself.
    proxy = TCP4ClientEndpoint(reactor, 'localhost', 9050)
    # Target host and port come first; the proxy endpoint is passed last.
    proxied_endpoint = SOCKS5ClientEndpoint('example.com', 80, proxy)
    agent = ProxyAgent(proxied_endpoint)
    # Bytes literals are valid on both Python 2 and Python 3.
    response = yield agent.request(b'GET', b'http://example.com/')
    print((yield readBody(response)))
```

answered 2013-08-01T07:32:04.553