python - Google App Engine - Unable to correctly receive a gzipped HTML file

Question

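Python and Google App Engine experts,

I am trying to retrieve the TD Bank mortgage rates page at this link:

"http://tdbank.mortgagewebcenter.com/Default.asp"

I learned Python and Google App Engine from tutorials this evening, but I keep getting stuck on what I think may be a GZIP problem.

Ideally, I would like someone to fix the code I have pasted below, or, if it is easier, provide correct code that successfully retrieves the page so that it can be parsed in Python on Google App Engine.

Attempt 1 - URLFETCH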

import webapp2
import gzip

import StringIO

from google.appengine.api import users
from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup

class MainPage(webapp2.RequestHandler):
    def get(self):
        url = "http://tdbank.mortgagewebcenter.com/Default.asp"
        result = urlfetch.fetch(url=url,headers={'User-Agent': 'Mozilla/5.0',
                                                 'Accept': 'text/html',
                                                 'Accept-Language': 'en-us,en',
                                                 'Accept-Encoding': 'gzip',
                                                 'Connection': 'keep-alive'})
        f = StringIO.StringIO(result.content)
        c = gzip.GzipFile(fileobj=f)
        content = c.read()
        self.response.out.write(content)

app = webapp2.WSGIApplication([('/', MainPage)],
                              debug=True)

Attempt 2 - URLLIB2

import cgi
import webapp2
import gzip
import StringIO
import urllib2
import httplib

from BeautifulSoup import BeautifulSoup

class MainPage(webapp2.RequestHandler):
    def get(self):
        httplib.HTTPConnection.debuglevel = 1
        url = urllib2.Request('http://tdbank.mortgagewebcenter.com/Default.asp')
        url.add_header('Accept-encoding', 'gzip')
        url.add_header('User-Agent', 'Mozilla/5.0')
        opener = urllib2.build_opener()
        f = opener.open(url)
        compresseddata = f.read()
        compressedstream = StringIO.StringIO(compresseddata)
        c = gzip.GzipFile(fileobj=compressedstream)
        content = c.read()
        self.response.out.write(content)

app = webapp2.WSGIApplication([('/', MainPage)],
                              debug=True)

YAML file:

application: fimrates
version: 2
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: fimrates.app

In both cases, my browser is redirected to

http://localhost:8080/Default.asp?bhjs=1&bhqs=1

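If I change the URL I am trying to read to another web page, such as www.google.com, the output prints correctly.

Thanks in advance for your help; I really appreciate it.

-Todd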

score 0 · Accepted Answer

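The URL you posted performs its redirect in JavaScript. The only way to get the final page is to emulate a browser, which IMO is not possible on GAE.

I downloaded the HTML with curl -L http://tdbank.mortgagewebcenter.com/Default.asp, and it gave me "Unsupported browser". This means the page checks the browser type in JavaScript.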

http://tinypic.com/r/aa7tqd/6
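
A rough way to confirm this from the fetched HTML is to look for script or meta-refresh redirects, which a plain HTTP client will not follow. The sketch below is only a heuristic: the function name and the pattern list are illustrative assumptions, and since the actual script on the TD Bank page is not shown here, it may use a form these patterns miss.

```python
import re

# Patterns that commonly indicate a client-side redirect.
# This list is an illustrative assumption, not an exhaustive catalogue.
_REDIRECT_PATTERNS = [
    # window.location = ..., document.location.href = ..., location.href = ...
    re.compile(r"(?:window\.|document\.)?location(?:\.href)?\s*=", re.I),
    # location.replace(...) or location.assign(...)
    re.compile(r"location\.(?:replace|assign)\s*\(", re.I),
    # <meta http-equiv="refresh" ...>
    re.compile(r"<meta[^>]+http-equiv\s*=\s*[\"']?refresh", re.I),
]


def looks_like_client_side_redirect(html):
    """Return True if the HTML appears to redirect via script or meta refresh."""
    return any(pattern.search(html) for pattern in _REDIRECT_PATTERNS)
```

If this returns True for the body you fetched, writing that body to your own response makes the visitor's browser run the redirect against your host, which would be consistent with ending up at http://localhost:8080/Default.asp?bhjs=1&bhqs=1.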

score 0 · Accepted Answer


In your fetch call, try adding the parameter "follow_redirects=True".

Answered on 2012-08-26T22:42:41.457
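
Separately from redirects, both handlers in the question pass the response body straight to gzip.GzipFile, which raises an error if the server did not actually compress the body. A minimal sketch, assuming a hypothetical helper named maybe_gunzip, that decompresses only when the body starts with the gzip magic bytes:

```python
import gzip
import io

# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Decompress body if it looks like gzip data; otherwise return it unchanged."""
    if body[:2] != GZIP_MAGIC:
        return body
    return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
```

In the urlfetch handler this would replace the StringIO and GzipFile lines with content = maybe_gunzip(result.content).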
