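from google.appengine.api import urlfetch
from google.appengine.ext import webapp

class sss(webapp.RequestHandler):
  def get(self):
    url = "http://www.google.com/"
    result = urlfetch.fetch(url)
    if result.status_code == 200:
      # Write the fetched bytes back unchanged.
      self.response.out.write(result.content)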

When I change the code to this:

    if result.status_code == 200:
      self.response.out.write(result.content.decode('utf-8').encode('gb2312'))

it displays garbled output. What should I do?

When I use this:

self.response.out.write(result.content.decode('big5'))

the page looks different from the one I see at Google.com.

How do I get the Google.com page that I see?

2 Answers

Google is probably serving you ISO-8859-1. At least, that is what it serves me for the user agent "AppEngine-Google; (+http://code.google.com/appengine)", which is the one urlfetch uses. The Content-Type header value is:

text/html; charset=ISO-8859-1

So you would use:

result.content.decode('ISO-8859-1')

If you check result.headers["Content-Type"], your code can adapt to changes on the other end. You can usually pass the charset (ISO-8859-1 in this case) directly to Python's decode method.
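A minimal sketch of that idea, assuming the header looks like the one shown above. The helper names charset_from_content_type and decode_body are made up for illustration, and the ISO-8859-1 fallback is an assumption rather than anything urlfetch does for you:

```python
from email.message import Message


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumed fallback) when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


def decode_body(body, content_type, default="ISO-8859-1"):
    """Decode raw response bytes using the charset the server declared."""
    charset = charset_from_content_type(content_type, default)
    try:
        return body.decode(charset, "replace")
    except LookupError:
        # Python does not know a codec by that name; use the fallback.
        return body.decode(default, "replace")
```

In the handler this would be called as decode_body(result.content, result.headers["Content-Type"]). The returned text still has to be encoded again, in whatever charset your own response declares, before it is written out.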

answered 2010-05-22T10:51:51.553

How do I get the Google.com page that I see?

It's probably using relative URLs for images, JavaScript, CSS, etc., that you're not rewriting as absolute URLs pointing at Google's site. To confirm this, check your logs: they should show 404 ("page not found") errors as the browser, to which you're serving just the HTML, tries to locate the relatively addressed resources that you're not supplying.
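One way to handle this without rewriting every URL is to insert a <base> tag so the browser resolves relative URLs against the original site. This is a different technique from rewriting each link, shown only as a rough sketch; add_base_href is a hypothetical helper, and the regex assumes a conventional <head> tag:

```python
import re


def add_base_href(html, base_url):
    """Insert <base href="..."> right after the opening <head> tag.

    If no <head> tag is found, the tag is prepended to the document,
    which browsers generally tolerate (an assumption, not a guarantee).
    """
    base_tag = '<base href="%s">' % base_url
    # Match <head> or <head ...attrs>, but not <header>.
    pattern = r"(<head(?:\s[^>]*)?>)"
    new_html, count = re.subn(
        pattern,
        lambda m: m.group(1) + base_tag,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        new_html = base_tag + html
    return new_html
```

For the handler in the question, this would be applied to the decoded page text with base_url set to "http://www.google.com/" before writing the response. Resources referenced from scripts may still fail to load.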

answered 2010-05-22T17:30:02.513