-1

我试图使用 TumblrAPI,具体来说是 PyTumblr,以抓取带有特定标签的帖子中的一些图像,

这就是我使用的代码,非常简单:

import pytumblr
from bs4 import BeautifulSoup

# Authenticate via API Key
client = pytumblr.TumblrRestClient('#Here is my API Key#')
print client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)

所以结果是这样的:

{
  "meta": {
    "status": 200,
    "msg": "OK"
  },
  "response": {
    "blog": {
      "title": "W é r G i d A",
      "name": "wergida",
      "total_posts": 1181,
      "posts": 1181,
      "url": "http://wergida.tumblr.com/",
      "updated": 1466319493,
      "description": "Ha bárkit érdekelne",
      "is_nsfw": false,
      "ask": false,
      "ask_page_title": "Ask me anything",
      "ask_anon": false,
      "share_likes": true,
      "likes": 1131
    },
    "posts": [
      {
        "blog_name": "wergida",
        "id": 136740690571,
        "post_url": "http://wergida.tumblr.com/post/136740690571/bernhard-bernd-becher-1931-2007-and-hilla",
        "slug": "bernhard-bernd-becher-1931-2007-and-hilla",
        "type": "photo",
        "date": "2016-01-06 11:30:23 GMT",
        "timestamp": 1452079823,
        "state": "published",
        "format": "html",
        "reblog_key": "TiOl8nWT",
        "tags": [
          "industrial facades",
          "bernd and hilla becher",
          "photography",
          "eisenhüttenstadt",
          "brandenburg"
        ],
        "short_url": "https://tmblr.co/ZaE70t1-MOLgB",
        "summary": "Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT...",
        "recommended_source": null,
        "recommended_color": null,
        "highlighted": [],
        "note_count": 2,
        "caption": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br/></p>",
        "reblog": {
          "tree_html": "",
          "comment": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>"
        },
        "trail": [
          {
            "blog": {
              "name": "wergida",
              "active": true,
              "theme": {
                "avatar_shape": "square",
                "background_color": "#FAFAFA",
                "body_font": "Helvetica Neue",
                "header_bounds": "",
                "header_image": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
                "header_image_focused": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
                "header_image_scaled": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
                "header_stretch": true,
                "link_color": "#529ECC",
                "show_avatar": true,
                "show_description": true,
                "show_header_image": true,
                "show_title": true,
                "title_color": "#444444",
                "title_font": "Gibson",
                "title_font_weight": "bold"
              },
              "share_likes": true,
              "share_following": false
            },
            "post": {
              "id": "136740690571"
            },
            "content_raw": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>",
            "content": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br /></p>",
            "is_current_item": true,
            "is_root_item": true
          }
        ],
        "image_permalink": "http://wergida.tumblr.com/image/136740690571",
        "photos": [
          {
            "caption": "",
            "alt_sizes": [
              {
                "url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
                "width": 1280,
                "height": 973
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_500.jpg",
                "width": 500,
                "height": 380
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_400.jpg",
                "width": 400,
                "height": 304
              },
              {
                "url": "https://65.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_250.jpg",
                "width": 250,
                "height": 190
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_100.jpg",
                "width": 100,
                "height": 76
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_75sq.jpg",
                "width": 75,
                "height": 75
              }
            ],
            "original_size": {
              "url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
              "width": 1280,
              "height": 973
            }
          }
        ]
      }
    ],
    "total_posts": 223
  }
}

但是当我使用 BeautifulSoup 解析我得到的信息时:

soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")

我懂了:

Traceback (most recent call last):
File "tumblr_test.py", line 29, in <module>
soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")
File "/Users/CB/Public/scrapy/env/lib/python2.7/site-packages/bs4/__init__.py", line 199, in __init__
if markup[:5] == "http:" or markup[:6] == "https:":
TypeError: unhashable type

而且我尝试了不同的解析器,如“html.parser”“html5lib”,仍然得到同样的错误。

感谢您提供任何线索!

4

1 回答 1

4

client.post()调用返回一个Python 字典,而不是包含 HTML 的字符串;它已经为您解析了 JSON 响应。因为 BeautifulSoup 试图将它视为一个字符串,所以你会得到错误,因为它作为slice object:5传递给字典,这是不可散列的:

>>> {}[:5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type

字典不是 HTML。无需尝试使用 BeautifulSoup 解析它。只需访问嵌套结构中的单个数据元素;如果这样的元素本身是一个字符串并且该字符串包含 HTML 标记,那么解析该特定数据可能是有意义的:

response = client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)
post = response['response']['posts'][0]
soup = BeautifulSoup(post['caption'])
于 2016-06-25T14:08:03.600 回答