python - Print more than 20 posts from the Tumblr API

Question

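Good afternoon,

I'm very new to Python, but I'm trying to write a program that lets me download all posts (including their "notes") from a specified Tumblr account to my computer.

Given my lack of coding experience, I tried to find a ready-made script that would do this. I found several excellent scripts on GitHub, but none of them actually return the notes on Tumblr posts (as far as I can tell; please correct me if anyone knows otherwise!).

So I tried to write my own script. I've had some success with the code below. It prints the 20 most recent posts from a given Tumblr (albeit in a rather ugly format: essentially hundreds of lines of text all printed onto a single line of a notepad file):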

# This script prints all the posts (including tags, comments) and also the
# first 20 notes from all the Tumblr blogs.

import pytumblr

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')

#offset = 0

# Make the request
client.posts('staff', limit=2000, offset=0, reblog_info=True, notes_info=True,
             filter='html')
# print out into a .txt file
with open('out.txt', 'w') as f:
    print >> f, client.posts('staff', limit=2000, offset=0, reblog_info=True,
                             notes_info=True, filter='html')

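However, I would like the script to keep printing posts until it reaches the end of the specified blog.

I searched this site and found a very similar question (Getting only 20 posts returned through PyTumblr), which was answered by Stack Overflow user poke. However, I can't seem to actually implement poke's solution so that it works for my data. In fact, when I run the following script, no output is produced at all.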

import pytumblr

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')
blog = ('staff')
def getAllPosts (client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return

        for post in posts:
            yield post

        offset += 20

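I should note that there are several posts on this site about Tumblr notes (for example, Getting more than 50 notes with Tumblr API), most of which ask how to download more than 50 notes per post. I'm perfectly happy with only 50 notes per post; it is the number of posts that I want to increase.

Also, I've tagged this post as Python, but if there is a better way to get the data I need using another programming language, that would be more than fine.

Thank you very much for your time!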

score 3 · Accepted Answer

tl;dr If you just want to see the answer, it is at the bottom, after the heading "A Corrected Version".

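The second code snippet is a generator that yields posts one by one, so you have to use it as part of something like a loop and then do something with the output. Here is your code with some extra code that iterates over the generator and prints the data it returns.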

import pytumblr

def getAllPosts (client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return

        for post in posts:
            yield post

        offset += 20

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')
blog = ('staff')

# use the generator getAllPosts
for post in getAllPosts(client, blog):
    print(post)

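However, there are a couple of bugs in that code. getAllPosts won't yield just each post; it will also return other things, because it iterates over the whole API response, as you can see from this example that I ran in my ipython shell.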

In [7]: yielder = getAllPosts(client, 'staff')

In [8]: next(yielder)
Out[8]: 'blog'

In [9]: next(yielder)
Out[9]: 'posts'

In [10]: next(yielder)
Out[10]: 'total_posts'

In [11]: next(yielder)
Out[11]: 'supply_logging_positions'

In [12]: next(yielder)
Out[12]: 'blog'

In [13]: next(yielder)
Out[13]: 'posts'

In [14]: next(yielder)
Out[14]: 'total_posts'
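
The same behaviour can be reproduced without calling the API at all. The following is a minimal sketch that uses a hand-written dictionary as a stand-in for the response; its values are made up for illustration and are not real Tumblr data.

```python
# Hand-written stand-in for the dictionary returned by client.posts();
# the values are made up for illustration.
response = {
    'blog': {'name': 'staff'},
    'posts': [{'id': 1}, {'id': 2}],
    'total_posts': 2,
}

# Iterating over a dict yields its keys, not the posts.
keys = [item for item in response]
print(keys)

# The posts themselves live under the 'posts' key.
post_ids = [post['id'] for post in response['posts']]
print(post_ids)
```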

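This happens because the posts object in getAllPosts is a dictionary that contains much more than just each post on the staff blog: it also contains items such as the number of posts the blog contains, the blog's description, when it was last updated, and so on. The code as-is could potentially result in an infinite loop, because the following condition: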

if not posts:
    return

will never be true because of the response structure: an empty Tumblr API response from pytumblr looks like this:

{'blog': {'ask': False,
  'ask_anon': False,
  'ask_page_title': 'Ask me anything',
  'can_send_fan_mail': False,
  'can_subscribe': False,
  'description': '',
  'followed': False,
  'is_adult': False,
  'is_blocked_from_primary': False,
  'is_nsfw': False,
  'is_optout_ads': False,
  'name': 'asdfasdf',
  'posts': 0,
  'reply_conditions': '3',
  'share_likes': False,
  'subscribed': False,
  'title': 'Untitled',
  'total_posts': 0,
  'updated': 0,
  'url': 'https://asdfasdf.tumblr.com/'},
 'posts': [],
 'supply_logging_positions': [],
 'total_posts': 0}

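if not posts is checked against that whole structure rather than against the posts field (which is an empty list here), so the condition is never met, because the response dictionary is not empty (see: Truth Value Testing in Python).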
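
The difference between the two checks can be seen in a small sketch. The dictionary below is a trimmed, hand-written stand-in for the empty response shown above, not actual API output.

```python
# Hand-written stand-in for an "empty" response; values are illustrative.
empty_response = {'blog': {'name': 'asdfasdf'}, 'posts': [], 'total_posts': 0}

# A dict that has any keys is truthy, so this check never ends the loop.
whole_response_is_falsy = not empty_response

# An empty list is falsy, so this check ends the loop as intended.
posts_list_is_falsy = not empty_response['posts']

print(whole_response_is_falsy, posts_list_is_falsy)
```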

A Corrected Version

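The code below (mostly tested/verified) fixes the loop in your getAllPosts implementation and then uses the function to retrieve the posts and dump them to a file named (BLOG_NAME)-posts.txt.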

import pytumblr


def get_all_posts(client, blog):
    offset = 0
    while True:
        response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)

        # Get the 'posts' field of the response        
        posts = response['posts']

        if not posts: return

        for post in posts:
            yield post

        # move to the next offset
        offset += 20


client = pytumblr.TumblrRestClient('secrety-secret')
blog = 'staff'

# use our function
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print >>out_file, post
        # if you're in python 3.x, use the following
        # print(post, file=out_file)

This will just be a straight text dump of the API's posts response, so if you need it to look nicer or anything else, that's up to you.
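
One possible way to make the dump easier to work with is to write one JSON object per line instead of the raw Python representation. This is only a sketch for Python 3: write_posts_as_json_lines is a hypothetical helper name, the sample posts are made up, and it assumes each post is a plain dictionary that json can serialize.

```python
import io
import json


def write_posts_as_json_lines(posts, out_file):
    """Write each post dict as one JSON object per line; return the count."""
    count = 0
    for post in posts:
        out_file.write(json.dumps(post, sort_keys=True) + '\n')
        count += 1
    return count


# Stand-in data so the sketch runs without an API key (made-up values).
sample_posts = [{'id': 1, 'note_count': 3}, {'id': 2, 'note_count': 0}]
buffer = io.StringIO()
written = write_posts_as_json_lines(sample_posts, buffer)
print(buffer.getvalue())
```

With real data, any iterable of post dictionaries and a file opened for writing could be passed in place of sample_posts and buffer. Each line of the output can then be read back with json.loads.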
