I am trying to fetch all tweets from a specific user:

import json
from twarc import Twarc2

def get_all_tweets(user_id, DEBUG):
    # Your bearer token here
    t = Twarc2(bearer_token="blah")

    # Initialize a list to hold all the tweepy Tweets
    alltweets = []
    new_tweets = {}

    if DEBUG:
        # Debug: read from file
        f = open('tweets_debug.txt',)
        new_tweets = json.load(f)
        alltweets.extend(new_tweets)
    else:
        # make initial request for most recent tweets (3200 is the maximum allowed count)
        new_tweets = t.timeline(user=user_id)
        # save most recent tweets
        alltweets.extend(new_tweets)

    if DEBUG:
        # Debug: write to file
        f = open("tweets_debug.txt", "w")
        f.write(json.dumps(alltweets, indent=2, sort_keys=False))
        f.close()

    # Save the id of the oldest tweet less one
    oldest = str(int(alltweets[-1]['meta']['oldest_id']) - 1)

    # Keep grabbing tweets until there are no tweets left to grab
    while len(dict(new_tweets)) > 0:
        print(f"getting tweets before {oldest}")
        
        # All subsequent requests use the until_id param to prevent duplicates
        new_tweets = t.timeline(user=user_id,until_id=oldest)
        
        # Save most recent tweets
        alltweets.extend(new_tweets)
        
        # Update the id of the oldest tweet less one
        oldest = str(int(alltweets[-1]['meta']['oldest_id']) - 1)
        
        print(f"...{len(alltweets)} tweets downloaded so far")
    
    res = []
    for tweetlist in alltweets:
        res.extend(tweetlist['data'])
    
    f = open("output.txt", "w")
    f.write(json.dumps(res, indent=2, sort_keys=False))
    f.close()
    
    return res

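However, len(dict(new_tweets)) does not work. It always returns 0. sum(1 for dummy in new_tweets) also returns 0.

I also tried json.load(new_tweets), and that does not work either.

However, alltweets.extend(new_tweets) works fine.

It seems that timeline() returns a generator (<generator object Twarc2._timeline at 0x000001D78B3D8B30>). Is there any way to count its length, so I can tell whether there are more tweets left to fetch?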

Alternatively, is there any way to merge...

someList = []
someList.extend(new_tweets)
while len(someList) > 0:
    # blah blah

...into a single line together with the while?


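Edit: I tried print(list(new_tweets)) right before the while loop, and it returns []. It looks like the object is actually empty at that point.

Is that because alltweets.extend(new_tweets) somehow consumed the new_tweets generator...?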
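
A quick standalone check, independent of twarc, is consistent with that suspicion. The pages() function below is a hypothetical stand-in for any call that returns a generator; it is not part of twarc.

```python
def pages():
    # Hypothetical stand-in for a paginated API call; yields two "pages".
    yield {"data": [1, 2]}
    yield {"data": [3]}

gen = pages()

collected = []
collected.extend(gen)   # extend() iterates the generator to its end

leftover = list(gen)    # a second pass finds nothing: the generator is exhausted

print(len(collected), leftover)
```

After extend() has run, anything else that iterates the same generator object (list(), dict(), sum(...)) sees it as empty.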

1 Answer

0

I figured it out myself. The problem can be solved by converting the generator to a list:

new_tweets = list(t.timeline(user=user_id,until_id=oldest))
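
As a rough sketch of how the whole loop might look with that change: fake_timeline() below is a hypothetical stand-in for t.timeline() so the example runs without credentials, and its page layout ('data' and 'meta'/'oldest_id') and until_id handling are assumptions modelled on the code in the question, not on twarc itself.

```python
def fake_timeline(until_id=None):
    """Hypothetical stand-in for t.timeline(); yields response pages."""
    # Pretend the user has tweets with ids 10..1, newest first.
    ids = [i for i in range(10, 0, -1)
           if until_id is None or i <= int(until_id)]
    # Yield them in pages of three, each with the oldest id in its meta.
    for start in range(0, len(ids), 3):
        chunk = ids[start:start + 3]
        yield {
            "data": [{"id": str(i)} for i in chunk],
            "meta": {"oldest_id": str(chunk[-1])},
        }

def fetch_all(timeline):
    all_pages = []
    # Materialize the generator once, so it can be both counted and stored.
    new_pages = list(timeline())
    while new_pages:
        all_pages.extend(new_pages)
        # Continue from just below the oldest id seen so far.
        oldest = str(int(new_pages[-1]["meta"]["oldest_id"]) - 1)
        new_pages = list(timeline(until_id=oldest))
    # Flatten the pages into a single list of tweets.
    return [tweet for page in all_pages for tweet in page["data"]]

tweets = fetch_all(fake_timeline)
print(len(tweets))
```

Because new_pages is a real list, `while new_pages:` works as the emptiness check, and the loop never indexes into an empty result. On Python 3.8+, the fetch and the check can also be folded into the loop header with an assignment expression, for example `while (new_pages := list(timeline(until_id=oldest))):`.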
Answered 2021-10-16T21:10:27.787