python - Python: counting the number of images and videos a user has used in tweets

Question

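I scraped Twitter data without using tweepy, and I want to get the number of images/videos each user has used in their tweets. What I have: the tweet URL (https://twitter.com/user_screen_name/status/tweet_id), the user_id, and the tweet itself (text + links + media).

What I want to do is check whether a tweet contains a video and, if so, count it, and do the same for images. I noticed that the links used in tweets start with "../t.co..", so they are basically redirect links. Also, the images/videos shown in a tweet are basically the images/videos contained in those redirect links (that is my understanding).

I tried this code to count images, but it did not give me a useful result: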

from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_image_count(url):
    # parse the HTML returned for the given URL
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    images = soup.find_all('img')
    file_types = ('jpg', 'jpeg', 'png')
    # loop through all img elements found and keep those whose src has a matching extension
    urls = [x for x in images if x.get('src', '').split('.')[-1] in file_types]
    print(urls)
    return len(urls)

When I run this code with link = 'https://twitter.com/fritzlabs/status/1369661296162054145', this is the output I get:

[<img alt="Twitter" height="38" src="https://abs.twimg.com/errors/logo46x38.png" srcset="https://abs.twimg.com/errors/logo46x38.png 1x, https://abs.twimg.com/errors/logo46x38@2x.png 2x" width="46"/>]

1

Can anyone help with this? I tried other code as well but got the same output. Thank you.

score 1 · Accepted Answer

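This happens because the HTML returned by the request is not the tweet but a warning that JavaScript is disabled. It is not a fault in your script: the same thing happens when you make the request in a browser, whether or not JavaScript is enabled.

When a browser requests your example tweet, the JavaScript-disabled HTML is returned first; the JavaScript then runs and loads the actual tweet.

To see this in action, open Chrome or Firefox, press F12 and go to the Network tab. Visit your page. The first request is the same one you made in Python, i.e. tweet 1369661296162054145. If you look at the preview of that request's response, you will see the JavaScript warning.

Further down in the Network tab you will see a request for 1369661296162054145.json. This is the request that returns the actual tweet, and it is the one you need to replicate.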
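Once you have the tweet data as parsed JSON, the counting itself is straightforward. The following is only a minimal sketch, not a tested scraper: it assumes each tweet is a dict shaped like a Twitter v1.1 tweet object, where attached media sit under extended_entities -> media and each item has a type of "photo", "video" or "animated_gif". The key "user_id_str", the function names and the sample data are illustrative assumptions; check them against what the .json response actually contains.

```python
from collections import Counter, defaultdict


def count_media(tweet):
    """Return a Counter of media types attached to one tweet dict.

    Assumes the v1.1-style layout: tweet["extended_entities"]["media"]
    is a list of dicts, each with a "type" key.
    """
    media = (tweet.get("extended_entities") or {}).get("media") or []
    return Counter(item.get("type", "unknown") for item in media)


def count_media_per_user(tweets):
    """Sum media counts per user over an iterable of tweet dicts."""
    totals = defaultdict(Counter)
    for tweet in tweets:
        # "user_id_str" is an assumed key; adjust it to match your data.
        totals[tweet.get("user_id_str", "unknown")] += count_media(tweet)
    return dict(totals)


# Hand-written sample data for illustration only, not real API output.
sample_tweets = [
    {"user_id_str": "1",
     "extended_entities": {"media": [{"type": "photo"}, {"type": "photo"}]}},
    {"user_id_str": "1",
     "extended_entities": {"media": [{"type": "video"}]}},
    {"user_id_str": "2", "full_text": "no media here"},
]

if __name__ == "__main__":
    for user_id, counts in count_media_per_user(sample_tweets).items():
        print(user_id, dict(counts))
```

Tweets without any media simply contribute an empty count, so users who never posted media still appear in the result with no entries.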

score 0 · Accepted Answer

So I tried using Selenium with the PhantomJS driver, as suggested in some posts I looked at. Here is the code I tried:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotVisibleException
import requests

link = 'https://twitter.com/fritzlabs/status/1369661296162054145'
driver = webdriver.PhantomJS()
driver.get(link)
image_src = driver.find_element_by_tag_name('img').get_attribute('src')
print(image_src)
response = requests.get(image_src).content
print(response)

I tried printing image_src to get an idea of what it contains. When I run the code, this is what I get:

NoSuchElementException: Message: {"errorMessage":"Unable to find element with tag name 'img'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"90","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63767","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"tag name\", \"value\": \"img\", \"sessionId\": \"5bba45c0-8279-11eb-b30c-d7ded72a9eb3\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/5bba45c0-8279-11eb-b30c-d7ded72a9eb3/element"}}
Screenshot: available via screen

I'm really not familiar with Selenium and not very familiar with BeautifulSoup, so I'd appreciate any help. Thank you.
