The web page (the target YouTube example) shows 702 comments.
I wrote a function, get_total_youtube_comments(url), with much of the code copied from a project on GitHub.

def get_total_youtube_comments(url):
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import time
    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options,executable_path='/usr/bin/chromedriver')
    wait = WebDriverWait(driver,60)
    driver.get(url)
    SCROLL_PAUSE_TIME = 2
    CYCLES = 7
    html = driver.find_element_by_tag_name('html')
    html.send_keys(Keys.PAGE_DOWN)   
    html.send_keys(Keys.PAGE_DOWN)   
    time.sleep(SCROLL_PAUSE_TIME * 3)
    for i in range(CYCLES):
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    all_comments = [elem.text for elem in comment_elems]
    return  all_comments

I tried to parse all the comments on the sample page https://www.youtube.com/watch?v=N0lxfilGfak:

url='https://www.youtube.com/watch?v=N0lxfilGfak'
list = get_total_youtube_comments(url)

It gets some comments, but only a small fraction of them.

len(list)
60

60 is far fewer than 702. How can I get all the comments on a YouTube page with Selenium?
@supputuri, I can extract all the comments with your code.

comments_list = driver.find_elements_by_xpath("//*[@id='content-text']")
len(comments_list)
709
print(driver.find_element_by_xpath("//h2[@id='count']").text)
717 Comments
comments_list[-1].text
'mistake at 23:11 \nin NOT it should return false if x is true.'
comments_list[0].text
'Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Python Course curriculum, Visit our Website:  Use code "YOUTUBE20" to get Flat 20% off on this training.'

Why is the comment count 709 rather than the 717 shown on the page?

4 Answers


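You are getting a limited number of comments because YouTube loads comments only as you keep scrolling down. There are around 394 comments on that video; you must first make sure all of them are loaded, and then expand every View Replies link, to reach the maximum comment count.

Note: I was able to get 700 comments using the lines of code below.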

# get the last comment
lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
# scroll to the last comment currently loaded
lastEle.location_once_scrolled_into_view
# wait until the comments loading is done
WebDriverWait(driver,30).until(EC.invisibility_of_element((By.CSS_SELECTOR,"div.active.style-scope.paper-spinner")))

# load all comments
while lastEle != driver.find_element_by_xpath("(//*[@id='content-text'])[last()]"):
    lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
    driver.find_element_by_xpath("(//*[@id='content-text'])[last()]").location_once_scrolled_into_view
    time.sleep(2)
    WebDriverWait(driver,30).until(EC.invisibility_of_element((By.CSS_SELECTOR,"div.active.style-scope.paper-spinner")))

# open all replies
for reply in driver.find_elements_by_xpath("//*[@id='replies']//paper-button[@class='style-scope ytd-button-renderer'][contains(.,'View')]"):
    reply.location_once_scrolled_into_view
    driver.execute_script("arguments[0].click()",reply)
time.sleep(5)
WebDriverWait(driver, 30).until(
        EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# print the total number of comments
print(len(driver.find_elements_by_xpath("//*[@id='content-text']")))
Answered 2020-07-08T05:00:25.907


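A few things:

  • The WebElements on https://www.youtube.com/ are dynamic, and so are the comments, which are rendered dynamically.
  • On the page https://www.youtube.com/watch?v=N0lxfilGfak, the comments are not rendered until the user scrolls a particular element into the viewport (the code below scrolls the Subscribe button into view for this purpose).

  • The comments are inside: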

    <!--css-build:shady-->
    

    Where it applies, Polymer CSS Builder is used to apply Polymer's CSS Mixin shim and ShadyDOM scoping. So, under the default settings, some runtime work is still needed to convert CSS selectors.


Considering the points above, here is a solution to retrieve all the comments:

Code block:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException, WebDriverException
import time

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=N0lxfilGfak')
driver.execute_script("return scrollBy(0, 400);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
comments = []
my_length = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']"))))
while True:
    try:
        driver.execute_script("window.scrollBy(0,800)")
        time.sleep(5)
        comments.append([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']")))])
    except TimeoutException:
        driver.quit()
        break
print(comments)
Answered 2020-07-09T23:43:25.917


I'm not familiar with Python, but I'll describe the steps I would take to get all the comments. First, in your code, I think the main issue is with

CYCLES = 7

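With this, you scroll only 7 times, pausing 2 seconds after each scroll. Since you already fetch 60 comments successfully, fixing this condition should solve your problem.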
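
One way to drop the fixed cycle count, sketched here under assumptions, is to keep scrolling until the number of loaded comments stops growing for a few rounds in a row. The function name `scroll_until_stable` and its parameters are hypothetical; `get_count` and `scroll_once` are callables you supply, so the helper itself does not call Selenium.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_count: Callable[[], int],
    scroll_once: Callable[[], None],
    pause: float = 2.0,
    max_idle_rounds: int = 3,
    max_rounds: int = 500,
) -> int:
    """Scroll until the loaded-comment count stops growing.

    get_count and scroll_once are hypothetical callables supplied by the
    caller (for example, thin wrappers around Selenium calls).
    """
    last_count = get_count()
    idle_rounds = 0
    for _ in range(max_rounds):  # hard cap so the loop always terminates
        if idle_rounds >= max_idle_rounds:
            break  # nothing new appeared for several rounds in a row
        scroll_once()
        time.sleep(pause)  # give the page time to load more comments
        count = get_count()
        if count > last_count:
            last_count = count
            idle_rounds = 0
        else:
            idle_rounds += 1
    return last_count
```

With the `driver` and `html` objects from the question, `get_count` could be `lambda: len(driver.find_elements_by_xpath('//*[@id="content-text"]'))` and `scroll_once` could be `lambda: html.send_keys(Keys.END)`. The idle-round threshold is a guess; a slow connection may need a longer pause or more idle rounds.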

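I assume you have no trouble finding elements on the page with locators.

  1. Get the total comment count into a variable as an int. (In your case, say COMMENTS = 715.)

  2. Define another variable, VISIBLECOUNTS = 0.

  3. Use a while loop to keep scrolling while COMMENTS > VISIBLECOUNTS.

  4. The code might look something like this (sorry if there are syntax problems):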

    # Python / Selenium: scroll until all comment threads are loaded.
    # 715 is only a sample value; use the video's actual total comment count.
    COMMENTS = 715
    VISIBLECOUNTS = 0
    SCROLL_PAUSE_TIME = 2

    # Caution: the loaded count may never reach COMMENTS (the question saw 709 of 717).
    while VISIBLECOUNTS < COMMENTS:
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
        VISIBLECOUNTS = len(driver.find_elements_by_xpath('//ytm-comment-thread-renderer'))
    

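    This way you keep scrolling down until COMMENTS = VISIBLECOUNTS. Then you can fetch all the comments, since they all share the same element, ytm-comment-thread-renderer.

    Since I'm not familiar with Python, I'll add the commands for getting the comment counts in JS. You can try them in your browser and convert them into Python commands.

Run the queries below in the console and check.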

// To get the total comment count
var comments = document.querySelector(".comment-section-header-text").innerText.split(" ")
// The text value is "Comments • 715"; split it by spaces and take the last value

Number(comments[comments.length - 1])
// Then convert the string "715" to an int; you just need to do the same in Python/Selenium

// To get the active (currently loaded) comment count
$x("//ytm-comment-thread-renderer").length

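Note: if the values are hard to extract, you can still use Selenium's JS executor and scroll with JS until all the comments are visible. But I suppose it isn't hard to do in Python, since the logic is the same.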
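
As a rough Python counterpart to the console snippet above, the header text can be turned into an int with a small helper. This is a sketch: `parse_comment_count` is a hypothetical name, and stripping thousands separators is an assumption about how larger counts are formatted. The input would be whatever text Selenium returns for the header, for example the `"717 Comments"` string printed in the question.

```python
import re


def parse_comment_count(header_text: str) -> int:
    """Extract the first number from text such as 'Comments • 715' or '717 Comments'."""
    match = re.search(r"\d[\d,]*", header_text)
    if match is None:
        raise ValueError(f"no number found in {header_text!r}")
    # Assumption: commas are thousands separators, e.g. '1,234'.
    return int(match.group().replace(",", ""))
```

Taking the first run of digits, rather than the last space-separated token as the JS does, lets the same helper handle both header layouts that appear in this thread.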

I'm really sorry that I couldn't add the solution in Python, but I hope this helps. Cheers.

Answered 2020-07-09T18:50:25.900


If you don't have to use Selenium, I'd suggest looking at the Google/YouTube API.

https://developers.google.com/youtube/v3/getting-started

Example:

https://www.googleapis.com/youtube/v3/commentThreads?key=YourAPIKey&textFormat=plainText&part=snippet&videoId=N0lxfilGfak&maxResults=100

This gives you the first 100 results along with a token that you can append to the next request to get the next 100 results.
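
The token-based paging described above might be sketched in Python as follows. This is a minimal sketch, not a definitive client: the helper names `fetch_json` and `iter_top_level_comments` are hypothetical, and the field names (`items`, `nextPageToken`, the `pageToken` parameter, and `snippet.topLevelComment.snippet.textDisplay`) reflect my understanding of the commentThreads response, so check them against the API reference. It reads only top-level comments, not replies.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def fetch_json(url: str) -> dict:
    """Download a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_top_level_comments(
    video_id: str,
    api_key: str,
    fetch: Callable[[str], dict] = fetch_json,
) -> Iterator[str]:
    """Yield the text of each top-level comment, following nextPageToken."""
    params = {
        "key": api_key,
        "textFormat": "plainText",
        "part": "snippet",
        "videoId": video_id,
        "maxResults": "100",
    }
    while True:
        page = fetch(API_URL + "?" + urllib.parse.urlencode(params))
        for item in page.get("items", []):
            yield item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        token = page.get("nextPageToken")
        if not token:
            break  # last page reached
        params["pageToken"] = token  # ask for the next page
```

Calling `list(iter_top_level_comments("N0lxfilGfak", "YourAPIKey"))` would then walk every page; the `fetch` parameter exists only so the paging logic can be exercised without network access.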

Answered 2020-07-08T13:45:11.923