python - Selenium between HTML tags

Question

What is the best way to get all of the HTML in a page that is created by Javascript, so that it can be passed to BeautifulSoup?

I am currently using:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

from BeautifulSoup import BeautifulSoup

browser = webdriver.Firefox()
browser.get("http://www.yahoo.co.uk")
html = browser.find_elements_by_id("html")

But "html" is always an empty list. What am I doing wrong?

score 4 · Accepted Answer

The correct way to pass the page source from Selenium to Beautiful Soup is:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

from BeautifulSoup import BeautifulSoup

browser = webdriver.Firefox()
browser.get("http://www.yahoo.co.uk")
html_source = browser.page_source
html = BeautifulSoup(html_source)

This way the browser loads the page, the complete HTML source is extracted, and it is handed to BeautifulSoup. The result can then be parsed like any other Beautiful Soup object.
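For readers on current library versions, here is a minimal sketch of the same idea. It is not part of the original answer and rests on a few assumptions: that Beautiful Soup 4 is installed (imported as `bs4`), that Selenium can start Firefox, and that `document.readyState == "complete"` is a good enough signal that the page has loaded. The function name `rendered_soup` and the `timeout` parameter are made up for illustration.

```python
def rendered_soup(url, timeout=10):
    """Load url in Firefox, wait for the document to load, return the parsed HTML."""
    # Imported inside the function so the sketch can be defined even where
    # Selenium or bs4 is not installed; move these to module level in real code.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # Assumption: readyState "complete" means the page is ready. Pages that
        # keep building content afterwards need a wait on a specific element.
        WebDriverWait(browser, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        # page_source is an attribute, not a method call.
        return BeautifulSoup(browser.page_source, "html.parser")
    finally:
        # Always close the browser, even if loading or parsing fails.
        browser.quit()
```

Calling `rendered_soup("http://www.yahoo.co.uk")` should return a Beautiful Soup object that can be searched with the usual methods such as `find_all`.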

score 2 · Accepted Answer

You can also use something like this:

html_source = browser.page_source

This is an attribute provided by the webdriver precisely for collecting the complete source, in other words for "getting all of the HTML in a page".

score 2 · Accepted Answer

"html" is not an id. It should look like this:

html = browser.find_elements_by_tag_name("html")

because html is a tag.

The search you originally ran returns every element whose id is set to "html". An example of an element that it would return:

<p id="html">Lorem ipsum</p>

That element's id is "html" and its tag name is "p".
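The difference between the two lookups can be sketched without a browser. The example below is only an illustration built on the standard library's `html.parser`, not Selenium itself; the `ElementCollector` class and the `find_by_id` / `find_by_tag_name` helpers are hypothetical names that mimic what the two Selenium searches match.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Record a (tag_name, id) pair for every start tag in the document."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag_name, id_or_None)

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs).get("id")))


def find_by_id(elements, wanted_id):
    """Return the tag names of elements whose id attribute equals wanted_id."""
    return [tag for tag, el_id in elements if el_id == wanted_id]


def find_by_tag_name(elements, wanted_tag):
    """Return the tag names of elements whose tag equals wanted_tag."""
    return [tag for tag, _ in elements if tag == wanted_tag]


def collect(markup):
    collector = ElementCollector()
    collector.feed(markup)
    return collector.elements


with_id = collect('<html><body><p id="html">Lorem ipsum</p></body></html>')
without_id = collect("<html><body><p>Lorem ipsum</p></body></html>")

by_id = find_by_id(with_id, "html")            # matches the <p>, not the root
by_tag = find_by_tag_name(with_id, "html")     # matches the root <html> element
missing = find_by_id(without_id, "html")       # no element has id="html"

print(by_id, by_tag, missing)  # ['p'] ['html'] []
```

The last case mirrors the question: a page with no element carrying id="html" yields an empty list. In Selenium 4 the same two lookups are written `browser.find_elements(By.ID, "html")` and `browser.find_elements(By.TAG_NAME, "html")`, with `By` imported from `selenium.webdriver.common.by`.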
