python - Dynamic web scraping with Helium

Question

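I have a web page that contains links to multiple articles, and I want to visit each of those articles and extract the text it contains. For this I used the Helium Python package and wrote a script, but I keep running into the same error.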

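Below is the script I am using. I am basically trying to extract all the paragraph tags and build a Word document from them. It works fine when I test it on a single article, but running it in this loop produces the error shown further down.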

from helium import *
import time
from docx import Document
from docx.shared import Inches

document = Document()

start_chrome('some url', headless = True)

time.sleep(5)
article_list = find_all(S('a'))

for article in article_list:
    url = article.web_element.get_attribute('href')
    if url.startswith('some substring'):
        go_to(url)
        time.sleep(5)
        paragraph_list = find_all(S('p'))
        for paragraph in paragraph_list:
            document.add_paragraph(paragraph.web_element.text)

This is the error I keep getting:

StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-10-7a524350ae24> in <module>()
      1 for article in article_list:
----> 2     url = article.web_element.get_attribute('href')
      3     print(url)
      4     if url.startswith('some url'):
      5         go_to(url)

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: headless chrome=86.0.4240.198)
  (Driver info: chromedriver=2.38.552522 (437e6fbedfa8762dec75e2c5b3ddb86763dc9dcb),platform=Windows NT 10.0.19041 x86_64)

I am quite new to web scraping, so I don't know whether I am missing something simple. Any help here would be greatly appreciated.

score 1 · Accepted Answer

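I was able to solve this. I think the problem was that the link elements I was reading the URLs from had gone stale: once go_to(url) loads an article, the elements in article_list belong to a page that is no longer in the browser, so reading href from the next one raises StaleElementReferenceException. A better approach is to collect all the URLs into a list first and work from that list, rather than reading each URL from the element (article) inside the loop. The code is as follows: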

from helium import *
import time
from docx import Document

document = Document()

start_chrome('some url', headless = True)

time.sleep(5)
article_list = find_all(S('a'))

# Read every href up front, while the link elements are still attached to the page.
href_list = [article.web_element.get_attribute('href') for article in article_list]

for href in href_list:
    # Anchors without an href attribute yield None, so skip those.
    if href and href.startswith('some substring'):
        go_to(href)
        time.sleep(5)
        paragraph_list = find_all(S('p'))
        for paragraph in paragraph_list:
            document.add_paragraph(paragraph.web_element.text)

document.save('Extract.docx')
kill_browser()
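
The URL filtering step can also be pulled out into a small pure function, which makes it easy to check without starting a browser. The following is only a sketch: select_article_urls and its parameters are hypothetical names that are not part of Helium, and the example.com URLs are placeholders. It assumes the input is the list of href values already read from the link elements, skips entries that are None or empty, keeps only those starting with the given prefix, and drops duplicates while preserving order so the same article is not visited twice.

```python
from typing import Iterable, List, Optional


def select_article_urls(hrefs: Iterable[Optional[str]], prefix: str) -> List[str]:
    """Return the unique hrefs that start with `prefix`, in first-seen order.

    Hypothetical helper for illustration; not part of Helium or Selenium.
    """
    seen = set()
    selected = []
    for href in hrefs:
        # Skip missing or empty hrefs and links outside the wanted prefix.
        if not href or not href.startswith(prefix):
            continue
        # Skip links that were already collected.
        if href in seen:
            continue
        seen.add(href)
        selected.append(href)
    return selected


if __name__ == "__main__":
    # Placeholder data standing in for the values read from the page.
    sample = [
        "https://example.com/articles/a",
        None,
        "https://example.com/about",
        "https://example.com/articles/a",
        "https://example.com/articles/b",
    ]
    for url in select_article_urls(sample, "https://example.com/articles/"):
        print(url)
```

In the script above, the loop could then iterate over select_article_urls(href_list, 'some substring') instead of testing each href inline.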
