
I am trying to download the text of an article that I can open without problems in a web browser (for example, Safari).

The error is:

newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830 on URL https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830

Here is the code:

from newspaper import Article, Config

# Safari user agent string sent with the request (this is the one that gets the 403).
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'

config = Config()
config.browser_user_agent = user_agent

url = "https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830".strip()

page = Article(url, config=config)

page.download()
page.parse()  # the ArticleException is raised here, because the download failed
print(page.text)

As you can see, I tried the solution from this Stack Overflow answer, but it did not work.

Full error log:

/Users/mona/anaconda3/bin/python /Users/mona/multimodal/newspaper_pg.py
Traceback (most recent call last):
  File "/Users/mona/multimodal/newspaper_pg.py", line 18, in <module>
    page.parse()
  File "/Users/mona/anaconda3/lib/python3.6/site-packages/newspaper/article.py", line 191, in parse
    self.throw_if_not_downloaded_verbose()
  File "/Users/mona/anaconda3/lib/python3.6/site-packages/newspaper/article.py", line 532, in throw_if_not_downloaded_verbose
    (self.download_exception_msg, self.url))
newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830 on URL https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830

Process finished with exit code 1

I got my user agent string from this site: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/macos/


1 Answer


The user agent that worked for me was Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0

You can find yours here: https://www.whatismybrowser.com/detect/what-is-my-user-agent

from newspaper import Article, Config

# Firefox user agent string that the site accepted for me.
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = user_agent

url = "https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830".strip()

page = Article(url, config=config)

page.download()
page.parse()
print(page.text)
Answered 2020-07-23T18:25:36.243
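
If it is not obvious which user agent string a site will accept, one option is to probe a few candidates before handing one to newspaper. The following is only a minimal sketch using the standard library. The function name `first_accepted_user_agent` and its `opener` parameter are hypothetical names introduced for illustration, and it assumes the 403 really is caused by the User-Agent header. A site can also reject requests for other reasons, such as IP address or cookies, in which case no candidate will pass.

```python
import urllib.error
import urllib.request


def first_accepted_user_agent(url, candidates, opener=urllib.request.urlopen, timeout=10):
    """Return the first User-Agent in `candidates` that gets an HTTP 200, else None.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen) so the
    function can be exercised without touching the network.
    """
    for user_agent in candidates:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            with opener(request, timeout=timeout) as response:
                if response.status == 200:
                    return user_agent
        except urllib.error.HTTPError as error:
            # A 403 means this header was rejected, so move on to the next one.
            # Any other HTTP error is unrelated to this check and is re-raised.
            if error.code != 403:
                raise
    return None
```

The returned string, if any, can then be assigned to `config.browser_user_agent` exactly as in the code above.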