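I am trying to parse each link in a set of links generated with a Python library called Newspaper.
Goal:
Parse every link from a news site's home page (or from a specific page, such as a category).
Problems:
- When I try to pass "article_link" to "Article()", I get an AttributeError.
- When I parse a single link from "The New York Times" with separate code, the printed text does not contain the whole article.
Code that produces problem 1: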
import newspaper
from newspaper import Article

nyt_paper = newspaper.build(
    'http://nytimes.com/section/todayspaper', memoize_articles=False)
print(nyt_paper.size())

processed_link_list = []
for article_link in nyt_paper.articles:
    article = Article(url=article_link)
    article.download()
    article.html
    article.parse()
    print(article.authors)
    processed_link_list.append(article_link)

if len(nyt_paper.size()) is len(processed_link_list):
    print('All Links Processed')
else:
    print('All Links **NOT** Processed')
Error output:
Traceback (most recent call last):
  File "nyt_today.py", line 31, in <module>
    article = Article(url=article_link)
  File "C:\...\lib\site-packages\newspaper\article.py", line 60, in __init__
    scheme = urls.get_scheme(url)
  File "C:\...\lib\site-packages\newspaper\urls.py", line 279, in get_scheme
    return urlparse(abs_url, **kwargs).scheme
  File "C:\...\lib\urllib\parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\...\lib\urllib\parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\...\lib\urllib\parse.py", line 107, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\...\lib\urllib\parse.py", line 107, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'Article' object has no attribute 'decode'
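Reading the traceback, the value that reaches urlparse is an 'Article' object rather than a URL string, which suggests that each item in nyt_paper.articles is already an Article and that its address is available as the item's url attribute. A minimal sketch of the loop under that assumption follows; article_url and process_source are hypothetical helper names introduced only for illustration, and the import of newspaper is kept inside the function so the helper can be tried without the library installed.

```python
def article_url(item):
    """Return a URL string for a plain string or for an object with a `.url` attribute."""
    if isinstance(item, str):
        return item
    return item.url


def process_source(source_url):
    """Download and parse every article found on `source_url` (assumes newspaper is installed)."""
    # Imported lazily so article_url() above works without the library.
    import newspaper
    from newspaper import Article

    paper = newspaper.build(source_url, memoize_articles=False)
    processed = []
    for item in paper.articles:
        # Assumption: `item` is an Article object, so pass its URL string, not the object.
        article = Article(url=article_url(item))
        article.download()
        article.parse()
        print(article.authors)
        processed.append(article.url)

    # size() is used as a count in the original print() call, so compare it
    # directly with == instead of wrapping it in len() and using `is`.
    if paper.size() == len(processed):
        print('All Links Processed')
    else:
        print('All Links **NOT** Processed')
    return processed
```

Calling process_source('http://nytimes.com/section/todayspaper') would mirror the original script. If the items really are Article objects, calling download() and parse() on each item directly, without building a second Article, may be enough.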
Code that produces problem 2:
from newspaper import Article
from newspaper import fulltext
import requests
nyt_url = 'https://www.nytimes.com/2019/02/26/opinion/trump-kim-vietnam.html'
article = Article(nyt_url)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.text)
I also tried the "fulltext" approach from the example in the documentation to print the text:
article_html = requests.get(nyt_url).text
full_text = fulltext(article_html)
print(full_text)
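However, although the full article text appears in the output of
print(article.html)
the call
print(article.text)
does not print all of it. The original link, the HTML output, and the printed text output are shown below: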
Link: https://www.nytimes.com/2019/02/26/opinion/trump-kim-vietnam.html
HTML output: see this pastebin for the truncated output
Printed text: see this printed text, which does not contain the whole article
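To quantify how much text goes missing, one rough cross-check is to pull every <p> element out of the downloaded HTML and compare the result with article.text. The sketch below uses only the standard library; ParagraphCollector and paragraphs_from_html are hypothetical names, not part of Newspaper. The result will also include navigation and footer paragraphs, so it is meant as a diagnostic and not as a replacement extractor.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while between <p> and </p>
        self._chunks = []      # text fragments of the current paragraph
        self.paragraphs = []   # finished paragraphs, in document order

    def _flush(self):
        # Join the fragments gathered so far and store them if non-empty.
        text = ''.join(self._chunks).strip()
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._flush()      # also closes a previous <p> left unclosed
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def paragraphs_from_html(html):
    """Return a list with the text of each <p> element found in `html`."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()         # keep a trailing paragraph that was never closed
    return collector.paragraphs
```

With the code for problem 2, comparing len('\n'.join(paragraphs_from_html(article.html))) with len(article.text) after article.parse() should show whether the paragraphs are present in the HTML but dropped during extraction.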
Any help would be greatly appreciated.