python - Python newspaper - cannot extract the article if the URL is not in English

Question

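I am trying to get the content of news articles with the Python newspaper module. The code below finds the body of a news item: it uses feedparser to parse the feed URL held in the variable feed_url, then uses the newspaper module to look up the article body and publish date.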

import newspaper
from newspaper import Article
import feedparser
import urllib.parse

count = 0
feed_url = "https://www.extremetech.com/feed"
# feed_url = "http://www.prothomalo.com/feed/"
d = feedparser.parse(feed_url)
for post in d.entries:
    count += 1
    if count == 2:  # stop after the first entry
        break

    # post_link = post.link
    # Added later to decode the encoded URL into the original Bengali language
    post_link = urllib.parse.unquote(post.link)
    print("count= ", count, " url = ", post_link, end="\n ")

    try:
        content = Article(post_link)
        content.download()
        content.parse()
        print(" content = ", end=" ")
        print(content.text[0:50])
        print(" content.publish_date = {}".format(content.publish_date))
    except Exception as e:
        print(e)

The code lists two different values for the variable feed_url: one from the extremetech site and one from the prothomalo site.

For example, extremetech has a news item (which I get through feedparser.parse) whose URL is https://www.extremetech.com/computing/263951-mit-announces-new-neural-network-processor-cuts-power-consumption-95. I can easily get the article body and publish date for this URL.

However, prothomalo has a news item whose URL (from feedparser.parse) is http://www.prothomalo.com/sports/article/1432086/%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93-%E0%A6%B9%E0%A6%BE%E0%A6%B0.

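The actual URL does not look like this on the prothomalo site, though. If you visit the URL, you will see that it changes to Bengali. I think the reason behind this encrypted (?) URL is that parts of the URL are in Bengali. The content there is also in Bengali.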
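
The "encrypted" form described above looks like ordinary percent-encoding of the UTF-8 bytes of the Bengali characters, not encryption. A minimal sketch using only the standard library shows the round trip; the slug is copied from the URL above, and the variable names are illustrative only:

```python
from urllib.parse import quote, unquote

# Percent-encoded slug copied from the prothomalo URL in the question.
encoded_slug = (
    "%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A"
    "-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0"
    "-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93"
    "-%E0%A6%B9%E0%A6%BE%E0%A6%B0"
)

# unquote() turns each %XX run back into the UTF-8 text it stands for.
decoded_slug = unquote(encoded_slug)

# quote() goes the other way and yields the ASCII-only form again.
reencoded_slug = quote(decoded_slug)

print(decoded_slug)                    # the Bengali slug shown in the browser
print(reencoded_slug == encoded_slug)  # True: both forms name the same path
```

Both spellings therefore refer to the same resource; the browser simply displays the decoded one.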

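The Python newspaper module can extract the content and publish date from the extremetech site but not from prothomalo. Is the failure caused by the non-English characters in the prothomalo URL?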

How can I get the news content, publish date, and so on from the prothomalo site, that is, from sites whose URLs may contain non-English characters?

Edit 1: I can decode prothomalo's encoded URL back into the original Bengali with the line post_link = urllib.parse.unquote(post.link). I still cannot get the content and publish date.
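
Since both the encoded and the decoded URL were tried, one thing worth ruling out is the URL form itself. The sketch below normalises either form to the same ASCII-only URL using the standard library. The helper name `to_percent_encoded` is hypothetical and not part of newspaper or feedparser, and the sketch assumes an ASCII host name and no escaped reserved characters (such as `%2F`) in the path. It does not by itself explain why extraction fails.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_percent_encoded(url: str) -> str:
    """Return url with its path and query percent-encoded (hypothetical helper)."""
    parts = urlsplit(url)
    # Decode first so that escapes already present are not encoded twice.
    # Assumption: the path holds no escaped reserved characters such as %2F.
    path = quote(unquote(parts.path), safe="/")
    query = quote(unquote(parts.query), safe="=&")
    # Assumption: the host name is plain ASCII, so netloc is left untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# The prothomalo URL from the question, in its percent-encoded form.
encoded_url = (
    "http://www.prothomalo.com/sports/article/1432086/"
    "%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A"
    "-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0"
    "-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93"
    "-%E0%A6%B9%E0%A6%BE%E0%A6%B0"
)
decoded_url = unquote(encoded_url)  # the Bengali form produced in Edit 1

# Either spelling normalises to the same ASCII-only URL.
print(to_percent_encoded(decoded_url) == encoded_url)
print(to_percent_encoded(encoded_url) == encoded_url)
```

If the normalised URL fails in the same way as the decoded one, the URL spelling is unlikely to be the deciding factor, and the page markup or language handling would be the next thing to inspect.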
