rss - How to read the content of a website in Python

Question

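I am trying to write a program that reads the articles (posts) of any website, ranging from Blogspot or Wordpress blogs to any other site. To keep the code compatible with almost any site, whether it is written in HTML5, XHTML, or something else, I want to use RSS/Atom feeds as the basis for extracting content.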

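However, since RSS/Atom feeds often do not contain a site's full articles, I want to collect all the "post" links from the feed using feedparser and then extract the article content from each URL.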

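I can already get the URLs of all the articles on a site (along with the summaries, i.e. the article content shown in the feed), but I want the full article data, and for that I have to use the respective URLs.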

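I have come across various libraries such as BeautifulSoup, lxml, and so on (various HTML/XML parsers), but I really don't know how to get the "exact" content of an article (by "exact" I mean the data with all hyperlinks, iframes, slideshows, etc. still present; I don't want the CSS part).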

So, can anyone help me with this?

score 3 · Accepted Answer

Getting the HTML code of all the linked pages is pretty easy.

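The hard part is extracting exactly the content you are looking for. If you simply need all the code inside the <body> tag, that shouldn't be a big problem either; extracting all the text is equally simple. But if you want a more specific subset, you have more work to do.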
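
To illustrate the "extract all the text" case, here is a minimal sketch that uses only the standard library's html.parser instead of the third-party modules recommended in this answer. The class name BodyTextExtractor, the helper visible_body_text, and the sample HTML are made up for the illustration; real pages are messier, so treat this as a starting point rather than a robust extractor.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False   # True while we are between <body> and </body>
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.chunks = []       # non-empty pieces of visible text

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def visible_body_text(html):
    """Return the visible text of the <body>, joined by single spaces."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, only for demonstration
sample_html = """<html><head><title>Ignored</title><style>p {color: red}</style></head>
<body><h1>Post title</h1><p>First paragraph with a <a href="/next">link</a></p>
<script>var x = 1;</script></body></html>"""

print(visible_body_text(sample_html))
```

Anything more specific than this (say, only the article container of a particular blog theme) needs site-specific rules, which is the extra work mentioned above.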

I suggest that you install the requests and BeautifulSoup modules (both are available via easy_install requests/bs4 or, better, pip install requests/bs4). The requests module makes fetching your pages really easy.

The following example fetches an RSS feed and returns three lists:

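- linksoups is a list of the BeautifulSoup instances of each page linked from the feed
- linktexts is a list of the visible text of each page linked from the feed
- linkimageurls is a list of lists containing the src URLs of all the images embedded in each page linked from the feed
  - e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]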

import requests, bs4
from urllib.parse import urljoin

FEED_URL = 'http://rss.slashdot.org/Slashdot/slashdot'

# request the content of the feed and create a BeautifulSoup object from its content;
# the 'xml' parser (needs lxml installed) is used because HTML parsers treat <link>
# as an empty tag and would lose the url inside it
response = requests.get(FEED_URL)
responsesoup = bs4.BeautifulSoup(response.content, 'xml')

linksoups = []
linktexts = []
linkimageurls = []

# iterate over all <link>...</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
    href = link.text.strip()
    if not href:
        continue  # e.g. Atom-style <link href="..."/> tags carry no text
    url = urljoin(FEED_URL, href)  # resolves relative urls against the feed url
    linkresponse = requests.get(url)
    soup = bs4.BeautifulSoup(linkresponse.text, 'html.parser')
    linksoups.append(soup)

    # append all text between tags inside of the body tag to the second list
    body = soup.find('body')
    linktexts.append(body.text if body is not None else '')

    # get the src attribute of each <img> tag that has one and collect them
    imageurls = [image['src'] for image in soup.find_all('img') if image.has_attr('src')]
    linkimageurls.append(imageurls)

# now somehow merge the retrieved information.

This might be a rough starting point for your project.
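
As for the "somehow merge" step, one possible approach is to zip the parallel lists into one dict per page and turn relative image paths into absolute URLs along the way. The sketch below is only an illustration under assumptions: merge_pages and page_urls are hypothetical names (the example above does not keep a list of page URLs, so you would have to collect one inside the loop), and the sample data is made up so the snippet runs without network access.

```python
from urllib.parse import urljoin


def merge_pages(page_urls, texts, image_url_lists):
    """Combine the parallel lists into one dict per page."""
    records = []
    for page_url, text, image_urls in zip(page_urls, texts, image_url_lists):
        records.append({
            "url": page_url,
            "text": text,
            # urljoin leaves absolute URLs untouched and resolves relative ones
            # against the URL of the page they were found on
            "images": [urljoin(page_url, src) for src in image_urls],
        })
    return records


# Made-up sample data in the shape of the lists from the example
page_urls = ["http://example.com/pageone/", "http://example.com/pagetwo/"]
linktexts = ["Text of page one", "Text of page two"]
linkimageurls = [["/pageone/img1.jpg", "img2.png"], ["http://cdn.example.com/logo.bmp"]]

records = merge_pages(page_urls, linktexts, linkimageurls)
for record in records:
    print(record["url"], record["images"])
```

Keeping one record per page means you no longer have to keep three lists in sync by index when you filter or store the results.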
