python - How do I get the HTML of a wiki page with Pywikibot?

Question

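I'm using pywikibot-core. Before that I used Wikipedia.py, another Python MediaWiki API wrapper, which has an .HTML method. I switched to pywikibot-core because I think it has more features, but I can't find a similar method. (Note: I'm not very experienced.)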

score 5 · Accepted Answer

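I'm posting user283120's second answer here, since it is more precise than the first:

Pywikibot core does not support any direct (HTML) way of interacting with the wiki, so you should use the API. If you need the HTML anyway, it is easy to get with urllib2.

This is an example I used to get the HTML of a wiki page on Commons:

import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ","_")
html = urllib2.urlopen(url).read().decode('utf-8')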
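On Python 3 the urllib2 module no longer exists; its functionality lives in urllib.request. The following is a minimal sketch of the same idea under that assumption. It takes a plain title string rather than a Pywikibot page object, and the names page_url and fetch_html as well as the User-Agent string are illustrative, not part of any library:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "https://commons.wikimedia.org/wiki/"


def page_url(title, base=BASE):
    """Build the article URL for a page title (spaces become underscores)."""
    # Percent-encode anything that is not safe in a URL path.
    return base + quote(title.replace(" ", "_"), safe=":/()_,")


def fetch_html(title, base=BASE):
    """Download the rendered page as text; needs network access."""
    # Send a descriptive User-Agent (placeholder value here).
    req = Request(page_url(title, base),
                  headers={"User-Agent": "html-fetch-example/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_html("Main Page") would then return the full page, including the site skin, as a single string.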

score 1 · Accepted Answer

"[saveHTML.py] downloads the HTML pages of articles and images and saves the interesting parts (i.e. the article text and the footer) to a file"

Source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py

score 1 · Accepted Answer

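IIRC you want the HTML of the whole page, so you need something that uses api.php?action=parse. In Python I often just use wikitools for this kind of thing; I don't know about PWB or your other requirements.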
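For readers who prefer not to depend on a wrapper, here is a rough sketch of an action=parse request using only the standard library. The endpoint URL, the helper names parse_url and fetch_parsed_html, the User-Agent string, and the choice of formatversion=2 are all assumptions made for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def parse_url(title, api=API):
    """Build an action=parse request URL for the given page title."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }
    return api + "?" + urlencode(params)


def fetch_parsed_html(title, api=API):
    """Return the parser's HTML for a page; needs network access."""
    req = Request(parse_url(title, api),
                  headers={"User-Agent": "parse-example/0.1"})
    with urlopen(req) as resp:
        data = json.load(resp)
    if "error" in data:
        raise RuntimeError(data["error"].get("info", "API error"))
    # With formatversion=2 the HTML is a plain string under parse.text.
    return data["parse"]["text"]
```

Note that the parse action returns the rendered content of the page, not the surrounding site skin.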

score 1 · Accepted Answer

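In general you should use pywikibot rather than wikipedia (e.g. instead of "import wikipedia" you should use "import pywikibot"). If you are looking for the methods and classes that used to be in wikipedia.py, they have been split up and can now be found in the pywikibot folder (mainly in page.py and site.py).

If you want to run scripts that you wrote for compat, you can use the script named compat2core.py in pywikibot-core (in the scripts folder). There is also detailed help on the conversion in README-conversion.txt; please read it carefully.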

score 1 · Accepted Answer

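The MediaWiki API has a parse action that returns the HTML fragment produced by the MediaWiki markup parser for a piece of wiki markup.

For the pywikibot library, a function has already been implemented that you can use like this: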

def getHtml(self,pageTitle):
        '''
        get the HTML code for the given page Title
        
        Args:
            pageTitle(str): the title of the page to retrieve
            
        Returns:
            str: the rendered HTML code for the page
        '''
        page=self.getPage(pageTitle)
        html=page._get_parsed_page()
        return html

When using the mwclient Python library, there is a generic api method, see: https://github.com/mwclient/mwclient/blob/master/mwclient/client.py

It can be used to retrieve the HTML code like this:

def getHtml(self,pageTitle):
        '''
        get the HTML code for the given page Title
        
        Args:
            pageTitle(str): the title of the page to retrieve
        '''
        api=self.getSite().api("parse",page=pageTitle)
        if "parse" not in api:
            raise Exception("could not retrieve html for page %s" % pageTitle)
        html=api["parse"]["text"]["*"]
        return html

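As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, of which I am a committer. This was resolved by closing issue 38 - add html page retrieval.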
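The lookup api["parse"]["text"]["*"] matches the default JSON layout of the parse action. If a request is made with formatversion=2 instead, the text field is a plain string. A small sketch of a helper that accepts either shape; extract_html is a hypothetical name and not part of mwclient or py-3rdparty-mediawiki:

```python
def extract_html(response):
    """Pull the HTML string out of an action=parse response given as a dict."""
    if "parse" not in response:
        raise KeyError("response has no 'parse' section")
    text = response["parse"]["text"]
    # Default layout: {"*": "<html>"}; formatversion=2: a plain string.
    if isinstance(text, dict):
        return text["*"]
    return text
```

This keeps the calling code the same regardless of which response format the request asked for.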

score 0 · Accepted Answer

With Pywikibot you can use http.request() to get the HTML content:

import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])

This should print part of the HTML content:

'<title>Elvis Presley – Wikipedia</title>\n'

As of Pywikibot 6.0, http.request() returns a requests.Response object instead of plain text. In that case you have to use the text attribute:

print(r.text[94:135])

to get the same result.
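If a script has to run on Pywikibot versions from both before and after that change, a small helper can hide the difference. This is only a sketch: response_text is a hypothetical name, and the helper assumes nothing beyond what is described above, namely that the older return value is a str and the newer one exposes a text attribute:

```python
def response_text(result):
    """Return the body as str for either a str or a Response-like object."""
    if isinstance(result, str):
        # Older behaviour: the request already returned plain text.
        return result
    # Newer behaviour: a response object carrying the body in .text.
    return result.text
```

With this in place, print(response_text(r)[94:135]) would work in both cases.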
