wikipedia-api - How to get Wikipedia content using Wikipedia's API?

Question

I want to get the first paragraph of a Wikipedia article.

What is the API query to do so?

score 64 · Accepted Answer

These are the key parameters.

prop=revisions&rvprop=content&rvsection=0

rvsection=0 specifies that only the lead section is returned.

See this example.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=pizza

To get the HTML, you can similarly use action=parse: http://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&page=pizza

Note that you will still have to strip out any templates or infoboxes.
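A minimal Python sketch of the revisions query above, under stated assumptions: the helper names `lead_wikitext_url` and `lead_wikitext` are made up for illustration, and the parser assumes the default JSON format, in which the wikitext of a revision sits under the `*` key. The returned text is raw wikitext, so templates and infoboxes are still in it.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def lead_wikitext_url(title):
    """Build the query URL for the wikitext of section 0 of `title`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,  # 0 = lead section only
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def lead_wikitext(response_text):
    """Return the lead-section wikitext from a JSON response body, or None."""
    pages = json.loads(response_text)["query"]["pages"]
    page = next(iter(pages.values()))  # pages are keyed by page ID
    revisions = page.get("revisions")
    if not revisions:
        return None  # missing page, or no content returned
    return revisions[0].get("*")
```

The URL can be fetched with any HTTP client that sends a descriptive User-Agent, and the response body passed to `lead_wikitext`.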

score 41 · Accepted Answer

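See "Is there a Wikipedia API just for retrieving the content summary?" for other proposed solutions. Here is the one I suggested:

There is actually a very nice prop called extracts that can be used with queries and is designed specifically for this purpose. Extracts let you get article extracts (truncated article text). There is a parameter called exintro that retrieves the text of the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Note that if you want the first paragraph specifically, you still need to get the first <p> tag. However, with this API call there are no other assets such as images to parse. If you are happy with this introduction summary, you can get the text by running a function that strips HTML tags, such as PHP's strip_tags.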
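Picking the first paragraph out of the returned HTML can be sketched in Python with the standard library alone. This is an illustration, not part of the original answer: the class `FirstParagraph` and the function `first_paragraph` are hypothetical names, and the input is assumed to be the HTML string found in the `extract` field of the response.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            # Ignore empty paragraphs in case one precedes the real text.
            if "".join(self.parts).strip():
                self.done = True

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected too.
        if self.in_p and not self.done:
            self.parts.append(data)


def first_paragraph(extract_html):
    """Return the plain text of the first non-empty paragraph, or ''."""
    parser = FirstParagraph()
    parser.feed(extract_html)
    parser.close()
    return "".join(parser.parts).strip() if parser.done else ""
```

Because only text nodes are collected, the result is already free of HTML tags, which covers the tag-stripping step as well.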

score 23 · Accepted Answer

I do it this way:

https://en.wikipedia.org/w/api.php?action=opensearch&search=bee&limit=1&format=json

The response you get is an array containing the data, which is easy to parse:

[
  "bee",
  [
    "Bee"
  ],
  [
    "Bees are flying insects closely related to wasps and ants, known for their role in pollination and, in the case of the best-known bee species, the European honey bee, for producing honey and beeswax."
  ],
  [
    "https://en.wikipedia.org/wiki/Bee"
  ]
]

To get just the first paragraph, limit=1 is all you need.
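A short Python sketch of parsing that array, assuming the four-element layout shown above (search term, titles, descriptions, URLs). The function name `parse_opensearch` is made up for illustration. On some wikis the description entries may come back as empty strings, so it is safer to treat them as optional.

```python
import json


def parse_opensearch(response_text):
    """Turn an opensearch JSON response into a list of result dicts."""
    # Layout: [search term, [titles], [descriptions], [urls]]
    _term, titles, descriptions, urls = json.loads(response_text)
    return [
        {"title": title, "description": description, "url": url}
        for title, description, url in zip(titles, descriptions, urls)
    ]
```

With limit=1 the returned list has at most one entry, so `results[0]["description"]` would hold the summary text when the wiki provides one.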

score 8 · Accepted Answer

To get the first paragraph of an article:

https://en.wikipedia.org/w/api.php?action=query&titles=Belgrade&prop=extracts&format=json&exintro=1

I wrote up some short Wikipedia API documentation for my own needs. It has working examples of how to get articles, images, and similar content.

score 5 · Accepted Answer

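If you need to do this for a large number of articles, then instead of querying the website directly, consider downloading a Wikipedia database dump and accessing it through an API such as JWPL.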

score 4 · Accepted Answer

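<script>
    // Fetch the plain-text intro of a Wikipedia article and show it in #Label11.
    function dowiki(place) {
        var URL = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=';

        // Encode the title so spaces and special characters survive in the query string.
        URL += "&titles=" + encodeURIComponent(place);
        // callback=? makes jQuery send the request as JSONP.
        URL += "&callback=?";
        $.getJSON(URL, function (data) {
            // Pages are keyed by page ID, so take the first (only) key.
            var obj = data.query.pages;
            var ob = Object.keys(obj)[0];
            console.log(obj[ob]["extract"]);
            try {
                document.getElementById('Label11').textContent = obj[ob]["extract"];
            }
            catch (err) {
                document.getElementById('Label11').textContent = err.message;
            }

        });
    }
</script>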

score 4 · Accepted Answer

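You can get the introduction of a Wikipedia article by querying a URL such as https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=java. You only need to parse the JSON response; the result is already-cleaned plain text, with links and references removed.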
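The same query can be sketched in Python with only the standard library. The function names and the User-Agent string below are placeholders chosen for illustration; the field layout (`query.pages.<id>.extract`) is the one the query above returns in the default JSON format.

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def intro_url(title):
    """Build the plain-text intro query for `title`."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",      # flag parameter: lead section only
        "explaintext": "",  # flag parameter: plain text instead of HTML
        "titles": title,
    }
    return API + "?" + urlencode(params)


def parse_intro(response_text):
    """Return the `extract` of the first page in the response, or None."""
    pages = json.loads(response_text)["query"]["pages"]
    page = next(iter(pages.values()))  # pages are keyed by page ID
    return page.get("extract")


def fetch_intro(title, user_agent="example-script/0.1 (placeholder contact)"):
    """Fetch and parse the intro; performs a network request."""
    request = urllib.request.Request(
        intro_url(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_intro(response.read().decode("utf-8"))
```

Calling `fetch_intro("java")` would return the cleaned introduction as a string, or `None` if the page has no extract.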

score 2 · Accepted Answer

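You can download the Wikipedia database directly and parse all pages to XML with Wiki Parser, which is a standalone application. The first paragraph is a separate node in the resulting XML.

Alternatively, you can extract the first paragraph from its plain-text output.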

score 2 · Accepted Answer

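You can use the extract_html field of the summary REST endpoint for this: e.g. https://en.wikipedia.org/api/rest_v1/page/summary/Cat.

Note: this endpoint aims to simplify the content a little by removing most of the pronunciation information, which in some cases mostly appears in parentheses.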
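A small Python sketch of working with the summary endpoint, assuming the response carries the `extract` and `extract_html` fields mentioned above. The helper names `summary_url` and `parse_summary` are hypothetical.

```python
import json
from urllib.parse import quote


def summary_url(title, wiki="https://en.wikipedia.org"):
    """Build the REST summary URL for `title`."""
    # Spaces become underscores; safe="" also percent-encodes "/" so that
    # titles containing slashes stay a single path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    return "{}/api/rest_v1/page/summary/{}".format(wiki, encoded)


def parse_summary(response_text):
    """Return the plain-text and HTML summaries from a response body."""
    data = json.loads(response_text)
    return {"text": data.get("extract"), "html": data.get("extract_html")}
```

`parse_summary` returns `None` for a field that is absent, so callers can fall back from the HTML version to the plain-text one.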

score 2 · Accepted Answer

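You can use jQuery to do that. First create the URL with the appropriate parameters. Check this link to see what the parameters mean. Then use the $.ajax() method to retrieve the articles. Note that Wikipedia does not allow cross-origin requests by default. That is why we use dataType: 'jsonp' in the request.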

var wikiURL = "https://en.wikipedia.org/w/api.php";
// opensearch takes search, limit and format; prop/rvprop belong to
// action=query and are not needed here.
wikiURL += '?' + $.param({
    'action' : 'opensearch',
    'search' : 'your_search_term',
    'format' : 'json',
    'limit' : 10
});

$.ajax({
    url: wikiURL,
    dataType: 'jsonp',
    success: function(data) {
        console.log(data);
    }
});

score 1 · Accepted Answer

Assuming keyword = "Batman" (the term you want to search for), use:

https://en.wikipedia.org/w/api.php?action=parse&page={{keyword}}&format=json&prop=text&section=0

to get the summary/first paragraph from Wikipedia in JSON format.
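Substituting the keyword into that URL and reading the result can be sketched in Python as follows. The function names are made up for illustration, and the parser assumes the default JSON format, where the rendered HTML sits under `parse.text["*"]` and a failed lookup returns a top-level `error` object.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def parse_section_url(keyword):
    """Build the action=parse URL for the lead section of `keyword`."""
    params = {
        "action": "parse",
        "page": keyword,  # urlencode takes care of spaces and punctuation
        "format": "json",
        "prop": "text",
        "section": 0,
    }
    return API + "?" + urlencode(params)


def section_html(response_text):
    """Return the rendered HTML of the section, or None on an API error."""
    data = json.loads(response_text)
    if "error" in data:
        return None  # e.g. the page does not exist
    return data["parse"]["text"]["*"]
```

The HTML returned this way still contains infoboxes and other markup, so the first paragraph has to be picked out of it afterwards.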

score 1 · Accepted Answer

Here is a program that dumps the French and English Wiktionary and Wikipedia:

import sys
import asyncio
import urllib.parse
from uuid import uuid4

import httpx
import found
from found import nstore
from found import bstore
from loguru import logger as log

try:
    import ujson as json
except ImportError:
    import json


# XXX: https://github.com/Delgan/loguru
log.debug("That's it, beautiful and simple logging!")


async def get(http, url, params=None):
    response = await http.get(url, params=params)
    if response.status_code == 200:
        return response.content

    log.error("http get failed with url and response: {} {}", url, response)
    return None



def make_timestamper():
    import time
    start_monotonic = time.monotonic()
    start = time.time()
    loop = asyncio.get_event_loop()

    def timestamp():
        # Wanna be faster than datetime.now().timestamp()
        # approximation of current epoch time.
        out = start + loop.time() - start_monotonic
        out = int(out)
        return out

    return timestamp


async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    # The wiki base URLs end with "/", so strip it to avoid "//w/api.php".
    url = "{}/w/api.php".format(wiki.rstrip("/"))
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            # Back off briefly before retrying so a failing endpoint is not hammered.
            await asyncio.sleep(1)
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue


async def wikimedia_html(http, wiki="https://en.wikipedia.org/", title="Apple"):
    # e.g. https://en.wikipedia.org/api/rest_v1/page/html/Apple
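    # Strip the trailing "/" of the base URL; safe="" also percent-encodes "/"
    # so titles containing slashes stay a single path segment.
    url = "{}/api/rest_v1/page/html/{}".format(wiki.rstrip("/"), urllib.parse.quote(title, safe=""))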
    out = await get(http, url)
    return wiki, title, out


async def save(tx, data, blob, doc):
    uid = uuid4()
    doc['html'] = await bstore.get_or_create(tx, blob, doc['html'])

    for key, value in doc.items():
        nstore.add(tx, data, uid, key, value)

    return uid


WIKIS = (
    "https://en.wikipedia.org/",
    "https://fr.wikipedia.org/",
    "https://en.wiktionary.org/",
    "https://fr.wiktionary.org/",
)

async def chunks(iterable, size):
    # chunk async generator https://stackoverflow.com/a/22045226
    while True:
        out = list()
        for _ in range(size):
            try:
                item = await iterable.__anext__()
            except StopAsyncIteration:
                yield out
                return
            else:
                out.append(item)
        yield out


async def main():
    # logging
    log.remove()
    log.add(sys.stderr, enqueue=True)

    # singleton
    timestamper = make_timestamper()
    database = await found.open()
    data = nstore.make('data', ('sourcery-data',), 3)
    blob = bstore.make('blob', ('sourcery-blob',))

    async with httpx.AsyncClient() as http:
        for wiki in WIKIS:
            log.info('Getting started with wiki at {}', wiki)
            # Polite limit @ https://en.wikipedia.org/api/rest_v1/
            async for chunk in chunks(wikimedia_titles(http, wiki), 200):
                log.info('iterate')
                coroutines = (wikimedia_html(http, wiki, title) for title in chunk)
                items = await asyncio.gather(*coroutines, return_exceptions=True)
                for item in items:
                    if isinstance(item, Exception):
                        msg = "Failed to fetch html on `{}` with `{}`"
                        log.error(msg, wiki, item)
                        continue
                    wiki, title, html = item
                    if html is None:
                        continue
                    log.debug(
                        "Fetch `{}` at `{}` with length {}",
                        title,
                        wiki,
                        len(html)
                    )

                    doc = dict(
                        wiki=wiki,
                        title=title,
                        html=html,
                        timestamp=timestamper(),
                    )

                    await found.transactional(database, save, data, blob, doc)


if __name__ == "__main__":
    asyncio.run(main())
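The title listing in the program above depends on the API's continuation mechanism. As a separate, simplified illustration (synchronous, and not the author's code), the loop can be sketched so that every value from the response's `continue` block is merged back into the next request. Here `fetch` is a hypothetical caller-supplied function that takes a dict of query parameters and returns the decoded JSON response.

```python
def iter_all_titles(fetch, extra_params=None):
    """Yield page titles from list=allpages, following `continue` blocks."""
    base = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
    }
    base.update(extra_params or {})
    request = dict(base)
    while True:
        response = fetch(request)
        for page in response["query"]["allpages"]:
            yield page["title"]
        if "continue" not in response:
            return  # no more batches
        # Start again from the original parameters and add every
        # continuation value (e.g. apcontinue) the API handed back.
        request = dict(base)
        request.update(response["continue"])
```

Keeping the HTTP call behind `fetch` makes the loop easy to exercise with canned responses before pointing it at a real wiki.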

Another way to get Wikimedia data is to rely on Kiwix ZIM dumps.

