python - How to get plain text out of Wikipedia

Question

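I'd like to write a script that gets only the description section of a Wikipedia page. That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?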

score 36 · Accepted Answer

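Here are a few different possible approaches; use whichever works for you. All my code examples below use requests for HTTP requests to the API; you can install requests with pip install requests if you have Pip. They also all use the MediaWiki API, and two of them use the query endpoint; see the MediaWiki API documentation if you need more detail.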

1. Get a plain text representation of either the entire page or a page "extract" straight from the API with the `extracts` prop

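Note that this approach only works on MediaWiki sites that have the TextExtracts extension installed. This notably includes Wikipedia, but not some smaller MediaWiki sites such as http://www.wikia.com/

You want to hit a URL like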

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext

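Breaking that down, we have the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):

action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
prop=extracts makes us use the TextExtracts extension
exintro limits the response to content before the first section heading
explaintext makes the extract in the response plain text instead of HTML

Then parse the JSON response and pull out the extract: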

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

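When the title comes from a chat command, the requested page may not exist, in which case the page entry carries a "missing" key and no "extract". A minimal sketch of a guard for that case follows; it works on the already-decoded JSON, and the helper name `intro_from_response` and both sample dicts are illustrative assumptions, not part of the API.

```python
def intro_from_response(data):
    """Return the plain-text extract from a decoded query response, or None.

    `data` is the dict produced by `.json()` on a prop=extracts request.
    """
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages that do not exist carry a "missing" key and no "extract".
        if "missing" in page:
            continue
        extract = page.get("extract", "").strip()
        if extract:
            return extract
    return None


# Hand-written samples in the shape the API returns with format=json.
found = {"query": {"pages": {"4456737": {
    "pageid": 4456737, "ns": 0, "title": "Bla Bla Bla",
    "extract": '"Bla Bla Bla" is the title of a song.'}}}}
missing = {"query": {"pages": {"-1": {
    "ns": 0, "title": "No such page xyz", "missing": ""}}}}

print(intro_from_response(found))
print(intro_from_response(missing))
```

Returning None instead of raising lets the chat bot decide what to say when nothing was found.
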
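2. Get the full HTML of the page using the `parse` endpoint, parse it, and extract the first paragraph

MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser such as lxml (install it first with pip install lxml) to extract the first paragraph.

For example: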

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

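One caveat with taking the very first `<p>`: on some pages it may be empty or hold something other than the intro. The sketch below returns the first paragraph that contains visible text. It deliberately uses the standard library's `html.parser` instead of lxml so that it has no dependencies; the class name, function name and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element that has visible text."""

    def __init__(self):
        super().__init__()
        self.in_p = False     # True while inside a <p> we are reading
        self.chunks = []      # text pieces of the current paragraph
        self.result = None    # first non-empty paragraph, once found

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.result is None:
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.chunks).strip()
            if text:
                self.result = text

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def first_paragraph(raw_html):
    """Return the text of the first non-empty paragraph, or None."""
    parser = FirstParagraph()
    parser.feed(raw_html)
    parser.close()
    return parser.result


sample_html = ('<div><p class="mw-empty-elt">\n</p>'
               '<p>"<b>Bla Bla Bla</b>" is a song.</p><p>Second.</p></div>')
print(first_paragraph(sample_html))
```

The same function can be fed the `raw_html` string obtained from the parse endpoint.
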
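3. Parse wikitext yourself

You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first with pip install mwparserfromhell), then reduce it to human-readable text using strip_code. strip_code doesn't work perfectly at the time of writing (as the example below clearly shows) but will hopefully improve.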

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links


Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino

score 23 · Accepted Answer

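Use the MediaWiki API, which runs on Wikipedia. You will have to do some parsing of the data yourself.

For instance:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Bla%20Bla%20Bla

means

fetch (action=query) the content (rvprop=content) of the most recent revision of the Bla Bla Bla page (titles=Bla%20Bla%20Bla) in JSON format (format=json).

You will probably want to search for the query and use the first result, to handle spelling errors and the like.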
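A minimal sketch of that search step, assuming the API's `list=search` module with its `srsearch` and `srlimit` parameters. The helper names and the sample response are illustrative; only the parameter-building and response-reading logic is shown, not the HTTP call itself.

```python
def search_params(term, limit=1):
    """Build query parameters for a full-text search (list=search)."""
    return {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
    }


def first_title(data):
    """Return the title of the top search hit, or None if nothing matched."""
    hits = data.get("query", {}).get("search", [])
    return hits[0]["title"] if hits else None


# Hand-written sample in the shape of a list=search response.
sample = {"query": {"search": [{"ns": 0, "title": "Bla Bla Bla"}]}}
print(search_params("bla bla bla")["srsearch"])
print(first_title(sample))
```

The dict from `search_params` can be passed as the query string of a request to https://en.wikipedia.org/w/api.php, and the title it yields can then be used in the content query.
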

score 12 · Accepted Answer

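You can get wiki data in text format. If you need information for several titles, you can fetch the wiki data for all of them in a single call. Use a pipe character (|) to separate the titles.

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

Here, this API call returns both Google's and Yahoo's data.

explaintext => return extracts as plain text instead of limited HTML.

exlimit=max (currently 20); otherwise only one result would be returned.

exintro => return only the content before the first section. If you want the full data, just remove it.

redirects= resolves redirects.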
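The multi-title call could be wrapped roughly as follows. This is a sketch under assumptions: the helper names and the sample response are invented for illustration, and continuation of long result sets is not handled.

```python
def batch_params(titles):
    """Build parameters that request intro extracts for several titles."""
    return {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": True,
        "explaintext": True,
        "exlimit": "max",
        "redirects": True,
        "titles": "|".join(titles),  # titles are separated by a pipe
    }


def extracts_by_title(data):
    """Map each returned page title to its extract, skipping missing pages."""
    pages = data.get("query", {}).get("pages", {})
    return {p["title"]: p["extract"] for p in pages.values() if "extract" in p}


# Hand-written sample in the shape of a prop=extracts response.
sample = {"query": {"pages": {
    "1": {"title": "Yahoo", "extract": "Yahoo text."},
    "2": {"title": "Google", "extract": "Google text."},
    "-1": {"title": "Nope", "missing": ""},
}}}
print(batch_params(["Yahoo", "Google"])["titles"])
print(extracts_by_title(sample))
```

Keying the result by title rather than by page id makes it easier to match answers back to what was asked for, although redirects can change the title that comes back.
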

score 7 · Accepted Answer

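You can get just the first section using the API:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvsection=0&titles=Bla%20Bla%20Bla&rvprop=content

This will give you raw wikitext, so you will have to deal with templates and markup.

Or you can fetch the whole page rendered as HTML, which has its own pros and cons as far as parsing goes:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Bla_Bla_Bla

I can't see an easy way to get the parsed HTML of the first section in a single call, but you can do it with two calls, by passing the wikitext you receive from the first URL back with text= in place of page= in the second URL.

UPDATE

Sorry, I neglected the "plain text" part of your question. Get the part of the article you want as HTML. It's much easier to strip HTML than to strip wikitext!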
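The two-call idea could be sketched as below. Only the parameter dicts and the hand-off between the calls are shown; the helper names and sample response are illustrative, and passing `title` and `contentmodel` alongside `text` is an assumption about what gives the parser enough context.

```python
def first_section_params(title):
    """Call 1: fetch the wikitext of section 0 (the lead) of `title`."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,
        "titles": title,
    }


def wikitext_from_response(data):
    """Pull the wikitext out of the decoded response to call 1."""
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]


def render_params(wikitext, title):
    """Call 2: ask action=parse to render the given wikitext as HTML."""
    return {
        "action": "parse",
        "format": "json",
        "prop": "text",
        "text": wikitext,          # used in place of page=
        "title": title,            # context for the parser (assumption)
        "contentmodel": "wikitext",
    }


# Hand-written sample in the shape of a prop=revisions response.
sample = {"query": {"pages": {"4456737": {
    "revisions": [{"*": "'''Bla Bla Bla''' is a song."}]}}}}
wikitext = wikitext_from_response(sample)
print(render_params(wikitext, "Bla Bla Bla")["text"])
```

Because the wikitext of a lead section can be long, sending the second call as a POST is probably safer than putting it in a URL.
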

score 4 · Accepted Answer

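DBPedia is the perfect solution for this problem. Here: http://dbpedia.org/page/Metallica, look at the perfectly organised data using RDF. You can query for anything at http://dbpedia.org/sparql using SPARQL, the query language for RDF. There's always a way to find the pageID in order to get descriptive text, but this should do for the most part.

There is a learning curve with RDF and SPARQL before you can write any useful code, but this is the perfect solution.

For example, a query run for Metallica returns an HTML table with the abstract in several different languages: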

<table class="sparql" border="1">
  <tr>
    <th>abstract</th>
  </tr>
  <tr>
    <td><pre>"Metallica is an American heavy metal band formed..."@en</pre></td>
  </tr>
  <tr>
    <td><pre>"Metallica es una banda de thrash metal estadounidense..."@es</pre></td>
...

SPARQL query:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbres: <http://dbpedia.org/resource/>

SELECT ?abstract WHERE {
 dbres:Metallica dbpedia-owl:abstract ?abstract.
}

Change "Metallica" to any resource name (the resource name as in wikipedia.org/resourcename) to query for its abstract.
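From Python, the same query could be issued and filtered to one language roughly like this. It is a sketch under assumptions: the endpoint is assumed to honour a `format` parameter asking for SPARQL JSON results, the helper names are made up, and the sample result is hand-written in the standard SPARQL JSON results shape. Full IRIs are used in the query so no PREFIX lines are needed.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"


def abstract_query(resource):
    """Build a SPARQL query for the abstracts of one DBpedia resource."""
    return (
        "SELECT ?abstract WHERE { "
        f"<http://dbpedia.org/resource/{resource}> "
        "<http://dbpedia.org/ontology/abstract> ?abstract . }"
    )


def query_url(resource):
    """Build the GET URL; the `format` value is an assumption about the endpoint."""
    params = {
        "query": abstract_query(resource),
        "format": "application/sparql-results+json",
    }
    return ENDPOINT + "?" + urlencode(params)


def abstract_in(results, lang="en"):
    """Pick the abstract with the given language tag from decoded results."""
    for row in results["results"]["bindings"]:
        cell = row["abstract"]
        if cell.get("xml:lang") == lang:
            return cell["value"]
    return None


# Hand-written sample in the SPARQL JSON results format.
sample = {"head": {"vars": ["abstract"]}, "results": {"bindings": [
    {"abstract": {"type": "literal", "xml:lang": "es",
                  "value": "Metallica es una banda de thrash metal estadounidense..."}},
    {"abstract": {"type": "literal", "xml:lang": "en",
                  "value": "Metallica is an American heavy metal band formed..."}},
]}}
print(abstract_in(sample))
```

Filtering by language tag on the client keeps the query itself as simple as the one above.
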

score 2 · Accepted Answer

Alternatively, you can load the raw text of any wiki page like this: https://bn.wikipedia.org/w/index.php?title=User:ShohagS&action=raw&ctype=text

Change bn to your wiki's language code and User:ShohagS to the page name. In your case, use: https://en.wikipedia.org/w/index.php?title=Bla_bla_bla&action=raw&ctype=text

In browsers, this will return a PHP-formatted text file.

score 1 · Accepted Answer

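I think a better option is to use the extracts prop that the MediaWiki API provides. It returns only a few tags (b, i, h#, span, ul, li) and removes tables, infoboxes, references, etc.

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Bla%20Bla%20Bla&format=xml gives you something very simple: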

<api><query><pages><page pageid="4456737" ns="0" title="Bla Bla Bla"><extract xml:space="preserve">
<p>"<b>Bla Bla Bla</b>" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, <i>L'Amour Toujours</i>. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with <i>L'Amour Toujours (I'll Fly With You)</i> in its US radio version.</p> <p></p> <h2><span id="Background_and_writing">Background and writing</span></h2> <p>He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song <i>"Why Did You Do It"</i>.</p> <h2><span id="Music_video">Music video</span></h2> <p>The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.</p> <h2><span id="Chart_performance">Chart performance</span></h2> <h2><span id="References">References</span></h2> <h2><span id="External_links">External links</span></h2> <ul><li>Full lyrics of this song at MetroLyrics</li> </ul>
</extract></page></pages></query></api>

Then you can run it through a regular expression; in JavaScript it would be something like this (you may have to make some minor modifications):

/^.*<\s*extract[^>]*\s*>\s*((?:[^<]*|<\s*\/?\s*[^>hH][^>]*\s*>)*).*<\s*(?:h|H).*$/.exec(data)

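Which gives you (only paragraphs, bold and italics):

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.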
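A rough Python counterpart, for readers who are not working in JavaScript. It is not a translation of the expression above: it simply cuts the extract at the first heading tag and then removes every remaining tag, so bold and italics are dropped too. The function name and sample string are illustrative.

```python
import html
import re


def intro_text(extract_html):
    """Return the text before the first heading, with all tags removed."""
    # Keep only what precedes the first <h1>..<h6> opening tag.
    lead = re.split(r"<h[1-6]\b", extract_html, maxsplit=1,
                    flags=re.IGNORECASE)[0]
    # Drop the remaining tags, decode entities, collapse whitespace.
    text = re.sub(r"<[^>]+>", "", lead)
    return " ".join(html.unescape(text).split())


sample = ('<p>"<b>Bla Bla Bla</b>" is a song by <i>Gigi D\'Agostino</i>.</p> '
          '<p></p> <h2><span id="Music_video">Music video</span></h2> '
          '<p>More.</p>')
print(intro_text(sample))
```

Regular expressions are only adequate here because the extract HTML is so limited; for arbitrary page HTML a real parser is the safer choice.
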

score 0 · Accepted Answer

You can try WikiExtractor: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

It works with Python 2.7 and 3.3+.

score 0 · Accepted Answer

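"...a script that gets the Wikipedia description section only..."

For your application, you might want to look at the dumps, e.g.: http://dumps.wikimedia.org/enwiki/20120702/

The particular files you need are the "abstract" XML files, e.g., this small one (22.7MB):

http://dumps.wikimedia.org/enwiki/20120702/enwiki-20120702-abstract19.xml

The XML has a tag called "abstract", which contains the first part of each article.

Otherwise wikipedia2text uses, e.g., w3m to download the page with templates expanded and formatted as text. From that you might be able to pick out the abstract with a regular expression.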
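Reading such an abstract file could look roughly like this. The sketch assumes each article sits in a `<doc>` element with `<title>` and `<abstract>` children; that layout, the sample data and the function name are assumptions to check against a real dump file. `iterparse` is used so the whole file never has to be in memory at once.

```python
import io
import xml.etree.ElementTree as ET


def iter_abstracts(xml_file):
    """Yield (title, abstract) pairs, one per <doc> element."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "doc":
            title = elem.findtext("title", default="")
            abstract = elem.findtext("abstract", default="")
            yield title, abstract
            elem.clear()  # release the finished element's children


# Tiny hand-written stand-in for a dump file (assumed layout).
sample = io.BytesIO(
    b"<feed>"
    b"<doc><title>Wikipedia: Bla Bla Bla</title>"
    b"<abstract>Bla Bla Bla is a song.</abstract></doc>"
    b"</feed>"
)
print(list(iter_abstracts(sample)))
```

For a real dump, pass an open file object (or the path) in place of the `BytesIO` sample.
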

score 0 · Accepted Answer

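First check here.

There is a lot of invalid syntax in MediaWiki's text markup. (Mistakes made by users...) Only MediaWiki can parse this hellish text. But there are still some alternatives to try in the link above. Not perfect, but better than nothing!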

score 0 · Accepted Answer

You can use the wikipedia package for Python, and specifically the content attribute of a given page.

From the documentation:

>>> import wikipedia
>>> print(wikipedia.summary("Wikipedia"))
# Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia supported by the non-profit Wikimedia Foundation...

>>> wikipedia.search("Barack")
# ['Barak (given name)', 'Barack Obama', 'Barack (brandy)', 'Presidency of Barack Obama', 'Family of Barack Obama', 'First inauguration of Barack Obama', 'Barack Obama presidential campaign, 2008', 'Barack Obama, Sr.', 'Barack Obama citizenship conspiracy theories', 'Presidential transition of Barack Obama']
>>> ny = wikipedia.page("New York")
>>> ny.title
# 'New York'
>>> ny.url
# 'http://en.wikipedia.org/wiki/New_York'
>>> ny.content
# 'New York is a state in the Northeastern region of the United States. New York is the 27th-most exten'...

score -1 · Accepted Answer

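It is also possible to consume Wikipedia pages through a wrapper API such as JSONpedia, which works both live (ask for the current JSON representation of a wiki page) and storage-based (query multiple pages previously ingested into Elasticsearch and MongoDB). The output JSON also includes the plain rendered page text.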

score -1 · Accepted Answer

You can try the BeautifulSoup HTML parsing library for Python, but you will have to write a simple parser.
