python - Can I use pywikipedia to get the text of a page?

Question

Is it possible, using pywikipedia, to get just the text of a page, without any internal links or templates, and without images, etc.?

score 5 · Accepted Answer

If you mean "I only want to get the wikitext", then have a look at the wikipedia.Page class and its get method.

import wikipedia

site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')

print page.get() # '''Test''', '''TEST''' or '''Tester''' may refer to:
#==Science and technology==
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...

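This way you get the complete, raw wikitext of the article.

If you want to strip out the wiki syntax, for example turning [[Concept inventory]] into Concept inventory, it is going to be a bit more painful.

The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it really hard to parse and to strip. I currently know of no software that lets you do this accurately. There is the MediaWiki Parser class, of course, but it is PHP, a bit hard to grasp, and its purpose is very different.

But if you only want to strip out links, or very simple wiki constructs, use regular expressions: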

import re

text = re.sub(r'\[\[([^\]|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

And then for piped links:

text = re.sub(r'\[\[(?:[^\]|]*)\|([^\]|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

And so on.
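The two substitutions can be chained into one helper. This is only a minimal sketch: strip_simple_links is a hypothetical name chosen for this illustration, and it assumes flat links with at most one pipe, so nested constructs such as image links with linked captions are not handled.

```python
import re

# [[target|label]] -> label
PIPED_LINK = re.compile(r'\[\[[^\]|]*\|([^\]|]*)\]\]')
# [[target]] -> target
PLAIN_LINK = re.compile(r'\[\[([^\]|]*)\]\]')


def strip_simple_links(wikitext):
    """Replace simple wiki links by their visible text (hypothetical helper)."""
    # Handle piped links first, then whatever plain links remain.
    wikitext = PIPED_LINK.sub(r'\1', wikitext)
    return PLAIN_LINK.sub(r'\1', wikitext)


print(strip_simple_links('Lorem [[ipsum]] dolor [[sit|SIT]] amet.'))
```

Keeping the patterns precompiled avoids recompiling them when the helper is applied to many pages.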

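But there is, for example, no reliable and easy way to strip nested templates from a page. The same goes for images that have links in their comments (captions). It is quite hard: it involves recursively removing the innermost link, replacing it with a marker, and starting over. Have a look at the templateWithParams function in wikipedia.py if you want, but it is not pretty.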
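To give a feel for the "innermost first, then start over" idea, here is a simplified sketch for templates. It is not the templateWithParams code from wikipedia.py: it simply deletes each innermost template instead of replacing it with a marker, the name strip_templates is hypothetical, and it assumes the text contains no stray single braces or triple-brace parameters.

```python
import re

# A template whose body contains no braces, i.e. an innermost template.
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')


def strip_templates(wikitext):
    """Delete innermost {{...}} blocks repeatedly until nothing changes."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        # Each pass peels off one level of nesting.
        wikitext = INNERMOST_TEMPLATE.sub('', wikitext)
    return wikitext


sample = 'Intro {{outer|{{inner|x}}|y}} outro'
print(repr(strip_templates(sample)))
```

The loop stops as soon as a pass leaves the text unchanged, so unbalanced braces are left in place rather than causing an endless loop.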

score 1 · Accepted Answer

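There is a module called mwparserfromhell on GitHub that can get you very close to what you want, depending on what you need. It has a method called strip_code() that strips out a lot of the markup.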

import pywikibot
import mwparserfromhell

test_wikipedia = pywikibot.Site('en', 'test')
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()

full = mwparserfromhell.parse(text)
stripped = full.strip_code()

print(full)
print('*******************')
print(stripped)

Comparing the two snippets:

{{db-foreign}}
<!--  Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.   

==Publication history==
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


*******************

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.   

Publication history
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat.

score 0 · Accepted Answer

Pywikibot is able to remove any wikitext or HTML tags. There are two functions in textlib:

removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:

Returns the text without the HTML tags, but keeps the text between the tags. For example:
```
from pywikibot import textlib
text = 'This is <small>small</small> text'
print(textlib.removeHTMLParts(text, keeptags=[]))
```
This will print:
```
 This is small text
```
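removeDisabledParts(text: str, tags=None, include=[], site=None) -> str:

Returns the text without the portions where wiki markup is disabled. That is, it removes the matching parts of the wikitext together with the text inside them. For example: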
```
from pywikibot import textlib
text = 'This is <small>small</small> text'
print(textlib.removeDisabledParts(text, tags=['small']))
```
This will print:
```
 This is  text
```
There are a lot of predefined tags that can be removed or kept, such as 'comment', 'header', 'link', 'template'.

The default value of the tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight'].

Some other examples:

textlib.removeDisabledParts('See [[this link]]', tags=['link']) gives 'See '

textlib.removeDisabledParts('', tags=['comment']) gives ''

textlib.removeDisabledParts('{{Infobox}}', tags=['template']) gives '', but this works only with Pywikibot 6.0.0 or later
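To make the difference between the two behaviours concrete without needing Pywikibot installed, here is a minimal stdlib-only sketch. The helpers strip_tags_keep_text and strip_tagged_parts are hypothetical names invented for this illustration; they are not Pywikibot functions and do not reproduce how textlib is implemented.

```python
import re


def strip_tags_keep_text(text):
    """Drop HTML-style tags but keep the text between them (illustration only)."""
    return re.sub(r'</?[A-Za-z][^>]*>', '', text)


def strip_tagged_parts(text, tag):
    """Drop a <tag>...</tag> element together with its content (illustration only)."""
    pattern = r'<{0}\b[^>]*>.*?</{0}>'.format(re.escape(tag))
    return re.sub(pattern, '', text, flags=re.DOTALL | re.IGNORECASE)


text = 'This is <small>small</small> text'
print(strip_tags_keep_text(text))         # This is small text
print(strip_tagged_parts(text, 'small'))  # This is  text
```

The first helper mirrors the "remove the tags, keep the text" behaviour, the second mirrors the "remove the whole part" behaviour; for real pages the textlib functions are the better choice because they know about wiki-specific constructs.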

score 0 · Accepted Answer

You can use wikitextparser. For example:

import pywikibot
import wikitextparser
en_wikipedia = pywikibot.Site('en', 'wikipedia')
text = pywikibot.Page(en_wikipedia,'Bla Bla Bla').get()
print(wikitextparser.parse(text).sections[0].plain_text())

will give you:

"Bla Bla Bla" is a song written and recorded by Italian DJ Gigi D'Agostino. It heavily samples the vocals of "Why did you do it?" by British band Stretch. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. It was sampled in the song "Jump" from Lupe Fiasco's 2017 album Drogas Light.
