
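I have a string containing information downloaded from a Wikia page.

In order to parse its contents, how would I strip all of the wiki formatting from the page and leave only the raw text?

Here is an example of what might appear: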

#REDIRECT[[Blah]]

{{
I have some stuff in here
}}
[[I also have some stuff in here|and here]]
[[http://blehthisisfake.com Link to a fake website]]

<span class="plainlinks">This is quite useless. Why was [[this page]] even created?</span>

<nowiki>There are more HTML tags, they should probably all be stripped...</nowiki>

There is random text in here. bleh bleh bleh

I'm not sure what single [brackets] do, but they should be stripped too...

Expected output:

There is random text in here. bleh bleh bleh

I'm not sure what single brackets do, but they should be stripped too...

Is there a module that can do this?


1 Answer


A Google search for "python wiki parser" turns up this code, which strips and replaces the tags (see the source code in the link for details).
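Since the linked code is not reproduced here, the following is only a rough sketch of the same strip-and-replace idea using Python's standard `re` module. It is not the linked code. The function name `strip_wiki_markup` and the specific patterns are my own assumptions, and regexes are not a real wiki parser, so unusual or deeply nested markup may slip through. Note also that this sketch keeps link labels and the text inside HTML tags, whereas the expected output in the question drops those lines entirely.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Rough regex-based removal of common wiki markup (hypothetical helper, not a full parser)."""
    # Drop redirect lines such as "#REDIRECT[[Blah]]".
    text = re.sub(r"(?im)^#REDIRECT.*$", "", text)
    # Drop templates {{ ... }}; loop so nested templates are removed inside-out.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # Bracketed URLs: [[http://x label]] or [http://x label] -> label (or nothing).
    text = re.sub(
        r"\[{1,2}https?://\S+?(?:\s+([^\]]*))?\]{1,2}",
        lambda m: m.group(1) or "",
        text,
    )
    # Internal links: [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Strip HTML-style tags but keep their inner text.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", text)
    # Remove any leftover single brackets, keeping the text inside them.
    text = re.sub(r"[\[\]]", "", text)
    # Collapse the runs of blank lines left behind by the removals.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


sample = """#REDIRECT[[Blah]]

{{
I have some stuff in here
}}
[[I also have some stuff in here|and here]]
[[http://blehthisisfake.com Link to a fake website]]

<span class="plainlinks">This is quite useless. Why was [[this page]] even created?</span>

<nowiki>There are more HTML tags, they should probably all be stripped...</nowiki>

There is random text in here. bleh bleh bleh

I'm not sure what single [brackets] do, but they should be stripped too...
"""

cleaned = strip_wiki_markup(sample)
print(cleaned)
```

If you would rather discard a linked or tagged line completely, as in the question's expected output, you could filter out lines that contain markup before cleaning the rest. For anything beyond simple pages, a dedicated parser such as mwparserfromhell (its `strip_code()` method) is likely to be more robust than regexes.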

Answered 2012-06-16T04:44:47.513