Is there a way to strip all MediaWiki markup "code" from a piece of text using C#?

For example, I have the following text:

<h2><span class="editsection">[<a href="/w/index.php?title=Roger_Zelazny&amp;action=edit&amp;section=1" title="Edit section: Biography">edit</a>]</span> <span class="mw-headline" id="Biography">Biography</span></h2>
<p>Roger Zelazny was born in <a href="/wiki/Euclid,_Ohio" title="Euclid, Ohio">Euclid, Ohio</a>, the only child of Polish immigrant Joseph Frank Zelazny and <a href="/wiki/Irish-American" title="Irish-American" class="mw-redirect">Irish-American</a> Josephine Flora Sweet. In high school, he became the editor of the school newspaper and joined the Creative Writing Club.<sup id="cite_ref-Roger_Zelazny_2009_0-0" class="reference">
<a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> In the fall of 1955, he began attending <a href="/wiki/Case_Western_Reserve_University" title="Case Western Reserve University">Western Reserve University</a> and graduated with a B.A. in English in 1959.<sup id="cite_ref-Roger_Zelazny_2009_0-1" class="reference"><a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> He was accepted to <a href="/wiki/Columbia_University" title="Columbia University">Columbia University</a> in New York and specialized in Elizabethan and Jacobean drama, graduating with an M.A. in 1962.<sup id="cite_ref-Roger_Zelazny_2009_0-2" class="reference">
<a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> His M.A. thesis was entitled <i>Two traditions and <a href="/wiki/Cyril_Tourneur" title="Cyril Tourneur">Cyril Tourneur</a>: an examination of morality and humor comedy conventions in</i> <a href="/wiki/The_Revenger%27s_Tragedy" title="The Revenger's Tragedy">The Revenger's Tragedy</a>. Between 1962 and 1969 he worked for the U.S. <a href="/wiki/Social_Security_Administration" title="Social Security Administration">Social Security Administration</a> in <a href="/wiki/Cleveland,_Ohio" title="Cleveland, Ohio" class="mw-redirect">Cleveland, Ohio</a> and then in <a href="/wiki/Baltimore,_Maryland" title="Baltimore, Maryland" class="mw-redirect">Baltimore, Maryland</a> spending his evenings writing science fiction.<sup id="cite_ref-Roger_Zelazny_2009_0-3" class="reference"><a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup><sup id="cite_ref-AndCall_1-0" class="reference"><a href="#cite_note-AndCall-1"><span>[</span>2<span>]</span></a></sup> 
He deliberately progressed from short-shorts to novelettes to novellas and finally to novel-length works by 1965.<sup id="cite_ref-Roger_Zelazny_2009_0-4" class="reference"><a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> On May 1, 1969, he quit to become a full-time writer, and thereafter concentrated on writing novels in order to maintain his income.<sup id="cite_ref-AndCall_1-1" class="reference"><a href="#cite_note-AndCall-1"><span>[</span>2<span>]</span></a></sup>
During this period, he was an active and vocal member of the Baltimore Science Fiction Society, whose members included writers <a href="/wiki/Jack_Chalker" title="Jack Chalker" class="mw-redirect">Jack Chalker</a> and <a href="/wiki/Joe_Haldeman" title="Joe Haldeman">Joe</a> and <a href="/wiki/Jack_Haldeman" title="Jack Haldeman" class="mw-redirect">Jack Haldeman</a> among others.</p>

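This HTML renders as:

[edit] Biography

Roger Zelazny was born in Euclid, Ohio, the only child of Polish immigrant Joseph Frank Zelazny and Irish-American Josephine Flora Sweet. In high school, he became the editor of the school newspaper and joined the Creative Writing Club.[1] In the fall of 1955, he began attending Western Reserve University and graduated with a B.A. in English in 1959.[1] He was accepted to Columbia University in New York and specialized in Elizabethan and Jacobean drama, graduating with an M.A. in 1962.[1] His M.A. thesis was entitled Two traditions and Cyril Tourneur: an examination of morality and humor comedy conventions in The Revenger's Tragedy. Between 1962 and 1969 he worked for the U.S. Social Security Administration in Cleveland, Ohio and then in Baltimore, Maryland spending his evenings writing science fiction.[1][2] He deliberately progressed from short-shorts to novelettes to novellas and finally to novel-length works by 1965.[1] On May 1, 1969, he quit to become a full-time writer, and thereafter concentrated on writing novels in order to maintain his income.[2] During this period, he was an active and vocal member of the Baltimore Science Fiction Society, whose members included writers Jack Chalker and Joe and Jack Haldeman among others.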

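I'm looking for a way to strip not only the HTML tags but also the references, wiki "links" and so on. I want to remove all of the formatting and "processing" done by Wikipedia and keep only the text...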

1 Answer

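Parsing the HTML won't get you very far, because it is nearly impossible to tell what is "content" and what isn't. What you need is a MediaWiki markup parser. There are dozens of them, but the canonical list on mediawiki.org doesn't (at the time of writing) appear to include any written in C#.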
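To illustrate how fragile the HTML route is, here is a minimal sketch that handles only the sample in the question. It assumes that edit links always sit in a `<span class="editsection">` and that citations always sit in a `<sup ... class="reference">`, exactly as in that sample; the class `WikiHtmlText` and the method `ToPlainText` are hypothetical names invented for this example. Anything the skin renders differently (tables, infoboxes, navigation boxes, nested markup) will slip through, which is why a real markup parser is the better option.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WikiHtmlText
{
    // Assumption: section edit links are wrapped in <span class="editsection">,
    // with no nested <span> inside, as in the question's sample.
    private static readonly Regex EditLinks = new Regex(
        @"<span\b[^>]*\bclass=""editsection""[^>]*>.*?</span>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Assumption: citation markers such as [1] are wrapped in
    // <sup ... class="reference">, with no nested <sup> inside.
    private static readonly Regex References = new Regex(
        @"<sup\b[^>]*\bclass=""reference""[^>]*>.*?</sup>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Any remaining tag; its inner text is kept.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Runs of whitespace, including line breaks.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        string text = EditLinks.Replace(html, string.Empty);  // drop "[edit]"
        text = References.Replace(text, string.Empty);        // drop "[1]", "[2]", ...
        text = AnyTag.Replace(text, string.Empty);            // unwrap links, <i>, <p>, ...
        text = WebUtility.HtmlDecode(text);                   // &amp; -> &
        return Whitespace.Replace(text, " ").Trim();          // normalize spacing
    }
}
```

Calling it on a shortened version of the sample:

```csharp
string html =
    "<p>Born in <a href=\"/wiki/Euclid,_Ohio\" title=\"Euclid, Ohio\">Euclid, Ohio</a>." +
    "<sup id=\"cite_ref-0\" class=\"reference\"><a href=\"#cite_note-0\">" +
    "<span>[</span>1<span>]</span></a></sup> Next &amp; last.</p>";

Console.WriteLine(WikiHtmlText.ToPlainText(html));
// prints: Born in Euclid, Ohio. Next & last.
```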

If you end up calling out to an external library, mwlib is probably the most mature.

answered 2012-09-20T06:31:21.747