
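Right, so I'm trying to remove some quotes from an xml file I downloaded from Wikipedia. So far the text looks like this (ignore the line breaks, they are only there to make it easier to read):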

'''Anarchism''' is a political philosophy that advocates stateless societies based on 
non-hierarchical free associations.<ref name="iaf-ifa.org"/><ref>"That is why 
Anarchy, when it works to destroy authority in all its aspects, when it demands
 the abrogation of laws and the abolition of the mechanism that serves to
 impose them, when it refuses all hierarchical organization and preaches free agreement - at the same time strives to maintain and enlarge the precious kernel of social customs without which
 no human or animal society can exist." Peter Kropotkin. http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin__Anarchism__its_philosophy_and_ideal.html
 Anarchism: its philosophy and ideal</ref><ref>"anarchists are opposed to irrational (e.g., illegitimate) 
authority, in other words, hierarchy - hierarchy being the institutionalisation of authority 
within a society." http://www.theanarchistlibrary.org/HTML/The_Anarchist_FAQ_Editorial_Collective__An_Anarchist_FAQ__03_17_.html#toc2 "B.1 
Why are anarchists against authority and hierarchy?" in An 
Anarchist FAQ</ref><ref>"ANARCHISM, a social philosophy that rejects
 authoritarian government and maintains that voluntary institutions are best
 suited to express man's natural social tendencies." George Woodcock. "Anarchism" at The Encyclopedia of Philosophy</ref><ref>"In a society developed on these lines, the voluntary 
associations which already now begin to cover all the fields of human activity
 would take a still greater extension so as to substitute themselves for the 
state in all its functions." http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html
 Peter Kropotkin. "Anarchism" from the Encyclopædia Britannica</ref> Anarchism holds the state
 to be undesirable, unnecessary, or harmful

All I want to get from this passage is:

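Anarchism is a political philosophy that advocates stateless societies based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, or harmful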

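As I see it, if I delete all the text between "<ref" and "/ref>", I should be able to capture all of the unwanted text and remove it. This is the code I have so far: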

        Dim temptext As String = newsrt.ToString
        Dim expression As New Regex("(?<=\<ref)[^/ref>]+(?=/ref>)")
        Dim resul As String = expression.Replace(temptext, "")

But this doesn't seem to work. None of the text between <ref and /ref> is captured and replaced with "".

Any help or advice would be great! Thanks.


1 Answer


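That's not how negated character classes work. Your class [^/ref>] matches one character at a time and simply disallows each of the single characters /, r, e, f and >; it does not mean "anything up to the string /ref>". Besides, you don't even want to exclude /ref> at all, because you also want to remove all the refs in between. You can simply use .* instead. You also don't need the lookarounds, because whatever they match is excluded from the match, and you do want to remove the tags themselves as well. So in your case it should be as simple as: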

"<ref.*/ref>"

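Since * is greedy, this will match from the first <ref all the way to the last /ref>. That is usually the big problem with greediness, but in your particular case it is exactly what you want.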
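To make the greedy behaviour concrete, here is a minimal sketch that compares the greedy pattern with a lazy variant. The sample string, the module name and the lazy pattern "<ref.*?/ref>" are assumptions added for illustration; they are not part of the original question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Hypothetical, shortened sample: two refs with wanted text between them.
        Dim sample As String = "Intro.<ref>one</ref> Middle.<ref>two</ref> End."

        ' Greedy: .* runs from the first <ref to the last /ref>,
        ' so the text between the two refs is removed as well.
        Dim greedy As String = Regex.Replace(sample, "<ref.*/ref>", "")

        ' Lazy: .*? stops at the nearest /ref>, so each ref is removed on its own.
        Dim lazy As String = Regex.Replace(sample, "<ref.*?/ref>", "")

        Console.WriteLine(greedy) ' Intro. End.
        Console.WriteLine(lazy)   ' Intro. Middle. End.
    End Sub
End Module
```

In the question's text the refs sit directly next to each other, so both variants would give the same result there. The lazy form is only worth considering if wanted text can appear between two refs.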

If the text contains line breaks, you will probably want to use RegexOptions.Singleline so that . also matches newline characters.
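Putting this together with the variable names from the question, a sketch of the corrected code could look like the following. The sample string stands in for newsrt.ToString and is an assumption, as is the module name; only the pattern and RegexOptions.Singleline come from the answer.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module StripRefs
    Sub Main()
        ' Hypothetical stand-in for newsrt.ToString: one ref that spans a line break.
        Dim temptext As String = "Free associations.<ref>first line" & vbLf & "second line</ref> Anarchism holds the state to be harmful"

        ' Without Singleline, . does not match vbLf, so this pattern finds nothing here.
        Dim unchanged As String = Regex.Replace(temptext, "<ref.*/ref>", "")

        ' With Singleline, . also matches vbLf, so the whole ref block is removed.
        Dim expression As New Regex("<ref.*/ref>", RegexOptions.Singleline)
        Dim resul As String = expression.Replace(temptext, "")

        Console.WriteLine(unchanged = temptext) ' True
        Console.WriteLine(resul) ' Free associations. Anarchism holds the state to be harmful
    End Sub
End Module
```

Note that on a whole article this pattern removes everything between the very first <ref and the very last /ref>, so it is probably safest to apply it to one passage at a time.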

answered 2013-08-23T09:16:53.427