java - Java: regex to remove wiki markup for lists

Question

I am reading a Wikipedia XML file and need to strip the markup from anything that is a list item. For example, given the following string:

String text = ": definition list\n"
        + "** some list item\n"
        + "# another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";

Here I want to remove the leading *, # and :, but not the colon inside the Category links. The output should look like this:

String outtext = "definition list\n"
        + "some list item\n"
        + "another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";

I am using the following code:

Pattern pattern = Pattern.compile("(^\\*+|#+|;|:)(.+)$");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String outtext = matcher.group(0);
    outtext = outtext.replaceAll("(^\\*+|#+|;|:)\\s", "");
    return(outtext);
}

This does not work. Can you point out what I should do instead?

score 0 · Accepted Answer

This should work:

text = text.replaceAll("(?m)^[*:#]+\\s*", "");

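The important part is the (?m) flag, which turns on MULTILINE mode. In that mode the ^ and $ anchors match at the start and end of every line instead of only at the start and end of the whole input.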

Output:

definition list
some list item
another list item
[[Category:1918 births]]
[[Category:2005 deaths]]
[[Category:Scottish female singers]]
[[Category:Billy Cotton Band Show]]
[[Category:Deaths from Alzheimer's disease]]
[[Category:People from Glasgow]]
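
For reference, the same idea can be sketched as a small self-contained class. This is only an illustration under a few assumptions: the class name WikiListMarkup and the method stripListMarkers are made up for this example, the ; marker is included because the pattern in the question lists it, and [ \t]* is used in place of \s* so that a line holding only markers does not swallow its own line break. Pattern.MULTILINE is the flag form of the inline (?m).

```java
import java.util.regex.Pattern;

public class WikiListMarkup {
    // Hypothetical helper pattern: a run of list markers at the start of a line,
    // followed by optional spaces or tabs. MULTILINE makes ^ match at every line start.
    private static final Pattern LIST_PREFIX =
            Pattern.compile("^[*#:;]+[ \\t]*", Pattern.MULTILINE);

    // Removes the leading list markers from every line of the given text.
    static String stripListMarkers(String text) {
        return LIST_PREFIX.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = ": definition list\n"
                + "** some list item\n"
                + "# another list item\n"
                + "[[Category:1918 births]]";
        // The colon inside [[Category:...]] is not at a line start, so it is kept.
        System.out.println(stripListMarkers(text));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which may matter when processing a large XML dump page by page.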
