
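Problem: I need to match content in a large text (a Wikipedia dump consisting of XML pages) in Java.
Content required: Infobox
Regex used: "\\{\\{Infobox(.*?)\\}\\}"

Issue: the pattern above stops at the first occurrence of }} inside the infobox. If I remove the ? character from the regex, the pattern runs to the last occurrence instead. However, I want to extract only the infobox, so the }} should match the end of the infobox.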

Example infobox:

{{infobox RPG
|title= Amber Diceless Roleplaying Game
|image= [[Image:Amber DRPG.jpg|200px]]
|caption= Cover of the main ''Amber DRPG'' rulebook (art by [[Stephen Hickman]])
|designer= [[Erick Wujcik]]
|publisher= [[Phage Press]]<br>[[Guardians of Order]]
|date= 1991
|genre= [[Fantasy]]
|system= Custom (direct comparison of statistics without dice)
|footnotes= 
}}

Code snippet:

String regex = "\\{\\{Infobox(.*?)\\}\\}";
Pattern p1 = Pattern.compile(regex, Pattern.DOTALL);
Matcher m1 = p1.matcher(xmlPage.getText());
String workgroup = "";
while(m1.find()){
    workgroup = m1.group();
}

4 Answers


You should try this regex:

String regex = "\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}";

or

Pattern pattern = Pattern.compile("\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}");

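Explanation: the regex above looks for:
1. \\{\\{ - matches the opening {{
2. [Ii]nfobox - matches Infobox or infobox
3. ([^\\}].*\\n+)* - matches the body of the infobox one line at a time; each line must begin with a character other than }, so the body ends at the first line that starts with }
----3.a. [^\\}] - matches a single character other than }
----3.b. .* - matches any character (except a line break) any number of times
----3.c. \\n+ - matches a newline one or more times
4. \\}\\} - matches the closing }}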
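
To make the line-by-line behaviour concrete, here is a minimal sketch that runs the pattern above on a shortened infobox. The class name, the helper method and the sample page are assumptions made for illustration; the nested {{Start date|1991}} template is also invented to show that templates in the middle of a line do not end the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineBasedInfoboxRegex {
    // The pattern from the answer, unchanged.
    static final Pattern INFOBOX =
            Pattern.compile("\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}");

    /** Returns the first infobox found in text, or null if nothing matches. */
    static String firstInfobox(String text) {
        Matcher m = INFOBOX.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample page; only the infobox should be printed.
        String page = "Intro text\n"
                + "{{infobox RPG\n"
                + "|title= Amber Diceless Roleplaying Game\n"
                + "|date= {{Start date|1991}}\n"
                + "|footnotes= \n"
                + "}}\n"
                + "Body text {{cite}}\n";
        System.out.println(firstInfobox(page));
    }
}
```

Note that this only works when the closing }} sits at the start of its own line and lines end with \n. An infobox written on a single line, such as {{infobox a }}, is not matched at all.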

answered 2013-11-07T18:34:07.353

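The solution depends on how deeply {{ .. }} blocks are nested inside the infobox block. If the inner blocks are not themselves nested, i.e. there are {{ ... }} blocks but no {{ .. {{ .. }} .. }} blocks, then you can try the regex: infobox([^\\{]*(\\{\\{[^\\}]*\\}\\})*.*?)\\}\\}

I tested this on the string "A {{ start {{infobox abc {{ efg }} hij }}end }} B" and was able to match " abc {{ efg }} hij ".

If the {{ .. }} blocks nest more deeply, a regex will not help, because you cannot tell the regex engine how large the inner blocks are. For that you need to count the opening {{ and closing }} sequences and extract the string that way. In other words, you are better off reading the text one character at a time and processing it.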
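
The counting approach described above could be sketched roughly as follows. This is an illustration, not a full wikitext parser: the class and method names are made up, and it assumes that every {{ opens a block and every }} closes one (so triple braces such as {{{param}}} are not treated specially).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InfoboxScanner {
    // Only used to locate where an infobox starts; the end is found by counting.
    private static final Pattern START = Pattern.compile("\\{\\{[Ii]nfobox");

    /** Returns every infobox block in text, matching {{ and }} by nesting depth. */
    static List<String> extractInfoboxes(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = START.matcher(text);
        int from = 0;
        while (from <= text.length() && m.find(from)) {
            int start = m.start();
            int depth = 0;
            int end = -1;
            int i = start;
            while (i + 1 < text.length()) {
                if (text.startsWith("{{", i)) {
                    depth++;            // one level deeper
                    i += 2;
                } else if (text.startsWith("}}", i)) {
                    depth--;            // one level back up
                    i += 2;
                    if (depth == 0) {   // this }} closes the infobox itself
                        end = i;
                        break;
                    }
                } else {
                    i++;
                }
            }
            if (end < 0) {
                break;                  // unbalanced braces: give up
            }
            result.add(text.substring(start, end));
            from = end;
        }
        return result;
    }
}
```

Because the depth counter tracks nesting explicitly, it does not matter how deeply templates are nested or how many infoboxes appear one after another.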

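Explanation of the regex

We start with infobox and then open the capturing group. Then we look for a run of characters that are not {.

After that we look for zero or more "groups" of the form {{ .. }} (with no nested blocks inside them). Nesting is not allowed here because we use [^\\}] to find the end of the block, which only permits characters other than } inside the block.

Finally, we accept any remaining characters before the closing }}.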
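
As a quick check of this walkthrough, the sketch below applies the regex to the test string quoted earlier in this answer. The class and method names are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlatNestedInfoboxRegex {
    // The regex from this answer, unchanged.
    static final Pattern INFOBOX =
            Pattern.compile("infobox([^\\{]*(\\{\\{[^\\}]*\\}\\})*.*?)\\}\\}");

    /** Returns capture group 1 (the infobox body) of the first match, or null. */
    static String body(String text) {
        Matcher m = INFOBOX.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // One level of inner blocks: the whole body is captured.
        System.out.println(body("A {{ start {{infobox abc {{ efg }} hij }}end }} B"));
        // Two levels of inner blocks: the capture stops too early.
        System.out.println(body("{{infobox a {{ b {{ c }} d }} e }}"));
    }
}
```

The first call returns " abc {{ efg }} hij ". The second returns " a {{ b {{ c }} d ", which is cut off before the real end of the infobox and illustrates why deeper nesting needs the counting approach.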

answered 2013-11-07T19:13:29.390

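If your xmlPage.getText() returns something like this:

{{infobox ... }}{{infobox .... {{ nested stuff }} }}{{infobox ... }}

that is, several infoboxes on the same level as well as nested content (and the nesting can be arbitrarily deep), then you cannot use a regex to parse the content. Why? Because the structure behaves like HTML or XML, so it is not a regular structure. You can find several answers on the topic of "regex and HTML" that explain this problem well, for example: Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

However, if you can guarantee that there will never be several infoboxes on the same level, only nested ones, then you can parse the document by removing the "?" (that is, by using the greedy .* instead of .*?).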
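
The difference between the two variants can be seen in a small sketch. The lazy pattern is the one from the question; the greedy one is the same pattern with the "?" removed. The class name, helper method and sample strings are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsLazy {
    // Lazy: stops at the first }} after the opening {{Infobox.
    static final Pattern LAZY =
            Pattern.compile("\\{\\{Infobox(.*?)\\}\\}", Pattern.DOTALL);
    // Greedy: runs to the last }} in the whole text.
    static final Pattern GREEDY =
            Pattern.compile("\\{\\{Infobox(.*)\\}\\}", Pattern.DOTALL);

    /** Returns the first full match of p in text, or null. */
    static String first(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String nested = "{{Infobox a {{ nested }} b }}";
        System.out.println(first(LAZY, nested));    // cut off after the inner }}
        System.out.println(first(GREEDY, nested));  // whole infobox

        String followed = "{{Infobox a }} text {{cite}}";
        System.out.println(first(GREEDY, followed)); // also swallows {{cite}}
    }
}
```

So the greedy form is only safe when the infobox's own }} is the last }} in the text being searched; any later template on the page is swallowed into the match.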

answered 2013-11-07T19:17:25.500
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlotExtractor {

    // Demo driver: extracts the slot values of a format string from a concrete input.
    public static void extractValuesTest(String[] args) {
        String slotPattern = "\\|(.*?)\\|";
        replaceTheslotsWithValues("selected card is |api:card_number| with |api:title|",
                "siddiselected card is 1234567 with dbs card", slotPattern);
        replaceTheslotsWithValues("selected card is |api:card_number| with |api:title|.",
                "selected card is 1234567 with dbs card.", slotPattern);
    }

    // Assumed helper (called but not shown in the original answer):
    // returns capture group 1 of every match of regex in text.
    static List<String> extractString(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // Reads the value of every |type:key| slot in payloadFormat out of receivedInput
    // by locating the literal text that surrounds each slot.
    public static Map<String, String> replaceTheslotsWithValues(
            String payloadFormat, String receivedInput, String slotPattern) {
        List<String> slots = extractString(payloadFormat, slotPattern);
        // Literal text between slots; literals[i] precedes slot i.
        String[] literals = payloadFormat.split(slotPattern);
        // LinkedHashMap keeps the values in slot order when printed.
        Map<String, String> parsedValues = new LinkedHashMap<>();
        StringBuilder replaced = new StringBuilder();
        int cursor = 0; // search position, so repeated literals resolve left to right

        for (int i = 0; i < slots.size(); i++) {
            // Slots look like "type:key"; only the key names the extracted value.
            String[] slotArray = slots.get(i).split(":");
            String slotKey = slotArray.length == 2 ? slotArray[1] : null;

            String before = i < literals.length ? literals[i] : "";
            String after = i + 1 < literals.length ? literals[i + 1] : "";

            int beforeAt = receivedInput.indexOf(before, cursor);
            int start = beforeAt + before.length();
            // With no literal after the slot, the value runs to the end of the input.
            int end = after.isEmpty() ? receivedInput.length() : receivedInput.indexOf(after, start);
            if (beforeAt < 0 || end < 0) {
                throw new IllegalArgumentException("input does not match the format string");
            }

            String extractedValue = receivedInput.substring(start, end);
            System.out.println("Extracted value is " + extractedValue);
            parsedValues.put(slotKey, extractedValue);
            replaced.append(before).append(extractedValue);
            cursor = end;
        }
        // Append the literal text that follows the last slot, if any.
        if (literals.length > slots.size()) {
            replaced.append(literals[slots.size()]);
        }
        System.out.println(replaced);
        System.out.println(parsedValues);
        return parsedValues;
    }
}
answered 2018-12-11T04:39:43.033