java - Recursive group-capturing regex with a backreference in Java

Question

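I am trying to capture multiple groups in a string recursively, using a backreference to a group in the regular expression. Even though I am using Pattern and Matcher with a "while(matcher.find())" loop, it still captures only the last instance instead of all of them. In my case the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, <poil>. Since these are formatting tags, I need to capture: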

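any text outside a tag (so that I can format it as "normal" text; I handle this by capturing any text before a tag in one group while capturing the tag itself in another group, and as I iterate I delete everything that was captured from the original string; if any text is left at the end, I format it as "normal" text)
the "name" of the tag, so that I know how to format the text inside it
the text content of the tag, which will be formatted according to the tag name and its associated rules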

Here is my sample code:

        String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
        String remainingText = currentText;

        //first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
        if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
        {                
            //an opening or closing tag has been found, so let us start our pattern captures
            //I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
            Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
            Matcher matcher1 = pattern1.matcher(currentText);                
            int iteration = 0;
            while(matcher1.find()){
                System.out.print("Iteration ");
                System.out.println(++iteration);
                System.out.println("group1:"+matcher1.group(1));
                System.out.println("group2:"+matcher1.group(2));
                System.out.println("group3:"+matcher1.group(3));
                System.out.println("group4:"+matcher1.group(4));

                if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
                {
                    m_xText.insertString(xTextRange, matcher1.group(1), false);
                    remainingText = remainingText.replaceFirst(matcher1.group(1), "");
                }
                if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
                {
                    switch (matcher1.group(2)) {
                        case "pof": [...]
                        case "pos": [...]
                        case "poif": [...]
                        case "po": [...]
                        case "poi": [...]
                        case "pol": [...]
                        case "poil": [...]
                        case "sm": [...]
                    }
                    remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", "");
                }
            }
        }

System.out.println produces output only once in my console, with the following result:

Iteration 1:
  group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>
  group2:poil
  group3:po
  group4:for out of man this one has been taken.”

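Group 3 is to be ignored; the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why is only the last tag instance, "poil", captured, and not the preceding "pof", "poi" and "po" tags?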

The output I would like to see is this:

Iteration 1:
  group1:the man said:
  group2:pof
  group3:po
  group4:“This one, at last, is bone of my bones

Iteration 2:
  group1:
  group2:poi
  group3:po
  group4:and flesh of my flesh;

Iteration 3:
  group1:
  group2:po
  group3:po
  group4:This one shall be called ‘woman,’

Iteration 4:
  group1:
  group2:poil
  group3:po
  group4:for out of man this one has been taken.”

score 1 · Accepted Answer

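I just found the answer to this question: the first capturing group simply needs a non-greedy quantifier, just like the one I already use in the fourth capturing group. The greedy (.*) first consumes the whole string and then backtracks only as far as it must, so it swallows every tag except the last one. With the change below it works exactly as needed: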

Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
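For anyone who wants to check this quickly, here is a minimal, self-contained sketch of the corrected pattern in a find() loop. The class name TagCaptureDemo and the helper methods parse and trailingText are made up for illustration, and the sample string is a simplified ASCII version of the one in the question. Reading the leftover text from matcher.end() is my own assumption about how the "remaining text" step could be handled; it is not what the question's code does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCaptureDemo {

    // Same pattern as in the answer: both (.*?) groups are reluctant (non-greedy).
    public static final Pattern TAGGED = Pattern.compile(
            "(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
            Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns one {textBeforeTag, tagName, tagContent} triple per match. */
    public static List<String[]> parse(String text) {
        List<String[]> segments = new ArrayList<>();
        Matcher m = TAGGED.matcher(text);
        while (m.find()) {
            // Group 3 is only a part of group 2, so it is skipped here.
            segments.add(new String[] { m.group(1), m.group(2), m.group(4) });
        }
        return segments;
    }

    /** Returns the text after the last match (the whole input if nothing matched). */
    public static String trailingText(String text) {
        Matcher m = TAGGED.matcher(text);
        int lastEnd = 0;
        while (m.find()) {
            lastEnd = m.end(); // index just past the closing tag of this match
        }
        return text.substring(lastEnd);
    }

    public static void main(String[] args) {
        // Simplified ASCII version of the sample text, plus some trailing text.
        String text = "the man said:<pof>This one, at last, is bone of my bones</pof>"
                + "<poi>and flesh of my flesh;</poi>"
                + "<po>This one shall be called woman,</po>"
                + "<poil>for out of man this one has been taken.</poil> (end)";

        for (String[] s : parse(text)) {
            System.out.println("before=[" + s[0] + "] tag=" + s[1] + " content=[" + s[2] + "]");
        }
        System.out.println("trailing=[" + trailingText(text) + "]");
    }
}
```

Taking the leftover text from matcher.end() avoids the replaceFirst calls in the question, which interpret their first argument as a regular expression and can misbehave when the captured text contains characters such as "(" or "?". Also note that inside [f|l|s|i|3] the "|" is a literal pipe character rather than alternation; it is harmless for these tags, but [flsi3] would be the tighter form.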
