java - Regex for multi-line strings with a special structure

Question

I am using Java and want to build two regular expressions for two different scenarios:

1:

STARTText blah, blah
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash

The block continues until the first line that no longer starts with a backslash.

2:

Now you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

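This block ends with an extra empty line, e.g. after 8978. In addition, I know that the block of lines with leading digits repeats 10 times and is then finished.

Filtering a single line is doable, but how do I handle multiple line breaks in between? Especially when I don't really know when or how the block ends, even for the first block. The search also has to cover the backslashes. So my approach is to have one closed expression, just a single one, that I can also use with replaceAll().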

score 1 · Accepted Answer

First regex:

Pattern regex = Pattern.compile(
    "^          # Start of line\n" +
    "STARTText  # Match this text\n" +
    ".*\\r?\\n  # Match whatever follows on the line plus (CR)LF\n" +
    "(?:        # Match...\n" +
    " ^\\\\     # Start of line, then a backslash\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")*         # Repeat as needed", 
    Pattern.MULTILINE | Pattern.COMMENTS);
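As a rough usage sketch (not part of the original answer), the first pattern can be written compactly without COMMENTS mode and tried with find() and replaceAll(). The class name, method names, and sample input below are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartBlockDemo {
    // Compact form of the commented pattern above (assumed equivalent):
    // a STARTText line, then zero or more lines that begin with a backslash.
    static final Pattern START_BLOCK = Pattern.compile(
        "^STARTText.*\\r?\\n(?:^\\\\.*\\r?\\n)*", Pattern.MULTILINE);

    /** Returns the first STARTText block in the input, or null if there is none. */
    static String firstBlock(String input) {
        Matcher m = START_BLOCK.matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Removes every STARTText block from the input. */
    static String stripBlocks(String input) {
        return START_BLOCK.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample input; "\\" is one literal backslash.
        String input =
            "header\n" +
            "STARTText blah, blah\n" +
            "\\    next line\n" +
            "\\    another line\n" +
            "footer\n";
        System.out.print(firstBlock(input));
        System.out.print(stripBlocks(input));
    }
}
```

Because every line in the pattern ends with \r?\n, a block whose last line has no trailing line break is only matched up to the previous line.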

Second regex:

Pattern regex = Pattern.compile(
    "(?:        # Match...\n" +
    " ^         # Start of line\n" +
    " \\d{4}\\b # Match exactly four digits\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")+         # Repeat as needed (at least once)", 
    Pattern.MULTILINE | Pattern.COMMENTS);
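A minimal sketch of how the second pattern might be used to collect each run of four-digit lines. The class and method names and the sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitBlockDemo {
    // Compact form of the commented pattern above (assumed equivalent):
    // one or more consecutive lines that start with exactly four digits.
    static final Pattern DIGIT_BLOCK = Pattern.compile(
        "(?:^\\d{4}\\b.*\\r?\\n)+", Pattern.MULTILINE);

    /** Returns every run of consecutive four-digit lines found in the input. */
    static List<String> digitBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = DIGIT_BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical input: two runs separated by an empty line.
        String input =
            "Now you will see the following links for the items:\n" +
            "1111 leading 4 digits and then some text\n" +
            "2565 leading 4 digits and then some text\n" +
            "\n" +
            "8978 leading 4 digits and then some text\n";
        for (String block : digitBlocks(input)) {
            System.out.println("--- block ---");
            System.out.print(block);
        }
    }
}
```

The empty line ends a run because it does not start with four digits, and the \b keeps a line that starts with five or more digits from matching.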

score 1 · Accepted Answer

Regex 1:

/^STARTText.*?(\r?\n)(?:^\\.*?\1)+/m

Live demo: http://www.rubular.com/r/G35kIn3hQ4

Regex 2:

/^.*?(\r?\n)(?:^\d{4}\s.*?\1)+/m

Live demo: http://www.rubular.com/r/TxFbBP1jLJ

Edit:

Java demo 1: http://ideone.com/BPNrm6

Regex 1 in Java:

(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+

Java demo 2: http://ideone.com/TQB8Gs

Regex 2 in Java:

(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+
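Since both Java versions carry the inline (?m) flag, they can be passed straight to String.replaceAll(), which is what the question asks for. The following is a small sketch under that assumption. The class name, helper method, and sample strings are invented for illustration.

```java
import java.util.regex.Matcher;

class BackrefDemo {
    // The two Java regexes from above. Group 1 captures the first line break,
    // and \1 requires every following line to end with the same sequence.
    static final String REGEX_1 = "(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+";
    static final String REGEX_2 = "(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+";

    /** Replaces every match of regex with the literal replacement text. */
    static String replaceBlocks(String input, String regex, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement from being
        // interpreted as group references or escapes.
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Hypothetical inputs, one with CRLF and one with LF line breaks.
        String first = "a\r\nSTARTText x\r\n\\ one\r\n\\ two\r\nb\r\n";
        String second = "intro\nNow you will see:\n1111 a\n2565 b\n\nend\n";
        System.out.print(replaceBlocks(first, REGEX_1, "[BLOCK]\r\n"));
        System.out.print(replaceBlocks(second, REGEX_2, ""));
    }
}
```

Note that the trailing + means regex 1 needs at least one backslash line after the STARTText line, and regex 2 also consumes the heading line directly above the digit lines.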

score 1 · Accepted Answer

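In both cases I use a zero-width lookahead assertion, (?=^[^\\]), so that the match keeps extending across the continuation lines and stops just before the first line that no longer has what I'm looking for.

(?= opens the zero-width lookahead, which requires what follows to be present but does not consume it
^[^\\] matches the start of a line followed by any character that is not a \
) closes the assertion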
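One consequence of the lookahead is worth seeing in code: because it needs a following line that does not start with a backslash, the last block in the input is skipped when nothing comes after it. The sketch below compares the lookahead as described with a variant that also accepts the end of input via \z. That variant, and all names and sample data here, are assumptions added for illustration and are not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    static final int FLAGS = Pattern.MULTILINE | Pattern.DOTALL;

    // Lookahead as described above: stop before a line not starting with '\'.
    static final Pattern STRICT =
        Pattern.compile("^([^\\\\].*?)(?=^[^\\\\])", FLAGS);

    // Assumed variant: additionally allow the match to stop at end of input.
    static final Pattern WITH_END =
        Pattern.compile("^([^\\\\].*?)(?=^[^\\\\]|\\z)", FLAGS);

    /** Collects group 1 of every match of p in input. */
    static List<String> blocks(Pattern p, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    /** Returns the end offset of the first match, or -1 if there is none. */
    static int firstEnd(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Hypothetical input with no trailing non-backslash line.
        String input = "STARTa\n\\ 1\nSTARTb\n\\ 2\n";
        System.out.println(blocks(STRICT, input).size());
        System.out.println(blocks(WITH_END, input).size());
        // The lookahead consumes nothing, so the first match ends exactly
        // where the next block begins.
        System.out.println(firstEnd(STRICT, input) == input.indexOf("STARTb"));
    }
}
```

With the \z variant, a single trailing line that has no backslash lines after it is also reported as a block, which may or may not be what you want.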

Part 1

This matches all of the text for part 1: the captured first line, followed by any number of lines that start with \.

^([^\\].*?)(?=^[^\\])


    Java Code Example:
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    class Module1{
      public static void main(String[] asd){
        // A Java string literal cannot span lines, so the lines are joined
        // with "\n"; each leading backslash is written as "\\".
        String sourcestring = "STARTFirstText blah, blah\n"
          + "\\    1next line with more text, but the leading backslash\n"
          + "\\    2next line with more text, but the leading backslash\n"
          + "\\    3next line with more text, but the leading backslash\n"
          + "STARTsecondText blah, blah\n"
          + "\\    4next line with more text, but the leading backslash\n"
          + "\\    5next line with more text, but the leading backslash\n"
          + "\\    6next line with more text, but the leading backslash\n"
          + "foo";
        // MULTILINE: ^ matches at every line start; DOTALL: . also matches line breaks
        Pattern re = Pattern.compile("^([^\\\\].*?)(?=^[^\\\\])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()){
          // group 0 is the whole match, group 1 the captured block
          for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
            System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
          }
          mIdx++;
        }
      }
    }

    $matches Array:
    (
        [0] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

        [1] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

    )

Part 2

This matches the first line, followed by several lines that start with a digit.

^([^\d].*?)(?=^[^\d])


Example

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    // A Java string literal cannot span lines, so the lines are joined with "\n".
    String sourcestring = "First you will see the following links for the items:\n"
      + "1111 leading 4 digits and then some text\n"
      + "2565 leading 4 digits and then some text\n"
      + "8978 leading 4 digits and then some text\n"
      + "\n"
      + "Second you will see the following links for the items:\n"
      + "2222 leading 4 digits and then some text\n"
      + "3333 leading 4 digits and then some text\n"
      + "4444 leading 4 digits and then some text";
    // MULTILINE: ^ matches at every line start; DOTALL: . also matches line breaks
    Pattern re = Pattern.compile("^([^\\d].*?)(?=^[^\\d])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()){
      // group 0 is the whole match, group 1 the captured block
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

$matches Array:
(
    [0] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

    [1] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

)

score 0 · Accepted Answer

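Use '\\' for a backslash, '\r\n|\r' for one line break, and '\d{4}' for 4 digits:

.*(\r\n|\r)

(your first "blah" line)

\\.*(\r\n|\r)

(your backslash lines)

((\d{4}.*(\r\n|\r))+(\r\n|\r))+

(your blocks of 4-digit lines, each ending with an empty line, the whole thing repeated with +)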
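The building blocks above might be assembled in Java roughly as follows. This is only a sketch under a few assumptions that go beyond the answer: it also accepts a bare \n as a line break, wraps the line-break alternation in an atomic group, anchors each digit line with ^, and uses invented class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlankLineBlocks {
    // One line break: CRLF, CR, or LF (LF is an added assumption). The atomic
    // group (?>...) stops the engine from splitting a CRLF into a CR line end
    // plus an LF that would then be mistaken for an empty line.
    static final String NL = "(?>\\r\\n|\\r|\\n)";

    // One block: one or more lines starting with four digits, then an empty line.
    static final String BLOCK = "(?:^\\d{4}.*" + NL + ")+" + NL;

    /** Counts the blank-line-terminated digit blocks in the input. */
    static int countBlocks(String input) {
        Matcher m = Pattern.compile(BLOCK, Pattern.MULTILINE).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Returns the text of exactly `times` consecutive blocks, or null. */
    static String consecutiveBlocks(String input, int times) {
        Pattern p = Pattern.compile(
            "(?:" + BLOCK + "){" + times + "}", Pattern.MULTILINE);
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with two blocks; the question expects ten.
        String input = "intro:\n1111 a\n2565 b\n\n2222 c\n\ntail\n";
        System.out.println(countBlocks(input));
        System.out.print(consecutiveBlocks(input, 2));
    }
}
```

Passing 10 as `times` would correspond to the question's statement that the digit block repeats ten times.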
