java - Regex to split chapter titles

Question

I need to split chapter titles into the title number and the title name. The chapter titles look like this:

some long text
    3.7.2 sealant durability 
     paragraph with text        // (.*)
    3.7.3 funkční schopnost
     paragraph with text...
    3.1.13 plastic sealant xx 21    
     paragraph with text
    3.1.14 plastic sealant 
    xx 21   
     paragraph with text
    3.7.12 sealant durability
     paragraph with text
    3.7.325 funkční schopnost

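Edit: The problem is that the values described above sit inside a long text that is full of special characters.

I used the following code, but it splits off only the last number after the last dot. When I add a "+" after the last "\d", it throws an error. What is the correct regex for this problem?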

title.trim().split("(?<=(\\d\\.\\d{1,2}\\.[\\d]))")

Expected output:

splitedValue[0] : '3.7.2'
splitedValue[1] : 'sealant durability'
...
splitedValue[0] : '3.1.14'
splitedValue[1] : 'plastic sealant xx 21'
...


score 4 · Accepted Answer

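Is there a reason you can't use indexOf(' ') to find the first whitespace character and then take the substring on either side? That would probably be easier to work with, both for you now and for whoever reads the code five years from now.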
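A minimal sketch of the indexOf approach, assuming a single-line title whose number and name are separated by a space. The class and method names (TitleSplitter, splitTitle) are made up for illustration and do not come from the question.

```java
public class TitleSplitter {

    /** Splits a title such as "3.7.2 sealant durability" into number and name. */
    public static String[] splitTitle(String title) {
        String trimmed = title.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            // No space at all: treat the whole string as the number, with an empty name.
            return new String[] { trimmed, "" };
        }
        String number = trimmed.substring(0, space);
        String name = trimmed.substring(space + 1).trim();
        return new String[] { number, name };
    }

    public static void main(String[] args) {
        String[] splitedValue = splitTitle("    3.7.2 sealant durability ");
        System.out.println("splitedValue[0] : '" + splitedValue[0] + "'");
        System.out.println("splitedValue[1] : '" + splitedValue[1] + "'");
    }
}
```

Note that indexOf(' ') only looks for the space character. If the separator may be a tab or a line break, a whitespace-based split is the safer choice.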

score 2 · Accepted Answer

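Using split is a poorer fit for your case than a precompiled regex with one group for the number and one for the title. Here is a snippet that parses your test cases: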

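import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChapterTitleParser {
    public static void main(String[] args) {
        // Group 1: the dotted number; group 2: everything after the whitespace.
        // DOTALL lets '.' match the line break inside the wrapped title.
        Pattern pattern = Pattern.compile("([\\d\\.]+)\\s+(.*)", Pattern.MULTILINE | Pattern.DOTALL);

        List<String> input = Arrays.asList(
                "3.7.2 sealant durability",
                "3.7.3 funkční schopnost",
                "3.1.14 plastic sealant xx 21",
                "3.1.14 plastic sealant\n" +
                        "xx 21",
                "3.7.12 sealant durability",
                "3.7.325 funkční schopnost");

        for (String s : input) {
            Matcher matcher = pattern.matcher(s);
            System.out.println("Input:" + s);
            // matches() requires the whole string to fit the pattern.
            if (matcher.matches()) {
                System.out.println("Number: " + matcher.group(1));
                System.out.println("Title: '" + matcher.group(2)+"'");
            }
            System.out.println();
        }
    }
}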

score 1 · Accepted Answer

You can try this regex:

 *(\d+(\.\d+)*) (\p{L}+( \p{L}+)*)

\p{L} denotes the Unicode category of letters. In addition, you should keep the Pattern in a constant so the expression is not recompiled on every use, like this:

private static final Pattern REGEX_PATTERN = 
        Pattern.compile(" *(\\d+(\\.\\d+)*) (\\p{L}+( \\p{L}+)*)");

public static void main(String[] args) {
    String input = "    3.7.2 sealant durability \n     paragraph with text        // (.*)\n    3.7.3 funkční schopnost\n     paragraph with text...\n    3.1.13 plastic sealant xx 21    \n     paragraph with text";

    Matcher matcher = REGEX_PATTERN.matcher(input);
    while (matcher.find()) {
        System.out.println(matcher.group(1)); // Chapter
        System.out.println(matcher.group(3)); // Title
    }
}

Use matcher.find() instead of split().

Output:

3.7.2
sealant durability
3.7.3
funkční schopnost
3.1.13
plastic sealant xx
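The last title in that output loses its trailing "21" because \p{L} matches letters only. One possible variant is sketched below, assuming every heading sits on a single line and is the only kind of line that starts with a dotted number; it captures the rest of the line instead. The names HeadingFinder, HEADING and findHeadings are hypothetical, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingFinder {

    // Assumption: a heading is optional indentation, a dotted number with at
    // least one dot (e.g. 3.7.2), spaces or tabs, then the title up to the line end.
    // MULTILINE makes ^ and $ match at every line boundary.
    private static final Pattern HEADING = Pattern.compile(
            "^[ \\t]*(\\d+(?:\\.\\d+)+)[ \\t]+(\\S.*?)[ \\t]*$", Pattern.MULTILINE);

    /** Returns one {number, title} pair per heading line found in the text. */
    public static List<String[]> findHeadings(String text) {
        List<String[]> headings = new ArrayList<>();
        Matcher matcher = HEADING.matcher(text);
        while (matcher.find()) {
            headings.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return headings;
    }

    public static void main(String[] args) {
        String input = "    3.7.2 sealant durability \n"
                + "     paragraph with text        // (.*)\n"
                + "    3.1.13 plastic sealant xx 21    \n"
                + "     paragraph with text";
        for (String[] heading : findHeadings(input)) {
            System.out.println(heading[0]); // Chapter
            System.out.println(heading[1]); // Title
        }
    }
}
```

This sketch does not handle a title that wraps onto the next line, such as "3.1.14 plastic sealant" followed by "xx 21": the continuation line cannot be told apart from ordinary paragraph text by the pattern alone.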

score 0 · Accepted Answer

As @EricStein pointed out, finding the first occurrence of a space is a good idea. You can also try a more flexible approach, like this:

String name = "3.7.2 sealant durability";
System.out.println(name.split("\\s+", 2)[1]);

sealant durability

More generally, to match your expected output:

String[] splitedValue = name.split("\\s+", 2);
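Here is a slightly fuller sketch of the same idea, assuming the title string has already been isolated from the surrounding paragraphs. It trims first, because leading whitespace would otherwise produce an empty first element, and it collapses line breaks so that a wrapped title comes out as 'plastic sealant xx 21'. The class and method names are illustrative only.

```java
public class SplitWithLimit {

    /** Returns {number, name}; the name is empty if the title has no whitespace. */
    public static String[] splitTitle(String title) {
        // Limit 2: split only at the first run of whitespace.
        String[] parts = title.trim().split("\\s+", 2);
        String number = parts[0];
        // Collapse internal whitespace, including line breaks, to single spaces.
        String name = parts.length > 1 ? parts[1].replaceAll("\\s+", " ") : "";
        return new String[] { number, name };
    }

    public static void main(String[] args) {
        String[] splitedValue = splitTitle("    3.1.14 plastic sealant \n    xx 21   ");
        System.out.println("splitedValue[0] : '" + splitedValue[0] + "'");
        System.out.println("splitedValue[1] : '" + splitedValue[1] + "'");
    }
}
```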
