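I am trying to use String.split in Java to split an entire document into substrings at tabs, spaces, and newlines, but I want to exclude cases where the words sit between quotes.

Example:

This file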

CATEGORYTYPE1
{
    CATEGORYSUBTYPE1
    {
        OPTION1 "ABcd efg1234"
        OPTION2 ABCdefg12345
        OPTION3 15
    }
    CATEGORYSUBTYPE2
    {
        OPTION1 "Blah Blah 123"
        OPTION2 Blah
        OPTION3 10
        OPTION4 "Blah"
    }
}

is split into these substrings (as shown in the Eclipse debugger):

[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, "ABcd, efg1234", OPTION2....

when I use my current regular expression:

    String regex = "([\\n\\r\\s\\t]+)";

    String[] tokens = data.split(regex);

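But what I want to achieve is to split it like this:

[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, "ABcd efg1234", OPTION2....

(without splitting what is between the quotes)

Is this possible with a regular expression? If so, how?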

3 Answers

Here is one way of doing this:

String str = "CATEGORYTYPE1\n" + 
"{\n" + 
"    CATEGORYSUBTYPE1\n" + 
"    {\n" + 
"        OPTION1 \"ABcd efg1234\"\n" + 
"        OPTION2 ABCdefg12345\n" + 
"        OPTION3 15\n" + 
"    }\n" + 
"    CATEGORYSUBTYPE2\n" + 
"    {\n" + 
"        OPTION1 \"Blah Blah 123\"\n" + 
"        OPTION2 Blah\n" + 
"        OPTION3 10\n" + 
"        OPTION4 \"Blah\"\n" + 
"    }\n" + 
"}\n";

String[] arr = str.split("(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+");
System.out.println(Arrays.toString(arr));

// OUTPUT
[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, "ABcd efg1234", OPTION2, ABCdefg12345, ...

Explanation: the pattern matches a run of whitespace (\s+, which covers spaces, tabs, and newlines) only when it is followed by an EVEN number of double quotes (") up to the end of the input. Whitespace between a pair of double quotes is followed by an odd number of quotes, so it is NOT used as a split point; whitespace outside quotes is followed by an even number of quotes, so it is matched.
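To make the even-quote rule concrete, here is a minimal, self-contained sketch that applies the same regex to a short input. The class name QuoteAwareSplit and the helper tokenize are hypothetical names chosen for illustration; only the pattern itself comes from the answer above.

```java
import java.util.Arrays;

// Hypothetical wrapper class, used only to make the snippet runnable.
class QuoteAwareSplit {

    // Same pattern as in the answer: split on whitespace that is followed
    // by an even number of double quotes up to the end of the input.
    static String[] tokenize(String data) {
        return data.split("(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+");
    }

    public static void main(String[] args) {
        String data = "OPTION1 \"ABcd efg1234\"\n    OPTION2 ABCdefg12345\n";
        // The space inside the quotes is followed by one quote (odd), so it is kept.
        System.out.println(Arrays.toString(tokenize(data)));
        // prints [OPTION1, "ABcd efg1234", OPTION2, ABCdefg12345]
    }
}
```

One caveat worth checking against your own data: because the pattern uses [^"]+, an empty quoted string ("") is not paired up correctly. Using [^"]* in its place is a common adjustment if such values can occur.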

answered 2013-05-20T19:53:50.997

answered 2013-05-20T19:27:46.140

I know I joined the party rather late, but if you are looking for a fancy regex that "understands" escaped quotes (\") as well, this one should work for you:

Pattern p = Pattern.compile("(\\S*?\".*?(?<!\\\\)\")+\\S*|\\S+");
Matcher m = p.matcher(str);
while (m.find()) { ... }

It will also parse something like this:
ab "cd \"ef\" gh" ij "kl \"no pq\"\" rs"
into:
ab, "cd \"ef\" gh", ij, "kl \"no pq\"\" rs"
(it does not get confused by the odd number of escaped quotes, \").

(Probably irrelevant, but this one will also "understand" " in the middle of a string, so it will parse this: ab c" "d ef to: ab, c" "d, ef - not that such a pattern is likely to emerge.)
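For anyone who wants to try the matcher-based approach directly, the following is a small runnable sketch. The class name QuotedTokenizer and the method tokenize are hypothetical, and collecting the matches into a list is just one way to fill in the loop body; the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, used only to make the snippet runnable.
class QuotedTokenizer {

    // Pattern from the answer: either a token containing one or more quoted
    // sections (a closing quote must not be preceded by a backslash),
    // or a plain run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("(\\S*?\".*?(?<!\\\\)\")+\\S*|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // the whole matched token, quotes included
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java literal for: ab "cd \"ef\" gh" ij "kl \"no pq\"\" rs"
        String input = "ab \"cd \\\"ef\\\" gh\" ij \"kl \\\"no pq\\\"\\\" rs\"";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Two things to keep in mind with this sketch: the . in the pattern does not match line breaks by default, so a quoted value is assumed to stay on one line, and the lookbehind treats any quote preceded by a backslash as escaped, including one that follows an escaped backslash (\\").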

Anyway, you can also take a look at this short demo.

answered 2013-05-20T22:28:34.000