java - Split a string containing quotes on a delimiter

Question

I want to split a string using a space as the delimiter, but quoted substrings should be handled intelligently. For example, for a string like

"John Smith" Ted Barry

it should return three strings: John Smith, Ted, and Barry.

score 10 · Accepted Answer

After messing around with it, I found you can use a regex for this. Run the equivalent of "match all" on:

((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))

Java example:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{ 
    public static void main(String[] args)
    {
        String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
        Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
        Matcher m = p.matcher(someString);

        while(m.find()) {
            System.out.println("'" + m.group() + "'");
        }
    }
}

Output:

'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'

A breakdown of the regex, with the example used above, can be viewed here:

http://regex101.com/r/wM6yT9

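That said, regex shouldn't be the solution to everything; I was just playing around. This example has many edge cases, such as Unicode characters and symbols, which `\w` does not match by default. For this kind of task you are better off with a tried and tested library. Look at the other answers before using this one.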
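To illustrate the edge-case point, here is a minimal sketch of a looser pattern that matches either a double-quoted run or a run of non-whitespace, non-quote characters, so it does not depend on `\w`. It is not the regex from this answer; the class name `QuotedRegexSplitter` and the method `split` are made up for the example, and escaped quotes are assumed not to occur.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the answer above.
class QuotedRegexSplitter {
    // Alternative 1: a double-quoted run (its content is captured in group 1).
    // Alternative 2: a run of characters that are neither whitespace nor quotes.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|[^\\s\"]+");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(1) is non-null only when the quoted alternative matched.
            tokens.add(m.group(1) != null ? m.group(1) : m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [John Smith, Ted, Barry]
        System.out.println(split("\"John Smith\" Ted Barry"));
    }
}
```

Note that an unbalanced quote is silently skipped by this pattern rather than reported, which may or may not be what you want.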

score 4 · Accepted Answer

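Try this ugly bit of code.

    // Requires: import java.util.ArrayList; import java.util.List;
    String str = "hello my dear \"John Smith\" where is Ted Barry";
    List<String> resultList = new ArrayList<String>();
    StringBuilder builder = new StringBuilder();
    boolean inQuotes = false;
    for (String s : str.split("\\s")) {
        if (!inQuotes && s.startsWith("\"")) {
            inQuotes = true;       // opening quote: start collecting words
            s = s.substring(1);
        }
        if (inQuotes && s.endsWith("\"")) {
            inQuotes = false;      // closing quote: the phrase is complete
            s = s.substring(0, s.length() - 1);
        }
        builder.append(s);
        if (inQuotes) {
            builder.append(" ");   // more words of the quoted phrase follow
        } else {
            resultList.add(builder.toString());
            builder.setLength(0);  // reset for the next token
        }
    }
    System.out.println(resultList);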

score 3 · Accepted Answer

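OK, I wrote a small snippet that does what you want and a bit more. Since you didn't specify any further conditions, I didn't go to the trouble of handling them. I know this is a dirty approach and you can probably get better results with something ready-made. But for the fun of programming, here is an example: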

    String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
    int wordQuoteStartIndex=0;
    int wordQuoteEndIndex=0;

    int wordSpaceStartIndex = 0;
    int wordSpaceEndIndex = 0;

    boolean foundQuote = false;
    for(int index=0;index<example.length();index++) {
        if(example.charAt(index)=='\"') {
            if(foundQuote==true) {
                wordQuoteEndIndex=index+1;
                //Print the quoted word
                System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
                foundQuote=false;
                if(index+1<example.length()) {
                    wordSpaceStartIndex = index+1;
                }
            }else {
                wordSpaceEndIndex=index;
                if(wordSpaceStartIndex!=wordSpaceEndIndex) {
                    //print the word in spaces
                    System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
                }
                wordQuoteStartIndex=index;
                foundQuote = true;
            }
        }

        if(foundQuote==false) {
            if(example.charAt(index)==' ') {
                wordSpaceEndIndex = index;
                if(wordSpaceStartIndex!=wordSpaceEndIndex) {
                    //print the word in spaces
                    System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
                }
                wordSpaceStartIndex = index+1;
            }

            if(index==example.length()-1) {
                if(example.charAt(index)!='\"') {
                    //print the word in spaces
                    System.out.println(example.substring(wordSpaceStartIndex, example.length()));
                }
            }
        }
    }

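This also picks up words that are not separated by a space from the quotes before or after them, such as the word "hello" before "John Smith" and after "Basi German".

When the string is changed to "John Smith" Ted Barry, the output is three strings: 1) "John Smith" 2) Ted 3) Barry

The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello and it prints 1) hello 2) "John Smith" 3) Ted 4) Barry 5) lol 6) "Basi German" 7) hello

Hope this helps.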
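The same character-by-character idea can be sketched more compactly by tracking a single "inside quotes" flag and collecting tokens into a list instead of printing them. This is a rough sketch rather than a rewrite of the snippet above: the class name `QuoteAwareScanner` is made up, the quotes are stripped from the output, and escaped quotes are assumed not to occur.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are illustrative, not from the answer above.
class QuoteAwareScanner {
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // A quote ends whatever token was being built.
                // Quoted tokens are kept even when empty; bare ones are not.
                if (inQuotes || current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                inQuotes = !inQuotes;
            } else if (c == ' ' && !inQuotes) {
                // A space outside quotes ends the current bare word, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the trailing token (this also covers an unterminated quote).
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, John Smith, Ted, Barry, lol, Basi German, hello]
        System.out.println(split("hello\"John Smith\" Ted Barry lol\"Basi German\"hello"));
    }
}
```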

score 1 · Accepted Answer

commons-lang has a StrTokenizer class that does this for you, and there is also the java-csv library.

Example using StrTokenizer:

String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
   System.out.println(token);
}

Output:

John Smith
Ted
Barry
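If pulling in a dependency is not an option, the JDK's own `java.io.StreamTokenizer` can be configured to behave in a similar way. This is a different technique from StrTokenizer, shown only as a sketch; the class name `StreamTokenizerSplitter` is made up, and note that StreamTokenizer treats a backslash inside quotes as an escape character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of commons-lang.
class StreamTokenizerSplitter {
    static List<String> split(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();            // drop the defaults (number parsing, '/' comments)
        st.wordChars(0x21, 0xFF);    // every visible character is part of a word...
        st.whitespaceChars(0, ' ');  // ...control characters and space separate words...
        st.quoteChar('"');           // ...and double quotes delimit a single token

        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                // sval holds the text for both plain words and quoted strings.
                if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"') {
                    tokens.add(st.sval);
                }
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [John Smith, Ted, Barry]
        System.out.println(split("\"John Smith\" Ted Barry"));
    }
}
```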

score 1 · Accepted Answer

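Here is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in a comment). It can handle Unicode. It strips all extra whitespace (even inside quotes), which may be good or bad depending on your needs. Escaped quotes are not supported.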

private static String[] parse(String param) {
  String[] output;

  param = param.replaceAll("\"", " \" ").trim();
  String[] fragments = param.split("\\s+");

  int curr = 0;
  boolean matched = fragments[curr].matches("[^\"]*");
  if (matched) curr++;

  for (int i = 1; i < fragments.length; i++) {
    if (!matched)
      fragments[curr] = fragments[curr] + " " + fragments[i];

    if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
      matched = false;
    else {
      matched = true;

      if (fragments[curr].matches("\"[^\"]*\""))
        fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();

      if (fragments[curr].length() != 0)
        curr++;

      if (i + 1 < fragments.length)
        fragments[curr] = fragments[i + 1];
    }
  }

  if (matched) { 
    return Arrays.copyOf(fragments, curr);
  }

  return null; // Parameter failure (double-quotes do not match up properly).
}
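To see why extra whitespace disappears even inside quotes, it helps to look at what the first two statements of parse produce on their own. The sketch below isolates just those two statements; the class name `FragmentDemo` and the method `fragments` are made up for the illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; only the two statements inside fragments() come from parse().
class FragmentDemo {
    static String[] fragments(String param) {
        // Surround every quote with spaces so it becomes a fragment of its own...
        param = param.replaceAll("\"", " \" ").trim();
        // ...then split on runs of whitespace, which discards the extra spaces.
        return param.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [asjdhj, sdf, ffhj, ", fdsf, fsdjh, "]
        System.out.println(Arrays.toString(fragments("asjdhj    sdf ffhj \"fdsf   fsdjh\"")));
    }
}
```

The loop in parse then glues the fragments between a pair of lone quote fragments back together with single spaces, which is where the original spacing inside the quotes is lost.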

Sample input for comparison:

"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd


asjdhj    sdf ffhj "fdsf   fsdjh"
日本語　中文 "Tiếng Việt" "English"
    dsfsd    
   sdf     " s dfs    fsd f   "  sd f   fs df  fdssf  "日本語　中文"
""   ""     ""
"   sdfsfds "   "f fsdf

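(Line 2 is empty, line 3 consists of spaces, and the last line is malformed.) Please judge against your own expected output, since it may differ, but as a baseline the first case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].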
