language-agnostic - Split a string ignoring quoted sections

Question

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this on commas while ignoring the commas inside the quoted sections?

The output should be an array:

[ "a", "string, with", "various", "values, and some", "quoted" ]

score 21 · Accepted Answer

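Looks like you've got some good answers here.

For those of you looking to handle your own CSV file parsing, heed the advice from the experts and don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.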
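
To illustrate what a tested library takes care of, here is a minimal sketch using Python's standard csv module (not FileHelpers, which is .NET only). The helper name split_csv_line and the second sample string are made up for this example.

```python
import csv

def split_csv_line(line):
    """Parse one CSV line with the standard library's csv.reader."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line]))

# Commas inside quotes stay within their field.
print(split_csv_line('a,"string, with",various,"values, and some",quoted'))

# A doubled quote inside a quoted field is read as one literal quote.
print(split_csv_line('a,"say ""hi"", ok",b'))
```

The second call is the "quotes inside of quotes" case: the parser returns the field as say "hi", ok without any extra code on your side.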

score 6 · Accepted Answer

Python:

import csv

# newline="" lets the csv module handle line endings itself
with open("some.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

answered 2008-08-08T21:07:28.450

score 2 · Accepted Answer

Of course using a CSV parser is better, but just for the fun of it you could:

Loop on the string letter by letter.
    If current_letter == quote : 
        toggle inside_quote variable.
    Else if (current_letter ==comma and not inside_quote) : 
        push current_word into array and clear current_word.
    Else 
        append the current_letter to current_word
When the loop is done push the current_word into array
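
The pseudocode above could be sketched in Python roughly as follows. The function name split_outside_quotes and its sep and quote parameters are illustrative choices, not part of the original answer, and the sketch makes no attempt to handle escaped quotes.

```python
def split_outside_quotes(text, sep=",", quote='"'):
    """Split text on sep, ignoring any sep that appears between quotes."""
    fields = []
    current = []            # characters of the field being built
    inside_quote = False
    for ch in text:
        if ch == quote:
            # Toggle the flag; the quote character itself is dropped.
            inside_quote = not inside_quote
        elif ch == sep and not inside_quote:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    # The last field has no trailing separator, so push it after the loop.
    fields.append("".join(current))
    return fields

print(split_outside_quotes('a,"string, with",various,"values, and some",quoted'))
```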

score 1 · Accepted Answer

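The author here has dropped in a blob of C# code that handles the scenario you're having a problem with:

CSV File Imports in .Net

It shouldn't be too difficult to translate.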

score 1 · Accepted Answer

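If my language of choice didn't offer a way to do this without thinking, then I would initially consider two options as the easy way out:

Pre-parse and replace the commas within the quoted strings with another control character, then split on commas, then post-parse the array to replace the control character with the commas again.
Alternatively, split on the commas, then post-parse the resulting array into another array, checking for a leading quote on each entry and concatenating entries until I reach a terminating quote.

These are hacks, however, and if this is a pure "mental" exercise then I suspect they will prove unhelpful. If this is a real-world problem, then knowing the language would help us offer some specific advice.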
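
As a rough sketch of the first option, assuming the ASCII unit separator never occurs in the data (the PLACEHOLDER constant and the function name are made up for this example):

```python
PLACEHOLDER = "\x1f"  # ASCII unit separator; assumed never to appear in the input

def split_with_placeholder(text):
    """Mask quoted commas, split on the rest, then restore the commas."""
    # Pre-parse: replace commas that sit inside double quotes.
    masked = []
    inside_quote = False
    for ch in text:
        if ch == '"':
            inside_quote = not inside_quote
        masked.append(PLACEHOLDER if ch == "," and inside_quote else ch)
    # Split on the remaining (unquoted) commas.
    pieces = "".join(masked).split(",")
    # Post-parse: put the commas back and drop the surrounding quotes.
    return [p.replace(PLACEHOLDER, ",").strip('"') for p in pieces]

print(split_with_placeholder('a,"string, with",various,"values, and some",quoted'))
```

The pre-parse step still has to track whether it is inside quotes, so this is not much simpler than splitting in a single pass.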

score 1 · Accepted Answer

What if an odd number of quotes appear in the original string?

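This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only escaped if the whole field is delimited with double quotes, so:

field1, "field2, field3", field4, "field5, field6" field7

becomes

field1

field2, field3

field4

"field5

field6" field7

Notice that if a field doesn't both start and end with a quotation mark, then it's not a quoted field and the double quotes are simply treated as double quotes.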

If I recall correctly, the code I linked to doesn't actually handle this correctly.
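
One way to sketch that rule in Python is to split on every comma and then merge pieces only when a piece that opens with a quote can be paired with a later piece that closes with one. The function name split_quoted_fields is hypothetical, and the lookahead is deliberately naive, so treat this as an illustration of the rule rather than a CSV parser.

```python
def split_quoted_fields(line):
    """Treat a field as quoted only if it both starts and ends with a quote."""
    raw = line.split(",")
    fields = []
    i = 0
    while i < len(raw):
        merged = None
        if raw[i].lstrip().startswith('"'):
            # Look ahead for a piece that closes the quoted field.
            for j in range(i, len(raw)):
                candidate = ",".join(raw[i:j + 1]).strip()
                if len(candidate) >= 2 and candidate.endswith('"'):
                    merged = candidate[1:-1]  # drop the surrounding quotes
                    i = j                     # skip the pieces we consumed
                    break
        # No closing quote found: the quote is just an ordinary character.
        fields.append(merged if merged is not None else raw[i].strip())
        i += 1
    return fields

print(split_quoted_fields('field1, "field2, field3", field4, "field5, field6" field7'))
```

Because the lookahead accepts the first later piece that ends with a quote, it can merge too much when an unterminated field is followed by a properly quoted one.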

score 1 · Accepted Answer

Here's a simple Python implementation based on Pat's pseudocode:

def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
    string_split = []
    current_word = ""
    inside_quote = False
    for letter in string:
        if letter == "'":
            # keep the quote character unless asked to drop it
            if not remove_quotes:
                current_word += letter
            inside_quote = not inside_quote
        elif letter == split_char and not inside_quote:
            string_split.append(current_word)
            current_word = ""
        else:
            current_word += letter
    string_split.append(current_word)
    return string_split

score 0 · Accepted Answer

I use this to parse strings. I'm not sure if it helps here, but perhaps with some minor modifications?

function getstringbetween($string, $start, $end){
    $string = " ".$string;
    $ini = strpos($string,$start);
    if ($ini == 0) return "";
    $ini += strlen($start);   
    $len = strpos($string,$end,$ini) - $ini;
    return substr($string,$ini,$len);
}

$fullstring = "this is my [tag]dog[/tag]";
$parsed = getstringbetween($fullstring, "[tag]", "[/tag]");

echo $parsed; // (result = dog)

/mp

score 0 · Accepted Answer

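Here's a simple algorithm:

Determine whether the string begins with a '"' character.
Split the string into an array delimited by the '"' character.
Mark the quoted commas with a placeholder #COMMA#
- If the input starts with a '"', mark those items in the array where the index % 2 == 0
- Otherwise mark those items in the array where the index % 2 == 1
Concatenate the items in the array to form a modified input string.
Split the string into an array delimited by the ',' character.
Replace all instances of the #COMMA# placeholder in the array with the ',' character.
The array is your output.

Here's the Python implementation (fixed to handle '"a,b",c,"d,e,f,h","i,j,k"'):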

def parse_input(text):
    # 1 if the quoted pieces sit at odd indices, 0 if at even indices
    quote_mod = int(not text.startswith('"'))

    # split on quotes; dropping empty pieces also drops the leading ''
    # produced when the input starts with a quote
    parts = [item for item in text.split('"') if item != '']
    for i in range(len(parts)):
        if i % 2 == quote_mod:
            parts[i] = parts[i].replace(",", "#COMMA#")

    # split on the remaining commas, drop empty fields, restore the commas
    fields = [item for item in "".join(parts).split(",") if item != '']
    return [item.replace("#COMMA#", ",") for item in fields]

# parse_input('a,"string, with",various,"values, and some",quoted')
#  -> ['a', 'string, with', 'various', 'values, and some', 'quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
#  -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']

score 0 · Accepted Answer

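This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. A few months ago I found a fast, excellent C# CSV parser on CodeProject that I highly recommend!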
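
To give a feel for the regex route and where it runs out, here is a minimal Python sketch. The pattern and the name split_with_regex are illustrative assumptions, not taken from the CodeProject parser mentioned above.

```python
import re

# A field is either a fully quoted run (group 1) or a run without commas (group 2).
FIELD = re.compile(r'(?:^|,)(?:"([^"]*)"|([^,]*))')

def split_with_regex(line):
    """Split a line on commas with a regex, honouring simple quoted fields."""
    fields = []
    for m in FIELD.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        fields.append(quoted if quoted is not None else bare)
    return fields

print(split_with_regex('a,"string, with",various,"values, and some",quoted'))
```

This handles the string from the question, but an input with doubled quotes inside a quoted field, such as 'a,"say ""hi""",b', comes back with the second field cut short. That is the kind of case the remaining 10% is made of.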

score 0 · Accepted Answer

Here's one in pseudocode (a.k.a. Python) in one pass :-P

def parsecsv(instr):
    i = 0
    j = 0

    outstrs = []

    # i is fixed until a match occurs, then it advances
    # up to j. j inches forward each time through:

    while i < len(instr):

        if j < len(instr) and instr[j] == '"':
            # skip the opening quote...
            j += 1
            # then iterate until we find a closing quote.
            while j < len(instr) and instr[j] != '"':
                j += 1
            if j == len(instr):
                raise Exception("Unmatched double quote at end of input.")

        if j == len(instr) or instr[j] == ',':
            s = instr[i:j]  # get the substring we've found
            s = s.strip()    # remove extra whitespace

            # remove surrounding quotes if they're there
            if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
                s = s[1:-1]

            # add it to the result
            outstrs.append(s)

            # skip over the comma, move i up (to where
            # j will be at the end of the iteration)
            i = j+1

        j = j+1

    return outstrs

def testcase(instr, expected):
    outstr = parsecsv(instr)
    print(outstr)
    assert expected == outstr

# Doesn't handle things like '1, 2, "a, b, c" d, 2' or
# escaped quotes, but those can be added pretty easily.

testcase('a, b, "1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])

# odd number of quotes gives a "unmatched quote" exception
#testcase('a,b,"1, 2, 3" , "c', ['a', 'b', '1, 2, 3', 'c'])

score 0 · Accepted Answer

I just couldn't resist seeing if I could make it work in a Python one-liner:

import re
arr = [i.replace("|", ",") for i in re.sub(r'"([^"]*),([^"]*)"', r"\g<1>|\g<2>", str_to_test).split(",")]

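Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']

It works by first replacing the ',' inside quotes with another separator (|), then splitting the string on ',' and finally replacing the | separator with ',' again. Note that it only handles one comma per quoted section and assumes the data contains no '|'.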

score 0 · Accepted Answer

Since you said language-agnostic, I wrote my algorithm in the language that is as close to pseudocode as possible:

def find_character_indices(s, ch):
    return [i for i, ltr in enumerate(s) if ltr == ch]


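def split_text_preserving_quotes(content, include_quotes=False):
    # note: unquoted text is split on whitespace, not on commas
    quote_indices = find_character_indices(content, '"')
    if not quote_indices:  # nothing quoted: plain whitespace split
        return content.split()

    output = content[:quote_indices[0]].split()

    for i in range(1, len(quote_indices)):
        if i % 2 == 1:  # end of quoted sequence
            start = quote_indices[i - 1]
            end = quote_indices[i] + 1
            quoted = content[start:end]
            # keep the surrounding quotes only when asked to
            output.append(quoted if include_quotes else quoted[1:-1])

        else:  # unquoted text between two quoted sequences
            start = quote_indices[i - 1] + 1
            end = quote_indices[i]
            split_section = content[start:end].split()
            output.extend(split_section)

    # text after the last quote, added once after the loop
    output += content[quote_indices[-1] + 1:].split()

    return output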
