regex - Regular expression to split a CSV

Question

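I know this (or something similar) has been asked many times, but after trying numerous possibilities I have not been able to find a regular expression that works 100% of the time.

I have a CSV file that I am trying to split into an array, but I am running into two problems: commas inside quotes and empty elements.

The CSV looks like this: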

123,2.99,AMO024,Title,"Description, more info",,123987564

The regular expression I tried is:

thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

The only problem is that, in my output array, the 5th element comes out as 123987564 rather than as an empty string.

score 64 · Accepted Answer

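Description

Instead of using a split, I think it would be easier to simply run a match and process all of the matches found.

This expression will:

split your sample text on the comma delimiters
handle empty values
ignore commas inside double quotes, provided the double quotes are not nested
trim the delimiting comma from the returned value
trim the surrounding quotes from the returned value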

Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)


Example

Sample text

123,2.99,AMO024,Title,"Description, more info",,123987564

ASP example using the non-Java expression

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
  For z = 0 to Matches.Count-1
    results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
    For zz = 0 to Matches(z).SubMatches.Count-1
      results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
    next
    results=Left(results,Len(results)-1) & chr(13)
  next
Response.Write "<pre>" & results

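Matches using the non-Java expression

Group 0 gets the entire substring, including the comma
Group 1 gets the quote, if one is used
Group 2 gets the value, not including the comma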

[0][0] = 123
[0][1] = 
[0][2] = 123

[1][0] = ,2.99
[1][1] = 
[1][2] = 2.99

[2][0] = ,AMO024
[2][1] = 
[2][2] = AMO024

[3][0] = ,Title
[3][1] = 
[3][2] = Title

[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info

[5][0] = ,
[5][1] = 
[5][2] = 

[6][0] = ,123987564
[6][1] = 
[6][2] = 123987564
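
The match-and-collect idea above can also be tried in Python, whose re module accepts the same (?(1)...) conditional. This is only a minimal sketch under stated assumptions, not the answer's own code: the helper name split_csv_line is made up for illustration, and like the expression itself it does not attempt to handle nested or escaped quotes.

```python
import re

# Same expression as in the answer. Group 1 is the optional opening quote,
# group 2 is the value without the delimiter or the surrounding quotes.
FIELD = re.compile(r'(?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)')


def split_csv_line(line):
    """Return the values of one CSV line (hypothetical helper name)."""
    return [m.group(2) for m in FIELD.finditer(line)]


sample = '123,2.99,AMO024,Title,"Description, more info",,123987564'
fields = split_csv_line(sample)
print(fields)
```

Taking group 2 of every match gives the value with the comma and quotes already trimmed, so the empty column shows up as an empty string instead of being skipped.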

score 35 · Accepted Answer

I worked on this for a while and came up with the following solution:

(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

Try it out here!

This solution handles "nice" CSV data such as

"a","b",c,"d",e,f,,"g"

0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"

and uglier things like

"""test"" one",test' two,"""test"" 'three'","""test 'four'"""

0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""

Here is an explanation of how it works:

(?:,|\n|^)      # all values must start at the beginning of the file,  
                #   the end of the previous line, or at a comma  
(               # single capture group for ease of use; CSV can be either...  
  "             # ...(A) a double quoted string, beginning with a double quote (")  
    (?:         #        character, containing any number (0+) of  
      (?:"")*   #          escaped double quotes (""), or  
      [^"]*     #          non-double quote characters  
    )*          #        in any order and any number of times  
  "             #        and ending with a double quote character  

  |             # ...or (B) a non-quoted value  

  [^",\n]*      # containing any number of characters which are not  
                # double quotes ("), commas (,), or newlines (\n)  

  |             # ...or (C) a single newline or end-of-file character,  
                #           used to capture empty values at the end of  
  (?:\n|$)      #           the file or at the ends of lines  
)
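
As a rough check of the walkthrough above, the pattern can be run with Python's re module. This is a sketch under assumptions: the names raw_values and unquote are invented here, and the unquoting step (dropping the outer quotes and collapsing "" into ") is an addition that the answer itself does not perform.

```python
import re

# The pattern from the answer, unchanged; group 1 is the captured value.
VALUE = re.compile(r'(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))')


def raw_values(text):
    """Return each captured value exactly as matched, quotes included."""
    return [m.group(1) for m in VALUE.finditer(text)]


def unquote(value):
    """Added step: strip the outer quotes and collapse doubled quotes."""
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return value[1:-1].replace('""', '"')
    return value


nice = '"a","b",c,"d",e,f,,"g"'
ugly = '"""test"" one",test\' two,"""test"" \'three\'","""test \'four\'"""'

print(raw_values(nice))
print([unquote(v) for v in raw_values(ugly)])
```

The raw values keep their quotes, as in the listings above; whether to unquote them afterwards is a separate design choice.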

score 13 · Accepted Answer

I created this a few months ago for a project.

 ".+?"|[^"]+?(?=,)|(?<=,)[^"]+


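It works in C#, and Debuggex was happy when I selected Python and PCRE. JavaScript engines without lookbehind support (it was added in ES2018) do not recognize the (?<=...) form.

For your values, it comes up with: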

123
,2.99
,AMO024
,Title
"Description, more info"
,
,123987564

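Note that nothing in quotes has a leading comma, but matching with a leading comma was needed for the empty-value use case. Once done, trim the values as necessary.

I use RegexHero.Net to test my regex.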
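
The "trim the values as necessary" step might look like the following in Python. This is only a sketch: the helper name trimmed_values is hypothetical, and the trimming rules (drop one leading comma, then drop one pair of surrounding double quotes) are an assumption about what "as necessary" means for this data.

```python
import re

# The pattern from the answer; (?<=,) is fixed-width, so Python accepts it.
PIECE = re.compile(r'".+?"|[^"]+?(?=,)|(?<=,)[^"]+')


def trimmed_values(line):
    """Match the pieces, then trim the delimiter and surrounding quotes."""
    values = []
    for piece in PIECE.findall(line):
        if piece.startswith(','):
            piece = piece[1:]  # drop the leading delimiter
        if len(piece) >= 2 and piece[0] == '"' and piece[-1] == '"':
            piece = piece[1:-1]  # drop the surrounding quotes
        values.append(piece)
    return values


sample = '123,2.99,AMO024,Title,"Description, more info",,123987564'
pieces = PIECE.findall(sample)
values = trimmed_values(sample)
print(pieces)
print(values)
```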

score 12 · Accepted Answer

I'm late to the party, but the following is the regular expression I use:

(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

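This pattern has three capturing groups:

Contents of a quoted cell
Contents of an unquoted cell
A new line

This pattern handles all of the following: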

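Normal cell contents without any special features: one,two,three
Cell containing a double quote (" is escaped to ""): no quote,"a ""quoted"" thing",end
Cell containing a newline character: one,two\nthree,four
Normal cell contents with an internal quote: one,two"three,four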
Cell containing a quotation mark followed by a comma: one,"two ""three"", four",five

See this pattern in use.
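
One way the three capturing groups could be consumed is sketched below in Python. The function name parse_rows is invented for illustration, the re.MULTILINE flag is an assumption made so that ^ and $ apply to each line, and collapsing "" into " is an extra step the pattern does not do by itself.

```python
import re

# Pattern from the answer: group 1 = quoted cell contents,
# group 2 = unquoted cell contents, group 3 = a line break.
CELL = re.compile(
    r'(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)',
    re.MULTILINE,
)


def parse_rows(text):
    """Group the matched cells into rows (hypothetical helper)."""
    rows, row = [], []
    for m in CELL.finditer(text):
        quoted, plain, eol = m.groups()
        if eol is not None:
            rows.append(row)  # a line break closes the current row
            row = []
        elif quoted is not None:
            row.append(quoted.replace('""', '"'))
        else:
            row.append(plain)
    if row:
        rows.append(row)
    return rows


rows = parse_rows('no quote,"a ""quoted"" thing",end\none,two')
print(rows)
```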

If you are using a more capable regex flavor, with named groups and lookbehinds, I prefer the following:

(?<quoted>(?<=,"|^")(?:""|[\w\W]*?)*(?=",|"$))|(?<normal>(?<=,(?!")|^(?!"))[^,]*?(?=(?<!")$|(?<!"),))|(?<eol>\r\n|\n)

See this pattern in use.

Edit

(?:^"|,")(""|[\w\W]*?)(?=",|"$)|(?:^(?!")|,(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

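This slightly modified pattern handles lines where the first column is empty, as long as you are not using JavaScript. For some reason JavaScript omits the second column with this pattern. I was unable to handle this edge case correctly.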

score 9 · Accepted Answer

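I needed this answer too, but I found the answers here, while informative, a bit hard to follow and to replicate in other languages. Here is the simplest expression I came up with for a single column of a CSV line. I am not splitting the line; I am building a regex that matches one column of the CSV: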

("([^"]*)"|[^,]*)(,|$)

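This matches a single column in a CSV line. The first part of the expression, "([^"]*)", matches a quoted entry; the second part, [^,]*, matches an unquoted entry. Either one is then followed by a , or by the end of line, $.

And the accompanying Debuggex link to test the expression: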

https://www.debuggex.com/r/s4z_Qi2gZiyzpAhx
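
Applied repeatedly, the single-column expression can walk a whole line. The Python sketch below assumes a helper named columns (not part of the answer) and stops once the terminator group matched the end of line rather than a comma; without that check, Python would report one extra empty match at the very end of the string.

```python
import re

# Group 1: the whole column, group 2: contents of a quoted column,
# group 3: the terminator (a comma, or empty at end of line).
COLUMN = re.compile(r'("([^"]*)"|[^,]*)(,|$)')


def columns(line):
    """Collect every column of one CSV line (hypothetical helper)."""
    result = []
    for m in COLUMN.finditer(line):
        quoted = m.group(2)
        result.append(quoted if quoted is not None else m.group(1))
        if m.group(3) == '':  # reached end of line, so this was the last column
            break
    return result


sample = '123,2.99,AMO024,Title,"Description, more info",,123987564'
print(columns(sample))
```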

score 5 · Accepted Answer

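I personally tried many regular expressions without finding one that handles every case.

I think regular expressions are hard to configure so that they match all cases properly. Although a few people will not like the namespace (and I am one of them), I suggest something that is part of the .NET Framework and that consistently gives me correct results in all cases (mainly, it handles every double-quote case very well):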

Microsoft.VisualBasic.FileIO.TextFieldParser

Found it here: StackOverflow

Usage example:

TextReader textReader = new StringReader(simBaseCaseScenario.GetSimStudy().Study.FilesToDeleteWhenComplete);
Microsoft.VisualBasic.FileIO.TextFieldParser textFieldParser = new TextFieldParser(textReader);
textFieldParser.SetDelimiters(new string[] { ";" });
string[] fields = textFieldParser.ReadFields();
foreach (string path in fields)
{
    ...

Hope it helps.

score 4 · Accepted Answer

In Java, this pattern ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))" almost works for me:

String text = "\",\",\",,\",,\",asdasd a,sd s,ds ds,dasda,sds,ds,\"";
String regex = ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))";
Pattern p = Pattern.compile(regex);
String[] split = p.split(text);
for(String s:split) {
    System.out.println(s);
}

Output:

","
",,"

",asdasd a,sd s,ds ds,dasda,sds,ds,"

Drawback: it does not work when a column contains an odd number of quotes :(
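
For comparison, the same "split on commas followed by an even number of quotes" idea can be sketched in Python with re.split. The variable names are invented here, and the inner group is written as non-capturing so that re.split does not insert group text into the result. The second input illustrates the drawback just mentioned.

```python
import re

# A comma only counts as a delimiter if an even number of quotes follows it.
UNQUOTED_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

sample = '123,2.99,AMO024,Title,"Description, more info",,123987564'
fields = UNQUOTED_COMMA.split(sample)
print(fields)

# One stray quote makes every comma before it look "quoted".
odd = 'a,b"c,d'
broken = UNQUOTED_COMMA.split(odd)
print(broken)
```

In this sketch Python keeps the empty column and leaves the surrounding quotes in place. How empty strings are treated differs between split implementations, so the result on another engine may not be the same.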

score 3 · Accepted Answer

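The advantage of using JScript for classic ASP pages is that you can use one of the many libraries written for JavaScript.

Like this one: https://github.com/gkindel/CSV-JS. Download it, include it in your ASP page, and parse the CSV with it.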

<%@ language="javascript" %>

<script language="javascript" runat="server" src="scripts/csv.js"></script>
<script language="javascript" runat="server">

var text = '123,2.99,AMO024,Title,"Description, more info",,123987564',
    rows = CSV.parse(text);

    Response.Write(rows[0][4]);
</script>

score 3 · Accepted Answer

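Aaaand another answer here. :) Because I could not quite get the others to work.

My solution handles escaped quotes (doubled occurrences), and it does not include the delimiters in the match.

Note that I have been matching against ' rather than " because that was my scenario, but simply replace them in the pattern for the same effect.

Here it is (remember to use the "ignore whitespace" flag /x if you use the commented version below):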

# Only include if previous char was start of string or delimiter
(?<=^|,)
(?:
  # 1st option: empty quoted string (,'',)
  '{2}
  |
  # 2nd option: nothing (,,)
  (?:)
  |
  # 3rd option: all but quoted strings (,123,)
  # (included linebreaks to allow multiline matching)
  [^,'\r\n]+
  |
  # 4th option: quoted strings (,'123''321',)
  # start pling
  ' 
    (?:
      # double quote
      '{2}
      |
      # or anything but quotes
      [^']+
    # at least one occurrence - greedy
    )+
  # end pling
  '
)
# Only include if next char is delimiter or end of string
(?=,|$)

Single-line version:

(?<=^|,)(?:'{2}|(?:)|[^,'\r\n]+|'(?:'{2}|[^']+)+')(?=,|$)

Regular expression visualization (if it works; Debuggex seems to be having problems right now, otherwise follow the next link)

Debuggex demo

Regex 101 example

score 2 · Accepted Answer

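Yet another answer, with a few extra features such as support for quoted values that contain escaped quotes and CR/LF characters (single values that span multiple lines).

NOTE: Although the solution below can likely be adapted to other regex engines, using it as-is requires that your regex engine treat multiple named capture groups with the same name as one single capture group. (.NET does this by default.)

When multiple lines/records of a CSV file/stream (matching RFC 4180) are passed to the regular expression below, it returns one match for each non-empty line/record. Each match contains a capture group named Value that holds the values captured in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here is the commented pattern (test it on Regexstorm.net):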

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,\r\n]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL

Here is the raw pattern without all the comments or whitespace.

(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)

Here is a visualization from Debuggex.com (capture groups named for clarity).

Examples of how to use the regex pattern can be found in my answer to a similar question, on C# Pad, or here.

score 1 · Accepted Answer

If you know that you will not have an empty field (,,), this expression works well:

("[^"]*"|[^,]+)

As in the following example...

Set rx = new RegExp
rx.Pattern = "(""[^""]*""|[^,]+)"
rx.Global = True
Set col = rx.Execute(sText)
For n = 0 to col.Count - 1
    if n > 0 Then s = s & vbCrLf
    s = s & col(n)
Next

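However, if you anticipate an empty field and your text is relatively small, you might consider replacing the empty fields with a space before parsing, to make sure they are captured. For example...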

...
Set col = rx.Execute(Replace(sText, ",,", ", ,"))
...

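If you need to preserve the integrity of the fields, you can restore the commas and test for the spaces inside the loop. It may not be the most efficient method, but it gets the job done.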
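
The placeholder workaround could be sketched in Python roughly as follows. The helper name split_with_placeholder is hypothetical, and two details are assumptions beyond the answer: the replacement is repeated so that runs such as ,,, are fully padded, and a field consisting of exactly one space is mapped back to an empty string. The sketch shares the approach's limits: a real single-space field is lost, a ,, inside quotes gets padded too, and empty fields at the very start or end of the line are not captured.

```python
import re

# The simple pattern from the answer: a quoted run, or a run of non-commas.
FIELD = re.compile(r'("[^"]*"|[^,]+)')


def split_with_placeholder(line):
    """Pad empty fields with a space, match, then map the space back to ''."""
    padded = line
    while ',,' in padded:  # one pass leaves ",,," only half padded
        padded = padded.replace(',,', ', ,')
    return ['' if field == ' ' else field for field in FIELD.findall(padded)]


sample = '123,2.99,AMO024,Title,"Description, more info",,123987564'
print(split_with_placeholder(sample))
```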

score 1 · Accepted Answer

I am using this one; it works with a comma separator and double-quote escaping. Normally this should solve your problem:

/(?<=^|,)(\"(?:[^"]+|"")*\"|[^,]*)(?:$|,)/g

score 1 · Accepted Answer

I use this expression. It takes into account a space after the comma, which I came across in my data.

(?:,"|^"|, ")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

score 0 · Accepted Answer

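I also needed to split CSV values, taken from SQL INSERT statements.

In my case, I could assume that strings are wrapped in single quotes and numbers are not.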

csv.split(/,((?=')|(?=\d))/g).filter(function(x) { return x !== '';});

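For some probably obvious reason, this regex produces some blank results. I could ignore those, since any empty values in my data were represented as ...,'',... and not as ...,,....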

score 0 · Accepted Answer

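If I try the regex posted by @chubbsondubs on http://regex101.com with the 'g' flag, there are matches that contain only ',' or an empty string. With this regex:
(?:"([^"]*)"|([^,]*))(?:[,])
I can match the parts of the CSV (including quoted parts). (The line must be terminated with a ',' or the last part is not recognized.)
https://regex101.com/r/dF9kQ8/4
If the CSV looks like:
"",huhu,"hel lo",world,
there are 4 matches:
''
'huhu'
'hel lo'
'world'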

score 0 · Accepted Answer

,?\s*'.+?'|,?\s*".+?"|[^"']+?(?=,)|[^"']+

This regex works with single and double quotes, and also with one kind of quote inside the other!

score 0 · Accepted Answer

This one matches everything I need in C#:

(?<=(^|,)(?<quote>"?))([^"]|(""))*?(?=\<quote>(?=,|$))

strips the surrounding quotes
allows new lines
allows doubled quotes inside a quoted string
allows commas inside a quoted string

score -12 · Accepted Answer

The correct regular expression to match a single-quoted value with escaped [doubled] single quotes in it is:

'([^n']|(''))+'
