
I have some .toml files with a predictable structure, for example:

key1 = "someID"
key2 = "someVersionNumber"
key3 = "someTag"
key4 = "someOtherTag"
key5 = [] #empty array, sometimes contains strings
key6 = "long text"
key7 = "more text"
key8 = """
- text
- more text
- so much text
"""

I would like to convert this to CSV, like so:

"key1","key2","key3","key4","key5","key6","key7","key8"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text", "- text- more text- so much text"

Can I do this in a few lines of bash?

What if I also want to combine the CSV rows from all files into a single CSV, for example:

"key1","key2","key3","key4","key5","key6","key7","key8"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text", "- text- more text- so much text"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text", "- text- more text- so much text"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text", "- text- more text- so much text"

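...that is, the output would be one CSV row per .toml file plus the header at the top (always the same CSV header and number of columns, since the .toml files are predictable).

Am I looking at sed, awk, or something simpler? I have looked at some related questions, but I feel I must be missing something, because they involve far more functionality than I need:

Extract data between two points in a text file

Parsing json with awk/sed in bash to get key value pair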


3 Answers


If there were only a single input file, I would use a Perl one-liner. Unfortunately, the result turns out to be fairly complex:

perl -pe 'if(/"""/&&s/"""/"/.../"""/&&s/"""/"\n/){s/[\n\r]//;};if(/ = \[([^]]*)]/){$r=$1eq""?"\"\"":$1=~s/"\s*,\s*"/ /gr;s/ = \[([^]]*)]/ = $r/};s/"\s*#[^"\n]*$/"/' one.toml | perl -ne 'if(/^([^"]+) = "(.*)"/){push@k,$1;push@v,"\"$2\""}END{print((join",",@k),"\n",join",",@v)}'

If we need to process multiple files (*) at once, it only gets worse:

perl -ne 'if(/"""/&&s/"""/"/.../"""/&&s/"""/"\n/){s/[\n\r]//;};if(/ = \[([^]]*)]/){$r=$1eq""?"\"\"":$1=~s/"\s*,\s*"/ /gr;s/ = \[([^]]*)]/ = $r/};s/"\s*#[^"\n]*$/"/;print;print"-\n"if eof' *.toml | perl -ne 'if(/^-$/){push@o,join",",@k if scalar@o==0;push@o,join",",@v;@k=@v=()};if(/^([^"]+) = "(.*)"/){push@k,$1;push@v,"\"$2\""}END{print join"\n",@o}'

Both of these really call for a structured script. Here is one in Perl, but the same could be done in Python or any language you prefer:

#!/usr/bin/env perl
use strict; use warnings;
my @output;

foreach my $filename (@ARGV) {
    my ($content, @keys, @values);
    open my $fh, "<:encoding(utf8)", $filename or die "Could not open $filename: $!";
    { local $/; $content = <$fh>; }    # slurp the whole file
    close $fh;
    # Collapse each """multi-line""" string into a one-line "string".
    $content =~ s/"""([^"]*)"""/'"' . $1=~s#[\r\n]##rg . '"'/ge;
    foreach my $line (split /[\r\n]/, $content) {
        # Flatten an array: [] becomes "", ["a", "b"] becomes "a b".
        if ($line =~ m/ = \[([^]]*)]/) {
            my $replace = $1 eq "" ? '""' : $1 =~ s/"\s*,\s*"/ /gr;
            $line =~ s/ = \[([^]]*)]/ = $replace/;
        }
        $line =~ s/"\s*#[^"]*$/"/;    # strip an inline comment after a string
        # Skip anything that is not a key = "value" line (e.g. blank lines),
        # so that stale $1/$2 from a previous match are never reused.
        next unless $line =~ m/^([^"]+) = "(.*)"/;
        push @keys, $1;
        push @values, '"' . $2 . '"';
    }
    push @output, join ",", @keys if scalar @output == 0;
    push @output, join ",", @values;
}
print join("\n", @output), "\n";

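Notes:

Most of the complexity comes from having to deal with arrays (!), comments, and multi-line strings. Each of these needs some preprocessing, which accounts for most of the solution's length. In addition, more information is needed about possible corner cases and how to handle them (for example, how to fit an array of strings into a CSV). All of this underlines the importance of quality and consistency in the input data. The proposed solution is by no means complete or robust, since it makes several assumptions about the input data and the desired output format. Here is how I resolved the questions above: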

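  • Values should be strings only, as they are in the posted example file. The script does not handle numbers, dates, or booleans.
  • Arrays can be empty ([]) or arrays of strings (["my", "array"]). In the absence of a clear specification from the OP, they are converted to a single string made of all element strings joined by a space. Line breaks are not allowed inside arrays, and arrays cannot contain other arrays.
  • Comments are handled only when they appear inline after a string value. Comment-only lines are not supported.
  • Indentation, blank lines, and section headers are not handled.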

Test run:

$ perl toml-to-csv.pl *.toml
"someID1","someVersionNumber1","someTag1","someOtherTag1","","long text1","more text1","- text- more text- so much text"
"someID2","someVersionNumber2","someTag2","someOtherTag2","Array","long text2","more text2","- text- more text- so much text"
"someID3","someVersionNumber3","someTag3","someOtherTag3","My array","long text3","more text3","- text- more text- so much text"
answered 2019-05-22T16:50:09.220
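$ cat tst.awk
BEGIN { OFS="," }
{
    # Strip a trailing comment (one that contains no double quotes).
    sub(/[[:space:]]*#[^"]*$/,"")
    key = val = $0
}

# A "key = value" line: remember the key and the unquoted value.
sub(/^[[:alnum:]]+[[:space:]]+=[[:space:]]+/,"",val) {
    sub(/[[:space:]]+.*/,"",key)
    keys[++numKeys] = key
    gsub(/^("""|\[])$|^"|"$/,"",val)
    vals[numKeys] = val
}

# A "- text" line inside the multi-line string: append it to the last value.
/^-[[:space:]]+/ {
    vals[numKeys] = vals[numKeys] val
}

# The closing """ ends the record: print the header once, then the row.
/^"""$/ {
    if ( !doneHdr++ ) {
        for (keyNr=1; keyNr<=numKeys; keyNr++) {
            printf "\"%s\"%s", keys[keyNr], (keyNr<numKeys ? OFS : ORS)
        }
    }
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        printf "\"%s\"%s", vals[keyNr], (keyNr<numKeys ? OFS : ORS)
    }
    numKeys = 0    # start afresh for the next input file
}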


$ awk -f tst.awk file
"key1","key2","key3","key4","key5","key6","key7","key8"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text","- text- more text- so much text"

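Replace file with your list of input files.

The regexp in sub(/[[:space:]]*#[^"]*$/,"") that removes comments starting with # means that you cannot have double quotes in your comments. I did this to guard against a # appearing inside a data string. Feel free to come up with a better regexp or some other approach for handling your comments.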
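
To see what that pattern does and does not strip, here is a small Python sketch of the same regular expression. It is only an illustration: strip_comment is a made-up name, and \s stands in for the POSIX class [[:space:]], which is close but not guaranteed to be identical in every locale.

```python
import re

# Same pattern as the awk sub() call, written for Python's re module.
COMMENT_RE = re.compile(r'\s*#[^"]*$')


def strip_comment(line):
    """Remove a trailing '#' comment, provided it contains no double quote."""
    return COMMENT_RE.sub("", line)


# A plain trailing comment is removed.
print(strip_comment('key5 = [] #empty array, sometimes contains strings'))
# A '#' inside a quoted value is followed by a closing quote, so it is kept.
print(strip_comment('key1 = "issue #42"'))
# The trade-off: a comment that itself contains a double quote is NOT removed.
print(strip_comment('key1 = "x" # say "hi"'))
```

The first call prints key5 = [], while the other two lines come back unchanged, which shows both the protection for # inside data strings and the limitation on quotes inside comments.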

answered 2019-05-21T17:04:31.493
answered 2022-02-28T03:50:01.357