bash - bash 递归 CSV 解析器

Question

我已将此问题移至代码审查

我在 bash 中编写了一个递归下降解析器。

我想知道你是否可以指出一些对我有帮助的东西。它支持反斜杠转义和带解析错误报告的引用字段。

该脚本cut在某些方面的工作方式类似于.. 从文件中获取输入或stdin允许用户选择他们想要打印的行和字段。分别使用选项-l和-f。可以打印字段列表，并且可以使用选项 --list '1 2 3 4 5' 和 --list-seperator $'\n' 指定该列表的自定义分隔符。

#!/bin/bash

shopt -s extglob; # we need maxiumum expression capabilities enabled

# option variables
declare list='' delim=$'\n' field='' lineNum=1;

while [[ ${1:0:1} = '-' ]]; do # Parse user arguments

    case $1 in
        -f)
            field=$2; shift 2;
        ;;
        -l)
            lineNum=$2; shift 2;
        ;;
        --list-seperator)
            delim="$2"; shift 2;
        ;;
        --list)
            list="$2"; shift 2;
        ;;
        *) break;;
    esac

done

# open a user supplied file on stdin
[[ -e "$1" ]] && exec 0<$1;

# data from 'read/getline'
declare input='';

# we are using sed to optimize input, the command just prints the desired line
read -r input < <(sed -n ${lineNum}p) 
# why doesn't the above work as a pipe into read?

# range of this line
declare strLen=${#input} value='';

# data processing variables
declare symbol='' value='';

# REGEX symbol "classes"
declare nothing='' comma='[,]' quote='["]' backslash='[\]' text='[^,\"]';

# integers:
declare -i iPos=-1 tPos=0;

# output array:
declare -a items=();

NextSymbol() {

    symbol="${input:$((++iPos)):1}"; # get next char from string

    (( iPos < strLen )); # return false if we are out of range

}

Accept() {

    [[ -z "$symbol" && -z "$1" ]] && return 0; # accept "nothing/empty"

    # if you can meld the above line into the next line
    # let me know: pc.wiz.tt@gmail.com; this is some kind of bug!
    # becare careful because expect expects 'nothing' to be empty.
    # that's why it says 'end of input'

    [[ "$symbol" =~ ^$1$ ]] && NextSymbol && return 0; # match symbol
}

Expect() {
    Accept "$1" && return
    local msg="$0: parse failure: line $lineNum: expected symbol: "
    echo "$msg'${1:-end of input}'" >&2;
    echo "$0: found: '$symbol'" >&2;
    exit 1;
}

value() {

    while Accept $text; do # symbol will not be what we expect here
        value+=${input:$((iPos-1)):1}; # so get what we expect
    done

    Accept $nothing && { # this will only happen at end of the string
        value+=${input:$((iPos-1)):1} # get the last char
        pushValue; # push the data into the array
    }

}

pushValue() {
    items[tPos++]=$value;
    value=''; # clear value!
}

quote() {

    until [[ $symbol =~ $quote || -z $symbol ]]; do
        value+=$symbol;
        NextSymbol;
    done

    Expect $quote;

}

line() {

    Accept $quote && {
        quote
        line;
    }

    Accept $backslash && {
        value+=$symbol;
        NextSymbol;
        value;
        line;
    }

    Accept $comma && {
        pushValue;
        line;
    }

    Accept $text && {
        value=${input:$((iPos-1)):1};
        value;
        line;
    }

}

NextSymbol;
    line;
        Expect $nothing


[[ $field ]] && { # want field    
    echo "${items[field-1]}" # print the item
                             # (index justified for human use)
    exit;
}

[[ $list ]] && { # want list
    for index in $list; do
        echo -n "${items[index-1]}${delim}" # print the list 
                                            # (index justified for human use)
    done
    exit;
}

exit 1; # no field or list

score 2 · Accepted Answer

说真的，让它更快的方法是用一种不被解释的语言编写。

对于解析器来说，shell 可能不是一个糟糕的选择，原因与我不会用 6502 汇编语言或用 COBOL 编写操作系统的原因相同。

而且，老实说，awk不会那么好。它是顺序文本处理的理想选择，但将其应用于像解析器这样复杂的东西将是一个延伸。

有时，您只需要重新考虑您正在使用的工具。如果你做不到它是像 C 这样的编译语言，至少要考虑一下像 Python 这样的面向字节码的语言。

bash - bash 递归 CSV 解析器

1 回答 1

Related

Reference