
I need to compare a set of variables in the file "tmpcsv2" against those in "uniq_id". The files are described in detail below.

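tmpcsv2 -> This file is updated by another script, "script1". Each run of "script1" overwrites (does not append to) "tmpcsv2" with new variables. The number of variables can be as low as 1 and can go up to 200.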

eg:
2042344352
2470697747
2635527510
3667769962

uniq_id -> This is a fixed set of variables (about 100,000 entries)

(Business Name,Job ID,Job Size)
biz,1000036446,225210640
biz,100006309,6710840
biz,1000069211,2084019000
biz,1000118720,34194040
biz,1000150241,212322636

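I am comparing them with a 'for' loop plus 'if', as shown below. Is there a simpler or faster (lower-impact) way to do this? When I run it, it takes a long time to produce the results. The print commands are only for testing and will be removed later!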

****Part of a bigger script****
amt=0
mjc=0
for jbid in `cat tmpcsv2` #Pick ID for match & calculation
do
    printf "Checking ID $jbid\n" >> Acsv3.tmp
    for bsid in `cat uniq_id` #Matching jobs & size calulation
    do
        ckid=`echo $bsid | cut -d "," -f2` #ckid is the ID to check
        jbsiz=`echo $bsid | cut -d "," -f3` #size of the ID
        if [ $jbid == $ckid ] 
        then
            printf "Matched at $ckid\n" #Print on Match found
            printf "Valid -> $jbid\n" >> Bcsv3.tmp
            ((mjc++)) #Increment Matched Job Count
            amt=$((amt+jbsiz)) #Add size of matched jobs
            break
        else
            printf "No Match at $cksid\n" #No matches
        fi
    done
    printf "Check for ID $jbid done\n" >> Acsv3.tmp
    printf "Matched $mjc jobs with combined size of $amt\n" >> Acsv3.tmp
done
****End of Comparision****

2 Answers


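The shell is the wrong tool for processing this much data, but it is doable. The most basic mistake here is reading the lines with for. Performance can be improved significantly by not re-opening the file on each iteration; in the question's code, every pass of the inner loop also spawns two `echo | cut` pipelines, which is where most of the time goes.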

function main {
    # Variables used elsewhere should be initialized there, not localized here.
    typeset amt=0 mjc=0 jbid ckid jbsiz

    while IFS= read -r jbid; do
        printf 'Checking ID %s\n' "$jbid" >&3
        # Fields: business name (ignored), job ID, job size, anything else (ignored).
        while IFS=, read -r _ ckid jbsiz _; do
            if [[ $jbid != "$ckid" ]]; then
                # Assuming "cksid" was a typo. Replace if not.
                printf 'No match at %s\n' "$ckid"
                continue
            fi
            case $jbsiz in
                *[^[:digit:]]*|'')
                    # Validation is important for the subsequent arithmetic.
                    return 1
                    ;;
            esac
            printf 'Matched at %s\n' "$ckid"
            printf 'Valid -> %s\n' "$jbid" >&4
            (( mjc++, amt += jbsiz ))
            break
        done <uniq_id
        {
            printf 'Check for ID %s done\n' "$jbid"
            printf 'Matched %s jobs with combined size of %s\n' "$mjc" "$amt"
        } >&3
    # The log files are opened once for the whole loop, on fds 3 and 4.
    done <tmpcsv2 3>>Acsv3.tmp 4>>Bcsv3.tmp
}

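Ultimately, an equivalent awk script will vastly outperform this Bash script, as will almost any other language. You can also get more performance out of Bash by using mapfile instead of read loops, but this nested read-loop logic is a bit sloppy to emulate with mapfile callbacks.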
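As a rough illustration of the awk approach mentioned above, the sketch below loads uniq_id into an array keyed by job ID, then checks every line of tmpcsv2 against that array in a single pass. The function name match_jobs and its argument order are made up for this example, and it assumes one ID per line in tmpcsv2 and name,ID,size records in a non-empty uniq_id. It prints to standard output rather than reproducing the Acsv3.tmp/Bcsv3.tmp logging.

```shell
#!/usr/bin/env bash
# Sketch only: match_jobs is a hypothetical helper, not part of the original script.
# Usage: match_jobs <uniq_id file> <tmpcsv2 file>
match_jobs() {
    awk -F, '
        # First file (uniq_id, assumed non-empty): remember job ID -> job size.
        NR == FNR { size[$2] = $3; next }

        # Second file (tmpcsv2): the whole line is the ID to check.
        $1 in size {
            mjc++
            amt += size[$1]
            print "Valid -> " $1
            next
        }
        { print "No data for " $1 }

        # %.0f avoids integer-overflow quirks of %d in some awk implementations.
        END { printf "Matched %d jobs with combined size of %.0f\n", mjc, amt }
    ' "$1" "$2"
}
```

Calling match_jobs uniq_id tmpcsv2 reads each file exactly once, so the cost grows with the sum of the two file sizes instead of their product. A header line in uniq_id is harmless here because its second field never equals a numeric ID.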

answered 2013-03-22T12:46:38.517

I came up with this. I am not sure whether it can be shortened, but it does run faster! Any help is much appreciated!

************
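amt=0   # Combined size of matched jobs
mjc=0   # Matched job count
cjc=0   # Count of jobs to check
while IFS= read -r line  #File read start
do
    IFS=, read -r -a ids <<< "$line"   # Split the line on commas
    for lsid in "${ids[@]}"
    do
        cksid=${lsid//[*\"]/}   # Strip asterisks and double quotes
        printf 'Checking for %s\n' "$cksid"
        ((cjc++))
        # First uniq_id record containing the ID as a whole word
        # (-m 1 is supported by GNU and BSD grep, not required by POSIX).
        if prsnt=$(grep -w -m 1 -- "$cksid" uniq_id)
        then
            printf 'Valid -> %s\n' "$cksid"
            jbsiz=$(cut -d, -f3 <<< "$prsnt")   # Job size is the third field
            (( mjc++, amt += jbsiz ))
        else
            printf 'No Data for %s\n' "$cksid"
        fi
    done
done < tmpcsv2
printf 'Matched %s of %s jobs with combined size of %s\n' "$mjc" "$cjc" "$amt"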
***********
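The loop above still scans uniq_id once per ID. If staying in Bash is a requirement, one possible alternative is to load uniq_id into an associative array once and turn each check into a hash lookup. This is only a sketch: it needs Bash 4 or later, the name sum_matches is invented for the example, and it assumes one ID per line in tmpcsv2 as shown in the question.

```shell
#!/usr/bin/env bash
# Sketch only: sum_matches is a hypothetical helper; requires Bash 4+.
# Usage: sum_matches <uniq_id file> <tmpcsv2 file>
sum_matches() {
    local uniq_file=$1 ids_file=$2 id size amt=0 mjc=0
    local -A size_of=()

    # Load the fixed file once: job ID -> job size.
    # Records with an empty ID or a non-numeric size (such as a header) are skipped.
    while IFS=, read -r _ id size _; do
        [[ -n $id && $size =~ ^[0-9]+$ ]] && size_of[$id]=$size
    done <"$uniq_file"

    # Each check is now an array lookup instead of a scan of the whole file.
    while IFS= read -r id; do
        [[ -n $id && -n ${size_of[$id]+set} ]] || continue
        # 10# forces base 10 so a size with leading zeros is not read as octal.
        (( mjc += 1, amt += 10#${size_of[$id]} ))
    done <"$ids_file"

    printf 'Matched %d jobs with combined size of %d\n' "$mjc" "$amt"
}
```

Unlike grep -w, the lookup matches only the job ID field, so an ID that happens to equal some other job's size is not counted as a match.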
answered 2013-03-24T06:49:49.907