python - Optimizing a shell script (bash) for performance

Question

I have a bash script that processes a text file:

#!/bin/bash

dos2unix sourcefile.txt

cat sourcefile.txt | grep -v '\/' | grep -v '\-\-' | grep -v '#' | grep '[A-Za-z]\*' > modified_sourcefile.txt

mv modified_sourcefile.txt sourcefile.txt
#
# Read the sourcefile file one line by line and iterate...
#

while read line
do

 echo $line | grep -v '\/' | grep -v '\-\-' | grep -v '#'
 if [ $? -eq 0 ]
 then

   # echo "Current Line is " $line ";"
    char1=`echo ${line:0:1}`
   # echo "1st char is " $char1

  if [ -n "$char1" ]
   # if a blank-line, neglect the line.
    then
        # echo "test passed"
        var1=`echo $line | cut -d '*' -f 1`
    var2=`echo $line | cut -d '*' -f 1`
    var3=`echo $line | cut -d - -f 1`
        var4=`echo $line | cut -d '*' -f 1`
        var5=`echo $line | cut -d '*' -f 2`
        var6=`echo $line | cut -d - -f 1`
        var7=`echo $line | cut -d '*' -f 3 `


        table1sql="INSERT IGNORE INTO table1 (id,name,active_yesno,category,description,
           last_modified_by,last_modified_date_time) SELECT ifnull(MAX(id),0)+1,'$var1',1,
           '$var2','$var3','admin',NOW() FROM table1;"

    echo $table1sql >> result.txt


    privsql="INSERT IGNORE INTO table2 (id,name,description,active_yesno,group_code,
             last_modified_by,last_modified_date_time) SELECT ifnull(MAX(id),0)+1,'$var1',
         '$var3',1,'$var2','admin',NOW() FROM table2;"

    echo $privsql >> result.txt     


    table1privmapsql="INSERT IGNORE INTO table1_table2_map (id,table1_id,table2_id,
                  last_modified_by,last_modified_date_time) SELECT ifnull(MAX(id),0)+1,
                  (select id from table1 where name='$var1'),(select id from table2 where name='$var1'),'admin',NOW() FROM table1_table2_map;"
    echo $table1privmapsql >> result.txt

        privgroupsql="INSERT IGNORE INTO table2_group (id,name,category,active_yesno,last_modified_by,
                      last_modified_date_time) SELECT ifnull(MAX(id),0)+1,'tablegrp','$pgpcode',1,'admin',NOW() FROM table2_group;"

        echo $privgroupsql >> result.txt


    privprivgrpsql="INSERT IGNORE INTO table2_table2group_map (id,table2_id,table2_group_id,
                        last_modified_by,last_modified_date_time) SELECT ifnull(MAX(id),0)+1,
                        (select id from table2 where name='$var1'),(select id from table2_group where name='tablegrp'),'admin',NOW() FROM table2_table2group_map;"
        echo $privprivgrpsql >> result.txt              

    rolesql="INSERT IGNORE INTO role (id,name,active_yesno,security_domain_id,last_modified_by,last_modified_date_time) 
                 SELECT (select ifnull(MAX(id),0)+1 from role),'$rolename',1, sd.id ,'admin',NOW() 
                 FROM security_domain sd WHERE sd.name = 'General';"

        echo $rolesql >> result.txt

    fi                  
 fi                        
done < "sourcefile.txt"

The problem is that sourcefile.txt has more than 11,000 lines, so the script takes about 25 minutes to complete :-(.

Is there a better way to do this?

Contents of sourcefile.txt:

AAA-something*LOCATION-some_where*ABC

score 5 · Accepted Answer

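To make this script faster, you have to minimize calls to external commands and use bash built-ins wherever possible.

Read this article to learn what a useless use of a command is.
Read this article to learn how to manipulate strings in bash.
Replace the repeated assignments (var1, var2, var4 all hold the same value) with a single one.

To optimize away cut, you can replace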

var1=`echo $line | cut -d '*' -f 1`

with

var1="${line%%\**}"

and

var5=`echo $line | cut -d '*' -f 2`

with

var5="${line%\**}"
var5="${var5##*\*}"

It may be less readable, but it runs much faster than cut because no external process is started.
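As a small illustration of the expansions above, here is a sketch that extracts every field the question's script needs from the sample line. The variable names field1, field2, field3 and prefix are made up for this example; they are not from the original script.

```shell
#!/bin/bash
# Sample line taken from the question.
line='AAA-something*LOCATION-some_where*ABC'

# Like `cut -d '*' -f 1`: drop the longest suffix that starts with '*'.
field1="${line%%\**}"

# Like `cut -d '*' -f 3` (for a three-field line): drop the longest prefix ending in '*'.
field3="${line##*\*}"

# Like `cut -d '*' -f 2`: strip the last field, then everything up to the remaining '*'.
field2="${line%\**}"
field2="${field2##*\*}"

# Like `cut -d - -f 1`: drop the longest suffix that starts with '-'.
prefix="${line%%-*}"

printf '%s\n' "$field1" "$field2" "$field3" "$prefix"
```

Note that the two-step form for the second field assumes exactly three '*'-separated fields, as in the sample line.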

Also,

 echo $line | grep -v '\/' | grep -v '\-\-' | grep -v '#'

can be replaced with something like:

 if [[ "$line" =~ ([/#]|--) ]]; then :; else 
    # all code inside "if [ $? -eq 0 ]"
 fi
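Putting these suggestions together, the loop might look roughly like the sketch below. It is only an outline under some assumptions: the function name generate_sql and the variable names are hypothetical, the trailing carriage return is stripped in bash as a stand-in for dos2unix, and only the first INSERT statement from the question is shown.

```shell
#!/bin/bash
# Hypothetical helper: reads source lines on stdin, writes SQL on stdout.
generate_sql() {
    local line name prefix
    local skip_re='[/#]|--'        # lines containing /, # or -- are ignored
    local keep_re='[A-Za-z][*]'    # keep only lines with a letter followed by '*'

    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}                   # strip a trailing CR (assumed dos2unix replacement)
        [[ $line =~ $skip_re ]] && continue
        [[ $line =~ $keep_re ]] || continue  # also skips blank lines

        name=${line%%\**}      # was var1, var2 and var4
        prefix=${line%%-*}     # was var3 and var6

        printf "INSERT IGNORE INTO table1 (id,name,active_yesno,category,description,last_modified_by,last_modified_date_time) SELECT ifnull(MAX(id),0)+1,'%s',1,'%s','%s','admin',NOW() FROM table1;\n" \
            "$name" "$name" "$prefix"
        # ... the remaining statements would follow the same pattern
    done
}

# Demo on three in-memory lines; only the first one should produce output.
printf '%s\r\n' 'AAA-something*LOCATION-some_where*ABC' '# comment*' 'a/b*c' | generate_sql
```

With a real file this would be called as `generate_sql < sourcefile.txt >> result.txt`, which opens result.txt once for the whole run instead of once per echo.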

score 4 · Accepted Answer

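Shell scripts are inherently slow, especially when they use as many external commands as yours does. The biggest reason is that spawning an external process is fairly slow, and you do it many times per input line.

If you really need high-performance data processing, you should write a Perl or Python script that does what you need without spawning any external processes: no dos2unix, no grep, no cut or anything similar.

Perl (and Python) are also perfectly capable of talking to the database and inserting the data directly, again without external commands.

If you do it right, I predict that processing with Perl will be at least 100 times faster than what you have now.

If you are OK with Perl, you can start from something like this and adapt it to your needs: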

#!/usr/bin/perl -w

use strict;
use warnings;

# Lexical filehandles and three-argument open avoid clobbering global names.
open my $in,  '<',  'sourcefile.txt' or die "Cannot open sourcefile.txt: $!";
open my $out, '>>', 'result.txt'     or die "Cannot open result.txt: $!";
while (my $line = <$in>) {
    # ignore lines with /, -- or #:
    next if $line =~ m{/|--|#};
    my ($var1, $var2, $var3, $var4, $var5) =
        ($line =~ /^(\w+)-(\w+)\*(\w+)-(\w+)\*(\w+)/);
    # ignore line if regex did not match (all captures are undef then):
    next unless defined $var5;
    print $out "some sql stmt. using $var1, $var2, etc\n";
    print $out "some other sql using $var1, $var2, etc\n";
    # ...
}
close $out or die "Cannot close result.txt: $!";
close $in;

score 1 · Accepted Answer

Before optimizing, profile! Learn how to use the time command. Find out which part of the script takes the most time, then focus your efforts there.
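As a rough illustration of that advice, the sketch below uses bash's time keyword to compare two ways of extracting the first '*'-separated field. The function names with_cut and with_expansion and the 1000-line test input are assumptions made for this example; actual timings depend on the machine, so none are shown.

```shell
#!/bin/bash
# Two hypothetical variants of the same field extraction, for timing only.
with_cut() {
    local line
    while IFS= read -r line; do
        echo "$line" | cut -d '*' -f 1     # one subshell plus one cut process per line
    done
}

with_expansion() {
    local line
    while IFS= read -r line; do
        printf '%s\n' "${line%%\**}"       # pure bash, no extra processes
    done
}

# Build a throwaway input of 1000 copies of the sample line from the question.
input=$(mktemp)
for ((i = 0; i < 1000; i++)); do
    printf '%s\n' 'AAA-something*LOCATION-some_where*ABC'
done > "$input"

# `time` reports real/user/sys for each variant on stderr.
time with_cut       < "$input" > /dev/null
time with_expansion < "$input" > /dev/null

rm -f "$input"
```

The same keyword works for the whole script (`time ./script.sh`), which gives a baseline to compare against after each change.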

That said, I suspect the multiple passes through grep are slowing things down.

This:

cat sourcefile.txt | grep -v '\/' | grep -v '\-\-' | grep -v '#' | grep '[A-Za-z]\*'

can be replaced with this:

grep '[A-Za-z]\*' sourcefile.txt | grep -v -e '\/' -e '\-\-' -e '#'
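Before swapping one pipeline for the other, it can be worth checking on a small sample that both keep the same lines. The sketch below does that; the function names filter_original and filter_combined are hypothetical, and the exclusion patterns are written without the backslashes (and, in the combined version, as one extended regex), which is my simplification rather than the answer's exact command.

```shell
#!/bin/bash
# Hypothetical wrappers: the four-grep chain versus a two-grep version.
filter_original() {
    cat "$1" | grep -v '/' | grep -v -e '--' | grep -v '#' | grep '[A-Za-z]\*'
}

filter_combined() {
    # Select candidate lines first, then drop any containing /, -- or #.
    grep '[A-Za-z]\*' "$1" | grep -Ev '/|--|#'
}

# Small throwaway sample covering each exclusion rule.
sample=$(mktemp)
printf '%s\n' \
    'AAA-something*LOCATION-some_where*ABC' \
    '# comment*' \
    'path/to*x' \
    'a--b*c' \
    'no asterisk here' > "$sample"

if [ "$(filter_original "$sample")" = "$(filter_combined "$sample")" ]; then
    echo "filters agree"
else
    echo "filters differ"
fi

rm -f "$sample"
```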
