bash - Performance issue with this AWK script - it takes around 200 minutes on a 5GB input file

Question

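The code below runs on a 5GB file and consumes 99% CPU. I would like to know whether I am doing something badly wrong, or whether anything can be done to shorten the execution time.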

2013-04-03 08:54:19,989 INFO [Logger] 2013-04-03T08:54:19.987-04:00PCMC.common.manage.springUtil<log-message-body><headers><fedDKPLoggingContext id="DKP_DumpDocumentProperties " type="context.generated.FedDKPLoggingContext"><logFilter>7</logFilter><logSeverity>255</logSeverity><schemaType>PCMC.MRP.DocumentMetaData</schemaType><UID>073104c-4e -4ce-bda-694344ee62</UID><consumerSystemId>JTR</consumerSystemId><consumerLogin>jbserviceid</consumerLogin><logLocation>Successful Completion of Service</logLocation></fedDKPLoggingContext></headers><payload>0</payload></log-message-body>

This is the code I am using. I also tried feeding it the gz format, but to no avail. I invoke this awk script from bash with the command below.

awk -f mytest.awk <(gzip -dc scanfile.$yesterday.gz)| gzip > tem.gz

cat mytest.awk
#!/bin/awk -f

function to_ms (time, time_ms, s) {
    split(time, s, /:|\,/ )
    time_ms = (s[1]*3600+s[2]*60+s[3])*1000+s[4]
    #printf ("%s\n", newtime)
    return time_ms
}

{
   stid = gensub(/.*UID&amp;gt;([^&]+).*/,"\\1","")
}

(stid in starttime) {
    etime = to_ms($2)
    endtime[stid] = etime
    docid[stid] = gensub(/.*id="([^""]+).*/,"\\1","")
    consumer[stid]= gensub(/.*schemaType&amp;gt;PNC.([^.]+).*/,"\\1","")
    state[stid]= gensub(/.*lt;logLocation&amp;gt;([^'' ]+).*/,"\\1","")
    next
}

{
    stime = to_ms($2)
    starttime[stid] = stime
    st_hour[stid] = stime/(60*60*1000)
    timestamp[stid] = $1" "$2
}

END {
    print "Document,Consumer,Hour,ResponseTime,Timestamp,State"
    for (x in starttime) {
        for (y in endtime) {
            if (x==y) {
                diff = (endtime[y]-starttime[x])
                st = sprintf("%02d", st_hour[x])
                print docid[y], consumer[y], st":00", diff, timestamp[x], state[y] |"sort -k3"
                delete starttime[x]
                delete endtime[y]
                delete docid[y]
                delete consumer[y]
                delete timestamp[x]
                delete state[y]
            } 
        }
    }
}

score 3 · Accepted Answer

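In the END section, the inner for loop always runs to completion, even after the matching y has been found and deleted from the endtime array. I'd suggest using a break to jump out of the inner loop.

On the other hand (as far as I can see), the inner loop is not needed at all! All it does is look for an element with a known key in an associative array.

I'd also suggest not deleting the items that were found. Looking up an item in an associative array takes roughly constant time (depending on the hashing algorithm and how many collisions it produces), so deleting items from such an array will not necessarily speed anything up, while the deletions themselves will certainly slow it down.

So I'd suggest using this: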

for (x in starttime) {
    if (x in endtime) {
        diff = (endtime[x]-starttime[x])
        st = sprintf("%02d", st_hour[x])
        print docid[x], consumer[x], st":00", diff, timestamp[x], state[x] |"sort -k3"
    }
}
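
To make the cost difference concrete, here is a minimal sketch that only counts loop iterations. It assumes 1000 synthetic keys (the names `id1`..`id1000` are made up for illustration) and leaves out the deletes from the question, so it shows the worst case of the nested form rather than the exact behaviour of the original script.

```shell
#!/bin/sh
# Count the work done by the nested loops versus a single loop with "in".
n=1000
counts=$(awk -v n="$n" 'BEGIN {
    # Synthetic data: every key has both a start and an end time.
    for (i = 1; i <= n; i++) { starttime["id" i] = i; endtime["id" i] = i + 5 }

    # Nested form from the question (without the deletes): n*n comparisons.
    for (x in starttime)
        for (y in endtime) { nested++; if (x == y) matched_nested++ }

    # Single loop with a membership test: n hash lookups.
    for (x in starttime) { single++; if (x in endtime) matched_single++ }

    printf "nested=%d single=%d matches=%d/%d\n", nested, single, matched_nested, matched_single
}')
echo "$counts"
```

Both forms find the same matches, but the nested form does n times as much work, which adds up quickly if a 5GB log contains a large number of distinct UIDs.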

Using gzip will consume even more CPU, but it can save some I/O bandwidth.

score 3 · Accepted Answer

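Assuming there is only one end time per stid: don't build up an array of start times and an array of end times and then loop over them; just process the stid when you hit its end time. I.e. not this, as you do today: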

{ stid = whatever }
stid in starttime {
   populate endtime[stid]
   next
}
{ populate starttime[stid] }
END {
   for (x in starttime) {
      for (y in endtime) {
          if (x == y) {
              stid = x
              process endtime[stid] - starttime[stid]
          }
      }
   }
}

but this:

{ stid = whatever }
stid in starttime {
   process to_ms($2) - starttime[stid]
   delete starttime[stid]
   next
}
{ populate starttime[stid] }
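
The outline above could be fleshed out roughly as follows. This is only a sketch under a simplifying assumption: the input is a hypothetical three-column format (`date time stid`), so `$3` stands in for the gensub() extraction the real log needs, and only the response time is printed.

```shell
#!/bin/sh
# Process each stid as soon as its second (end) record arrives.
result=$(printf '%s\n' \
    '2013-04-03 08:54:19,989 aaa' \
    '2013-04-03 08:54:20,100 bbb' \
    '2013-04-03 08:54:20,350 bbb' \
    '2013-04-03 08:54:20,489 aaa' |
awk '
function to_ms(time,    s) {
    split(time, s, /[:,]/)          # HH:MM:SS,mmm -> 4 pieces
    return (s[1]*3600 + s[2]*60 + s[3])*1000 + s[4]
}
{ stid = $3 }                       # assumption: stid is the third field
(stid in starttime) {
    # End record: use the start time stored for this stid, then forget it.
    print stid, to_ms($2) - starttime[stid]
    delete starttime[stid]
    next
}
{ starttime[stid] = to_ms($2) }     # first sighting: remember the start time
')
printf '%s\n' "$result"
```

With this shape there is no END block at all, results appear in the order the end records arrive, and the starttime array only ever holds stids that are still waiting for their end record.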

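If you can't do that, e.g. because there are multiple records with the same stid and you want the timestamps from the first and last ones, then change the loop in your END section to loop only over the stids for which you got an end time (since you already know those have a corresponding start time), rather than trying to find every stid in those massive nested loops over the start times and end times, e.g.: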

{ stid = whatever }
stid in starttime {
   populate endtime[stid]
   next
}
{ populate starttime[stid] }
END {
   for (stid in endtime) {
      process endtime[stid] - starttime[stid]
   }
}
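
A runnable sketch of this second variant, again assuming a hypothetical three-column input (`date time stid`) in place of the real log format. Here `aaa` has three records, so its end time is overwritten until the last one, and `bbb` has no end record and is therefore never visited in END.

```shell
#!/bin/sh
# Keep first and last timestamps per stid, then loop over endtime only.
result=$(printf '%s\n' \
    '2013-04-03 08:54:19,989 aaa' \
    '2013-04-03 08:54:20,100 bbb' \
    '2013-04-03 08:54:20,489 aaa' \
    '2013-04-03 08:54:22,000 ccc' \
    '2013-04-03 08:54:22,989 aaa' \
    '2013-04-03 08:54:23,500 ccc' |
awk '
function to_ms(time,    s) {
    split(time, s, /[:,]/)          # HH:MM:SS,mmm -> 4 pieces
    return (s[1]*3600 + s[2]*60 + s[3])*1000 + s[4]
}
{ stid = $3 }                       # assumption: stid is the third field
(stid in starttime) { endtime[stid] = to_ms($2); next }   # later records overwrite
{ starttime[stid] = to_ms($2) }                            # first record only
END {
    # Every key in endtime is guaranteed to have a start time.
    for (stid in endtime)
        print stid, endtime[stid] - starttime[stid] | "sort"
    close("sort")
}')
printf '%s\n' "$result"
```

The pipe to sort is only there to make the output order predictable, since the iteration order of `for (stid in endtime)` is unspecified.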

Whichever approach you take, you should see a substantial performance improvement.

score 0 · Accepted Answer

@Ed, the first approach did not give me the expected result. This is what I did:

# end time and diff
(stid in starttime)
{
    etime = to_ms($2)
    diff = etime - stime
    print diff, stid
    delete starttime[stid]
    next
}
# Populate starttime
{
    stime = to_ms($2)
    starttime[stid] = stime
    st_hour[stid] = stime/(60*60*1000)
}

The o/p looks like this: the value on the left is in milliseconds, followed by the stid.

561849 c858591f-e01b-4407-b9f9-48302b65c383
562740 c858591f-e01b-4407-b9f9-48302b65c383
563629 56c71ef3-d952-4261-9711-16b18a32c6ba
564484 56c71ef3-d952-4261-9711-16b18a32c6ba
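
One possible explanation, assuming the script was laid out exactly as posted: in awk a pattern must be on the same line as the opening brace of its action. With `(stid in starttime)` on a line by itself, the pattern becomes its own rule with the default print action, and the block that follows becomes a separate rule that runs for every record. Because that block ends in `next`, the "Populate starttime" rule would never run, so `stime` stays unset and `diff` is simply the time since midnight, printed once per record. Independently of that, `diff` should probably use `starttime[stid]` rather than `stime`, which would only hold the start time of the most recent start record. The toy comparison below (made-up input, not the real log) shows the difference the line break makes:

```shell
#!/bin/sh
# Pattern and brace on separate lines: two rules, the block runs for every record.
split_form=$(printf 'a\nb\n' | awk '
($1 == "b")
{ print "block ran for " $1 }')

# Pattern and brace on the same line: one rule, the block runs only on a match.
joined_form=$(printf 'a\nb\n' | awk '
($1 == "b") { print "block ran for " $1 }')

printf 'split:\n%s\njoined:\n%s\n' "$split_form" "$joined_form"
```

In the split form the bare pattern also prints the matching record `b` itself, which is why an extra line shows up in that output.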
