0

I have a bash script that calls a program which generates a humongous amount of output. A lot of this data is coming from a Python package that I have not created and whose output I can't really control, nor interests me.

I tried to filter the output generated by that external Python package and redirect the "cleaned" output to a log file. If I used regular pipes and grep expressions, I lost many chunks of information. I read that is something that can actually happen with the redirections (1 and 2).

In order to fix that, I made the redirections like this:

#!/bin/bash
regexTxnFilterer="\[txn\.-[[:digit:]]+\]"
regexThreadPoolFilterer="\[paste\.httpserver\.ThreadPool\]"
bin/paster serve --reload --pid-file="/var/run/myServer//server.pid"  parts/etc/debug.ini 2>&1 < "/dev/null" | while IFS='' read -r thingy ; do
        if [[ ! "$thingy" =~ $regexTxnFilterer ]] && [[ ! "$thingy" =~ $regexThreadPoolFilterer ]]; then
                  echo "$thingy" >> "/var/log/myOutput.log" 
        fi
done

Which doesn't lose any information (at least not that I could tell) and filters the strings I don't need (using the two regular expressions above).

The issue is that it has rendered the application (the bin/paster thing I'm executing) unbearably slow. Is there any way to achieve the same effect but with a better performance?

Thank you in advance!

Update @2012-04-13: As shellter pointed out in one of the comments to this question, it may be useful to provide examples of the outputs I want to filter. Here's a bunch of them:

2012-04-13 19:30:37,996 DEBUG [txn.-1220917568] new transaction
2012-04-13 19:30:37,997 DEBUG [txn.-1220917568] commit <zope.sqlalchemy.datamanager.SessionDataManager object at 0xbf4062c>
2012-04-13 19:30:37,997 DEBUG [txn.-1220917568] commit
Starting server in PID 18262.
2012-04-13 19:30:38,292 DEBUG [paste.httpserver.ThreadPool] Started new worker -1269716112: Initial worker pool
2012-04-13 19:33:08,158 DEBUG [txn.-1244144784] new transaction
2012-04-13 19:33:08,158 DEBUG [txn.-1244144784] commit
2012-04-13 19:32:06,980 DEBUG [paste.httpserver.ThreadPool] Added task (0 tasks queued)
2012-04-13 19:32:06,980 INFO [paste.httpserver.ThreadPool] kill_hung_threads status: 10 threads (0 working, 10 idle, 0 starting) ave time N/A, max time 0.00sec, killed 0 workers

There's a few more different messages involving the ThreadPool though, but I couldn't catch any.

4

2 回答 2

3

For one thing -- you're reopening the log file every time you want to append a line. That's silly.

Instead of this:

while ...; do
   echo "foo" >>filename
done

Do this (which opens the output file on a new, non-stdout file handle, such that you still have a clear line to stdout should you wish to write to it):

exec 4>>filename
while ...; do
   echo "foo" >&4
done

It's also possible to redirect stdout for the whole loop:

while ...; do
   echo "foo"
done >filename

...notably, this will impact more than just the "echo" line, and thus have slightly different semantics from the original.


Or, better yet -- Configure the Python logging module to filter output to only what you care about, and don't bother with shell-script postprocessing at all.

If the version of Paste you're using is sufficiently similar to modern Pyramid, you can put this in your ini file (currently parts/etc/debug.ini):

[logger_paste.httpserver.ThreadPool]
level = INFO

[logger_txn]
level = INFO

...and anything below INFO level (including the DEBUG messages) will be excluded.

于 2012-04-16T22:50:22.393 回答
1

It may be faster to use a grep-based solution to this

#!/bin/bash
regexTxnFilterer="\[txn\.-[[:digit:]]+\]"
regexThreadPoolFilterer="\[paste\.httpserver\.ThreadPool\]"
bin/paster serve --reload --pid-file="/var/run/myServer//server.pid"  parts/etc/debug.ini 2>&1 < "/dev/null" | grep -vf <(echo "$regexTxnFilterer"; echo "$regexThreadPoolFilterer") >> "/var/log/myOutput.log"

Your loop may be slow because the echo "$thingy" >> "/var/log/myOutput.log" line is opening and closing the log file every time it executes. I wouldn't expect there to be a big performance difference between grep's regex matching and bash's, but if there was it wouldn't surprise me.


Late Edit

There's a far simpler way to fix the performance issue caused by opening/closing the output once per line. Why this didn't occur to me before, I have no idea. Just move the >> to outside your loop

#!/bin/bash
regexTxnFilterer="\[txn\.-[[:digit:]]+\]"
regexThreadPoolFilterer="\[paste\.httpserver\.ThreadPool\]"
bin/paster serve --reload --pid-file="/var/run/myServer//server.pid"  parts/etc/debug.ini 2>&1 < "/dev/null" | while IFS='' read -r thingy ; do
        if [[ ! "$thingy" =~ $regexTxnFilterer ]] && [[ ! "$thingy" =~ $regexThreadPoolFilterer ]]; then
                  echo "$thingy"
        fi
done  >> "/var/log/myOutput.log"

I can't see any compelling reason why this would be either faster or slower than the grep solution, but it's a lot closer to the original code and a little less cryptic.

于 2012-04-13T23:12:24.060 回答