Clean up
Before adding functionality to the script, the existing script needs to be cleaned up — a lot.
I/O Redirection — Don't Repeat Yourself
When I see wall-to-wall I/O redirections like that, I want to cry — that isn't how you do it! You have three options to avoid all that:
for file in $(find . -name "*.csv" )
do
echo "===================================================="
echo "$file"
echo "----------------------------------------------------"
echo "Contacts BEFORE purge:"
wc -l $file | cut -d " " -f1
echo " "
cat $file | egrep -v "xxx|yyy|zzz" | grep -v -E -i '([0-z])\1{2,}' | uniq | sort -u > tmp_file
mv tmp_file $file ;
echo "Contacts AFTER purge:"
wc -l $file | cut -d " " -f1
done >> db_purge_log.txt
Or:
{
for file in $(find . -name "*.csv" )
do
echo "===================================================="
echo "$file"
echo "----------------------------------------------------"
echo "Contacts BEFORE purge:"
wc -l $file | cut -d " " -f1
echo " "
cat $file | egrep -v "xxx|yyy|zzz" | grep -v -E -i '([0-z])\1{2,}' | uniq | sort -u > tmp_file
mv tmp_file $file ;
echo "Contacts AFTER purge:"
wc -l $file | cut -d " " -f1
done
} >> db_purge_log.txt
Or even:
exec >>db_purge_log.txt # By default, standard output will go to db_purge_log.txt
for file in $(find . -name "*.csv" )
do
echo "===================================================="
echo "$file"
echo "----------------------------------------------------"
echo "Contacts BEFORE purge:"
wc -l $file | cut -d " " -f1
echo " "
cat $file | egrep -v "xxx|yyy|zzz" | grep -v -E -i '([0-z])\1{2,}' | uniq | sort -u > tmp_file
mv tmp_file $file ;
echo "Contacts AFTER purge:"
wc -l $file | cut -d " " -f1
done
The first form is adequate for this script which has a single loop in it to provide I/O redirection to. The second form, using {
and }
would handle more general sequences of commands. The third form, using exec
, is 'permanent'; you can't recover the original standard output, whereas with the {
... }
form you can have different sections of the script writing to different places.
One other advantage of all these variations is that you can trivially send errors to the same place that you're sending standard output if that's what you desire. For example:
exec >>db_purge_log.txt 2>&1
Other issues
Suppressing file name from wc
— instead of:
wc -l $file | cut -d " " -f1
use:
wc -l < $file
UUOC — Useless use of cat
— instead of:
cat $file | egrep -v "xxx|yyy|zzz" | grep -v -E -i '([0-z])\1{2,}' | uniq | sort -u > tmp_file
use:
egrep -v "xxx|yyy|zzz" $file | grep -v -E -i '([0-z])\1{2,}' | uniq | sort -u > tmp_file
UUOU — Useless use of uniq
It is not at all clear why you need uniq
and sort -u
; in context, sort -u
is sufficient, so:
egrep -v "xxx|yyy|zzz" $file | grep -v -E -i '([0-z])\1{2,}' | sort -u > tmp_file
UUOG — Useless use of grep
egrep
is equivalent to grep -E
and both are capable of handling multiple regular expressions, and the second will match what is matched by the expression in the parentheses 3 or more times (we really only need to match three times), so in fact the second expression will do the job of the first. And the [0-z]
match is dubious. It probably matches sundry punctuation characters as well as the upper and lower case digits, but you're already doing a case-insensitive search because of the -i
, so we can regularize all that to:
grep -Eiv '([0-9a-z]){3}' $file | sort -u > tmp_file
File names with spaces
The code is not going to handle file names with spaces, tabs or newlines because of the for file in $(find ...)
notation. It probably isn't necessary to deal with that now — be aware of the issue.
Final clean up
for file in $(find . -name "*.csv" )
do
echo "===================================================="
echo "$file"
echo "----------------------------------------------------"
echo "Contacts BEFORE purge:"
wc -l < $file
echo " "
grep -Evi '([0-9a-z]){3}' | sort -u > tmp_file
mv tmp_file $file
echo "Contacts AFTER purge:"
wc -l <$file
done >> db_purge_log.txt
Add the extra functionality
I want to add a command, somewhere in the middle of this loop, to use another .csv
file as suppression list — meaning that every line found as perfect match in that suppression list should be deleted from $file
.
Since we're already sorting the input files ($file
), we can sort the suppression file (call it suppfile='suppressions.txt'
too if it is not already sorted. Given that, we then use comm
to eliminate the lines that appear in both $file
and $suppfile
. We're interested in the lines that only appear in $file
(or, as will be the case here, in the edited and sorted version of the file), so we want to suppress the common entries and the entries from $suppfile
that do not appear in $file
. The comm -23 - "$suppfile"
command reads the edited, sorted file from standard input -
and leaves out the entries from "$suppfile"
suppfile='suppressions.txt' # Must be in sorted order
for file in $(find . -name "*.csv" )
do
echo "===================================================="
echo "$file"
echo "----------------------------------------------------"
echo "Contacts BEFORE purge:"
wc -l < "$file"
echo " "
grep -Evi '([0-9a-z]){3}' | sort -u | comm -23 - "$suppfile" > tmp_file
mv tmp_file "$file"
echo "Contacts AFTER purge:"
wc -l < "$file"
done >> db_purge_log.txt
If the suppression file is not in sorted order, simply sort it into a temporary file. Beware of using the .csv
suffix on the suppression file in the current directory; it will catch the file and empty it because every line in the suppression file matches a line in the suppression file, which is not helpful for any files processed after the suppression file.
Oops — I over-simplified the grep
regex. It should (probably) be:
grep -Evi '([0-9a-z])\1{2}' $file
The difference is considerable. My original rewrite will look for any three adjacent digits or letters (e.g. 123
or abz
); the revision (actually very similar to one of the original commands) looks for a character from [0-9A-Za-z]
followed by two occurrences of the same character (e.g. 111
or aaa
, but not 123
or abz
).
If perchance the alternatives xxx|yyy|zzz
were really not 3 repeated characters, you might need two invocations of grep
in sequence.