
I often find myself writing simple for loops to perform an operation on many files, for example:

for i in `find . | grep ".xml$"`; do bzip2 $i; done

It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?

EDIT: To give a bit more context to my problem; sorry I was not clearer to start with!

I often want to run simple(ish) scripts, such as plotting a graph, compressing or uncompressing, or running some program, on reasonably sized datasets (usually between 100 and 10,000). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.

For example, just now I am running:

for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done

So my problem is in no way bzip2-specific! (Although parallel bzip2 does look cool; I intend to use it in the future.)

8 Answers

Solution: Use xargs to run in parallel (don't forget the -n option!)

find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
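The same pipeline can also be sized to the machine automatically. A hedged variant (assumes GNU xargs and the coreutils `nproc` command):

```shell
# One bzip2 per file (-n 1), as many jobs in parallel as there are
# CPU cores (-P "$(nproc)"); -print0/-0 keeps odd filenames safe.
find . -name '*.xml' -print0 | xargs -0 -n 1 -P "$(nproc)" bzip2
```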
answered 2008-11-11T21:20:37.717

This Perl program fits your needs fairly well; you would just do this:

runN -n 4 bzip2 `find . | grep ".xml$"`
answered 2008-11-11T19:53:33.447

GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile:

%.xml.bz2 : %.xml
	bzip2 -c $< > $@

all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))

(the recipe line must start with a tab; bzip2 -c keeps the original .xml so make's dependency check still works)

then do a

nice make -j 5

replace '5' with some number, probably one more than the number of CPUs. You might want to 'nice' it, just in case someone else wants to use the machine while you are on it.
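Put together, a self-contained sketch of the make approach (my own variant: it uses make's `wildcard` on the current directory instead of the answer's `find`, and a `bzip2 -c` recipe so the original files survive):

```shell
# Write the Makefile; the recipe line must start with a tab (\t).
printf 'all: $(patsubst %%.xml,%%.xml.bz2,$(wildcard *.xml))\n\n%%.xml.bz2: %%.xml\n\tbzip2 -c $< > $@\n' > Makefile
# Build all .xml.bz2 targets, 5 at a time, at low priority.
nice make -j 5
```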

answered 2008-11-11T20:53:18.040

The answer to the general question is difficult, because it depends on the details of the things you are parallelizing. On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed, or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
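A hedged sketch of driving it from find (assumes pbzip2 is installed; -p4 sets its number of processor threads):

```shell
# pbzip2 splits each file into blocks and compresses them on
# multiple threads, so even one large file keeps all cores busy;
# like bzip2, it replaces each input with a .bz2.
find . -name '*.xml' -exec pbzip2 -p4 {} \;
```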

answered 2008-11-11T19:53:49.173

I find this kind of operation counterproductive. The more processes access the disk at the same time, the longer the read/write times become, so the final result takes longer. The bottleneck here won't be the CPU, no matter how many cores you have.

Have you ever performed two big file copies at the same time on the same HD drive? It is usually faster to copy one and then the other.

I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring the CPU load first before going down the "challenging" path we technicians tend to choose more often than needed.

answered 2008-11-11T20:00:55.837

I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash; you will need to modify it for your purposes though:

#!/bin/bash

# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.

set -m

nodes=`grep processor /proc/cpuinfo | wc -l`
job=($(yes 0 | head -n $nodes | tr '\n' ' '))

isin()
{
  local v=$1

  shift 1
  while (( $# > 0 ))
  do
    if [ $v = $1 ]; then return 0; fi
    shift 1
  done
  return 1
}

dowait()
{
  while true
  do
    nj=( $(jobs -p) )
    if (( ${#nj[@]} < nodes ))
    then
      for (( o=0; o<nodes; o++ ))
      do
        if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
      done
      return;
    fi
    sleep 1
  done
}

let x=0
while (( x < NNN ))
do
  for (( o=0; o<nodes; o++ ))
  do
    if (( job[o] == 0 )); then break; fi
  done

  if (( o == nodes )); then
    dowait;
    continue;
  fi

  CMD &
  let job[o]=$!

  let x++
done

wait
answered 2008-11-11T21:10:00.860

I think you could do the following:

for i in `find . | grep ".xml$"`; do bzip2 "$i" & done

But that would instantly spin off as many processes as you have files, which isn't as optimal as running just four processes at a time.
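A middle ground between one-at-a-time and all-at-once, sketched here as a small function (the name and the batch size are my own): launch jobs in batches of four and wait between batches. Batching is coarser than a true four-wide job pool, but it needs nothing beyond POSIX shell:

```shell
# bzip2 every *.xml in the current directory, at most four
# background jobs at a time.
compress_xml_batched() {
  max_jobs=4
  count=0
  for i in *.xml; do
    [ -e "$i" ] || continue      # glob matched nothing
    bzip2 "$i" &                 # run in the background
    count=$((count + 1))
    if [ "$count" -ge "$max_jobs" ]; then
      wait                       # let the current batch finish
      count=0
    fi
  done
  wait                           # catch the final partial batch
}
```

Swap `bzip2 "$i"` for whatever command or sequence of commands the loop needs.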

answered 2008-11-11T19:46:52.010

If you had to solve the problem today, you would probably use a tool like GNU Parallel (unless there is a specialized parallelized tool for your task, like pbzip2):

find . | grep ".xml$" | parallel bzip2

answered 2014-03-05T22:48:59.700