
I often find myself writing simple for loops to perform an operation on many files, for example:

for i in `find . | grep ".xml$"`; do bzip2 $i; done

It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?

EDIT: To give a bit more context to my problem; sorry I was not clearer to start with!

I often want to run simple(ish) scripts, such as plotting a graph, compressing or uncompressing, or running some program, on reasonably sized datasets (usually between 100 and 10,000). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.

For example, just now I am running:

for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done

So my problem is in no way bzip2-specific! (Although parallel bzip2 does look cool; I intend to use it in the future.)

8 Answers

Solution: Use xargs to run in parallel (don't forget the -n option!)

find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
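The same pipeline can also be sized to the machine automatically. A hedged variant (assumes GNU xargs and the coreutils `nproc` command):

```shell
# One bzip2 per file (-n 1), as many jobs in parallel as there are
# CPU cores (-P "$(nproc)"); -print0/-0 keeps odd filenames safe.
find . -name '*.xml' -print0 | xargs -0 -n 1 -P "$(nproc)" bzip2
```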
answered 2008-11-11T21:20:37.717

This Perl program fits your needs fairly well; you would just do this:

runN -n 4 bzip2 `find . | grep ".xml$"`
answered 2008-11-11T19:53:33.447

GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile:

%.xml.bz2 : %.xml
	bzip2 -c $< > $@

all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))

(the recipe line must start with a tab; bzip2 -c keeps the original .xml so make's dependency check still works)

then do a

nice make -j 5

replace '5' with some number, probably one more than the number of CPUs. You might want to 'nice' it, just in case someone else wants to use the machine while you are on it.
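Put together, a self-contained sketch of the make approach (my own variant: it uses make's `wildcard` on the current directory instead of the answer's `find`, and a `bzip2 -c` recipe so the original files survive):

```shell
# Write the Makefile; the recipe line must start with a tab (\t).
printf 'all: $(patsubst %%.xml,%%.xml.bz2,$(wildcard *.xml))\n\n%%.xml.bz2: %%.xml\n\tbzip2 -c $< > $@\n' > Makefile
# Build all .xml.bz2 targets, 5 at a time, at low priority.
nice make -j 5
```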

answered 2008-11-11T20:53:18.040

The answer to the general question is difficult, because it depends on the details of the things you are parallelizing. On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed, or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
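A hedged sketch of driving it from find (assumes pbzip2 is installed; -p4 sets its number of processor threads):

```shell
# pbzip2 splits each file into blocks and compresses them on
# multiple threads, so even one large file keeps all cores busy;
# like bzip2, it replaces each input with a .bz2.
find . -name '*.xml' -exec pbzip2 -p4 {} \;
```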

answered 2008-11-11T19:53:49.173

I find this kind of operation counterproductive. The more processes access the disk at the same time, the longer the read/write times become, so the final result takes longer. The bottleneck here won't be the CPU, no matter how many cores you have.

Have you ever performed two big file copies at the same time on the same HD drive? It is usually faster to copy one and then the other.

I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring the CPU load first before going down the "challenging" path we technicians tend to choose more often than needed.

answered 2008-11-11T20:00:55.837

I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash; you will need to modify it for your purposes though:

#!/bin/bash

# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.

set -m

nodes=`grep processor /proc/cpuinfo | wc -l`
job=($(yes 0 | head -n $nodes | tr '\n' ' '))

isin()
{
  local v=$1

  shift 1
  while (( $# > 0 ))
  do
    if [ $v = $1 ]; then return 0; fi
    shift 1
  done
  return 1
}

dowait()
{
  while true
  do
    nj=( $(jobs -p) )
    if (( ${#nj[@]} < nodes ))
    then
      for (( o=0; o<nodes; o++ ))
      do
        if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
      done
      return;
    fi
    sleep 1
  done
}

let x=0
while (( x < NNN ))
do
  for (( o=0; o<nodes; o++ ))
  do
    if (( job[o] == 0 )); then break; fi
  done

  if (( o == nodes )); then
    dowait;
    continue;
  fi

  CMD &
  let job[o]=$!

  let x++
done

wait
answered 2008-11-11T21:10:00.860

I think you could do the following:

for i in `find . | grep ".xml$"`; do bzip2 "$i" & done

But that would instantly spin off as many processes as you have files, which isn't as optimal as running just four processes at a time.
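A middle ground between one-at-a-time and all-at-once, sketched here as a small function (the name and the batch size are my own): launch jobs in batches of four and wait between batches. Batching is coarser than a true four-wide job pool, but it needs nothing beyond POSIX shell:

```shell
# bzip2 every *.xml in the current directory, at most four
# background jobs at a time.
compress_xml_batched() {
  max_jobs=4
  count=0
  for i in *.xml; do
    [ -e "$i" ] || continue      # glob matched nothing
    bzip2 "$i" &                 # run in the background
    count=$((count + 1))
    if [ "$count" -ge "$max_jobs" ]; then
      wait                       # let the current batch finish
      count=0
    fi
  done
  wait                           # catch the final partial batch
}
```

Swap `bzip2 "$i"` for whatever command or sequence of commands the loop needs.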

answered 2008-11-11T19:46:52.010

If you had to solve the problem today, you would probably use a tool like GNU Parallel (unless there is a specialized parallelized tool for your task, like pbzip2):

find . | grep ".xml$" | parallel bzip2

answered 2014-03-05T22:48:59.700