
I have a bit of a problem with fairly large csv files. I can write simple bash/awk scripts, but this problem is beyond my limited awk/bash programming experience.

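The problem:

  • All my files are in one folder. The folder contains an even number of csv files that need to be trimmed in pairs (explained below). The files are named as follows: f1L, f1R, f2L, f2R, f3L, f3R, ..., fnL, fnR.

  • The files need to be read in pairs, i.e. f1L with f1R, f2L with f2R, and so on.

  • Each file has two comma-separated fields. The start and end of f1L and f1R look like this: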

f1L (START)
1349971210, -0.984375 
1349971211, -1.000000 

f1R (START) 
1349971206, -0.015625
1349971207, 0.000000

f1L (END)
1350230398, 0.500000
1350230399, 0.515625

f1R (END) 
1350230402, 0.484375
1350230403, 0.515625

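What I would like to do with awk is:

  1. Read field 1 of record 1 of f1L (i.e. 1349971210), then read field 1 of record 1 of f1R (i.e. 1349971206). Then take the maximum of the two values (i.e. x1 = 1349971210).
  2. Read field 1 of the last record of f1L (i.e. 1350230399), then read field 1 of the last record of f1R (i.e. 1350230403). Then take the minimum (i.e. x2 = 1350230399).
  3. Then extract all lines of f1L and f1R whose field 1 is greater than or equal to x1 and less than or equal to x2, and re-save them under the same names.
  4. Repeat the process for all pairs in my directory.

I was wondering whether any of you have suggestions for a small bash/awk script to get the job done.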
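
For illustration, steps 1 and 2 for a single pair can be sketched as follows. This is only a sketch: the function name overlap_bounds is made up here, and it assumes that both files are non-empty and end with a data row rather than a blank line.

```shell
# overlap_bounds L R: print "x1 x2" for one pair of files.
# x1 = larger of the two first timestamps, x2 = smaller of the two last ones.
overlap_bounds() {
    {
        head -n 1 "$1"; head -n 1 "$2"   # records 1-2: first row of L, then R
        tail -n 1 "$1"; tail -n 1 "$2"   # records 3-4: last row of L, then R
    } | awk -F, '
        NR == 1 { x1 = $1 }
        NR == 2 { if ($1 + 0 > x1 + 0) x1 = $1 }
        NR == 3 { x2 = $1 }
        NR == 4 { if ($1 + 0 < x2 + 0) x2 = $1 }
        END     { print x1, x2 }
    '
}
```

With the sample rows shown above, `overlap_bounds f1L f1R` prints `1349971210 1350230399`.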


3 Answers


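I tried to include all necessary sanity checks and to minimize disk I/O (assuming your files are large enough that reading them is the time-limiting factor). Also, the files never have to be read into memory as a whole (assuming your files may be larger than the available RAM).

However, this was only tried with very basic dummy input, so please test it and report any problems.

First, I wrote a script that trims one pair (identified by the f...L file name):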

#!/bin/bash

#############    
# trim_pair #
#-----------#############################
# given fXL file path, trim fXL and fXR #
#########################################

#---------------# 
# sanity checks #
#---------------#

# error function
error(){
 echo >&2 "$@"
 exit 1
}

# argument given?
[[ $# -eq 1 ]] || \
 error "usage: $0 <file>"
LFILE="$1"

# argument format valid?
[[ `basename "$LFILE" | grep -E '^f[[:digit:]]+L$'` ]] || \
 error "invalid file name: $LFILE (has to match /^f[[:digit:]]+L$/)"
RFILE="${LFILE%L}R" # POSIX parameter expansion: swap the trailing L for R

# files exists?
[[ -e "$LFILE" ]] || \
 error "file does not exist: $LFILE"
[[ -e "$RFILE" ]] || \
 error "file does not exist: $RFILE"

# files readable?
[[ -r "$LFILE" ]] || \
 error "file not readable: $LFILE"
[[ -r "$RFILE" ]] || \
 error "file not readable: $RFILE"

# files writable?
[[ -w "$LFILE" ]] || \
 error "file not writable: $LFILE"
[[ -w "$RFILE" ]] || \
 error "file not writable: $RFILE"

#------------------#
# create tmp files #
# & ensure removal #
#------------------#

# cleanup function
cleanup(){
 [[ -e "$LTMP" ]] && rm -- "$LTMP"
 [[ -e "$RTMP" ]] && rm -- "$RTMP"
}

# cleanup on exit
trap 'cleanup' EXIT

#create tmp files
LTMP=`mktemp --tmpdir` || \
 error "tmp file creation failed"
RTMP=`mktemp --tmpdir` || \
 error "tmp file creation failed"

#----------------------#
# process both files   #
# prepended by their   #
# first and last lines #
#----------------------#

# extract first and last lines without reading the whole files twice
{
 head -q -n1 "$LFILE" "$RFILE"  # no need to read the whole files
 tail -q -n1 "$LFILE" "$RFILE"  # no need to read the whole files
} | awk -F, '
 NF!=2{
  print "incorrect file format: record "FNR" in file "FILENAME > "/dev/stderr"
  exit 1    
 }
 FILENAME!="-"&&NR<5{           # a file started before 4 records came from stdin
  print "too few lines in input" > "/dev/stderr"
  exit 1
 }
 NR==1{                         # read record 1,
  x1=$1                         # field 1 of L,
  next                          # then read
 }
 NR==2{                         # record 1 of R,
  x1=$1>x1?$1:x1                # field 1 & take the max,
  next                          # then
 }
 NR==3{                         # read last record,
  x2=$1                         # field 1 of L,
  next                          # then
 }
 NR==4{                         # last record of R
  x2=$1<x2?$1:x2                # field 1 & take the min
  next
 }
 FNR==1{
  outfile=FILENAME~/L$/?"'"$LTMP"'":"'"$RTMP"'"
 }
 $1>=x1&&$1<=x2{
  print > outfile
 }
' - "$LFILE" "$RFILE" || \
 error "error while trimming"

#-----------------------#
# re-save trimmed files #
# under the same names  #
#-----------------------#

mv -- "$LTMP" "$LFILE" || \
 error "cannot re-save $LFILE"
mv -- "$RTMP" "$RFILE" || \
 error "cannot re-save $RFILE"

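As you can see, the main idea is to prepend the input with the important lines using head and tail, and then process everything according to your requirements with awk.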
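
The same pattern in miniature, as a dry run that only prints what would be kept and touches nothing on disk. This is a sketch rather than a replacement for the script above: the function name trim_preview is hypothetical, and it assumes both files are non-empty and end with a data row.

```shell
# trim_preview L R: dry run that prints the rows both files would keep,
# each prefixed with its file name; the files themselves are not modified.
trim_preview() {
    {
        head -n 1 "$1"; head -n 1 "$2"   # stdin records 1-2: first rows
        tail -n 1 "$1"; tail -n 1 "$2"   # stdin records 3-4: last rows
    } | awk -F, '
        NR == 1 { x1 = $1; next }
        NR == 2 { if ($1 + 0 > x1 + 0) x1 = $1; next }   # x1 = max of first rows
        NR == 3 { x2 = $1; next }
        NR == 4 { if ($1 + 0 < x2 + 0) x2 = $1; next }   # x2 = min of last rows
        $1 + 0 >= x1 + 0 && $1 + 0 <= x2 + 0 { print FILENAME ": " $0 }
    ' - "$1" "$2"
}
```

Because awk reads the operand `-` (standard input) before the two files, the four boundary rows arrive as records 1 to 4 and both bounds are known before the first real row is tested.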

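To call the trim_pair script for all files in some directory, you could use the following script (not as elaborate as the one above, but I guess you could come up with something like it yourself):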

#!/bin/bash

############
# trim all #
#----------###################################
# find L files in current or given directory #
# and trim the corresponding file pairs      #
##############################################

TRIM_PAIR="trim_pair"   # path to the trim script for one pair

if [[ $# -eq 1 ]]
then
 WD="$1"
else
 WD="`pwd`"
fi

find "$WD"                         \
 -type f                           \
 -readable                         \
 -writable                         \
 -regextype posix-egrep            \
 -regex "^$WD/"'f[[:digit:]]+L'    \
 -exec "$TRIM_PAIR" "{}" \;

Note that you must have the trim_pair script in your PATH, or adjust the TRIM_PAIR variable in the trim_all script.
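
If your find lacks the GNU-only -regextype, -readable and -writable options, a plain shell loop is one possible alternative. A minimal sketch, with assumptions: the function name trim_all_pairs is hypothetical, and it reads the command to run from a TRIM_PAIR environment variable (falling back to trim_pair), which is a convention chosen here for illustration.

```shell
# trim_all_pairs [DIR]: run the pair-trimming command on every fNL in DIR
# (default: current directory) that has a matching fNR next to it.
trim_all_pairs() {
    dir=${1:-.}
    for l in "$dir"/f*L; do
        [ -f "$l" ] || continue              # glob matched nothing
        num=${l##*/}                          # file name without directory
        num=${num#f}; num=${num%L}            # what is left between f and L
        case $num in
            ''|*[!0-9]*) continue ;;          # not of the form f<digits>L
        esac
        r="${l%L}R"
        if [ -f "$r" ]; then
            "${TRIM_PAIR:-trim_pair}" "$l"
        else
            echo "skipping $l: no matching $r" >&2
        fi
    done
}
```

Unlike the find version, this does not test readability or writability up front; trim_pair performs those checks itself.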

answered 2013-07-06T22:55:05.333

A naive way of achieving that in bash. Not looking for efficiency here at all. No error checking (well, only the mandatory minimum).

Name this script myscript. It will take two parameters (files fxL and fxR).

#!/bin/bash

tmp=''

die() {
    echo >&2 "$@"
    exit 1
}

on_exit() {
    [[ -f $tmpL ]] && rm -- "$tmpL"
    [[ -f $tmpR ]] && rm -- "$tmpR"
}

last_non_blank_line() {
   sed -n -e $'/^$/ !h\n$ {x;p;}' "$1"
}

(($#==2)) || die "script takes two arguments"

fL=$1
fR=$2

[[ -r "$fL" && -w "$fL" ]] || die "problem with file \`$fL'"
[[ -r "$fR" && -w "$fR" ]] || die "problem with file \`$fR'"

# read record1, line1 of fL and fR
IFS=, read min _ < "$fL"
[[ $min =~ ^[[:digit:]]+$ ]] || die "first line of \`$fL' has a bad record"
IFS=, read t _ < "$fR"
[[ $t =~ ^[[:digit:]]+$ ]] || die "first line of \`$fR' has a bad record"
((t>min)) && ((min=t))

# read record1, last line of fL and fR
IFS=, read max _ < <( last_non_blank_line "$fL")
[[ $max =~ ^[[:digit:]]+$ ]] || die "last line of \`$fL' has a bad record"
IFS=, read t _ < <(last_non_blank_line "$fR")
[[ $t =~ ^[[:digit:]]+$ ]] || die "last line of \`$fR' has a bad record"
((t<max)) && ((max=t))

# create tmp files
tmpL=$(mktemp --tmpdir) || die "can't create tmp file"
tmpR=$(mktemp --tmpdir) || die "can't create tmp file"

trap 'on_exit' EXIT

# Read fL line by line, and only keep those
# the first record of which is between min and max
while IFS=, read a b; do
    [[ $a =~ ^[[:digit:]]+$ ]] && ((a<=max)) && ((a>=min)) && echo "$a,$b"
done < "$fL" > "$tmpL"
mv -- "$tmpL" "$fL"

# Same with fR:
while IFS=, read a b; do
    [[ $a =~ ^[[:digit:]]+$ ]] && ((a<=max)) && ((a>=min)) && echo "$a,$b"
done < "$fR" > "$tmpR"
mv -- "$tmpR" "$fR"

and call it as:

$ myscript f1L f1R

Use it on scratch files first! No warranty! Use at your own risk!

Caveat. As the script uses arithmetic for comparisons, it is assumed that the first field of each line in each file is an integer in the range that bash handles.
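
A guard for this caveat could look like the sketch below. The helper name fits_shell_int is hypothetical, and the 18-digit limit assumes the shell does its arithmetic in signed 64-bit integers, which is typical for bash but not guaranteed everywhere. It also rejects leading zeros, because bash arithmetic reads constants such as 010 as octal.

```shell
# fits_shell_int VALUE: succeed only for plain decimal integers that are safe
# to compare with shell arithmetic (assumption: signed 64-bit integers).
fits_shell_int() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, signed, fractional, or non-numeric
        0?*)         return 1 ;;   # leading zero would be parsed as octal
    esac
    [ "${#1}" -le 18 ]             # any number of up to 18 digits fits in 64 bits
}
```

A call such as `fits_shell_int "$a" || die "bad value"` could then stand in for the `^[[:digit:]]+$` tests in the script above.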


Edit. Since your first fields are floats, you can't use the method above, which relies on bash arithmetic. A fun way around this is to have bash do all the necessary operations (get first line, last line, open files, …) and use bc for the arithmetic part. With this, you won't be limited at all by the size of the numbers (bc uses arbitrary precision), and floats are welcome! For example:

#!/bin/bash

tmp=''

die() {
    echo >&2 "$@"
    exit 1
}

on_exit() {
    [[ -f $tmpL ]] && rm -- "$tmpL"
    [[ -f $tmpR ]] && rm -- "$tmpR"
}

last_non_blank_line() {
   sed -n -e $'/^$/ !h\n$ {x;p;}' "$1"
}

(($#==2)) || die "script takes two arguments"

fL=$1
fR=$2

[[ -r "$fL" && -w "$fL" ]] || die "problem with file \`$fL'"
[[ -r "$fR" && -w "$fR" ]] || die "problem with file \`$fR'"

# read record1, line1 of fL and fR
IFS=, read a _ < "$fL"
IFS=, read b _ < "$fR"
min=$(bc <<< "if($b>$a) { print \"$b\" } else { print \"$a\" }" 2> /dev/null)
[[ -z $min ]] && die "problem in first line of files \`$fL' or \`$fR'"

# read record1, last line of fL and fR
IFS=, read a _ < <( last_non_blank_line "$fL")
IFS=, read b _ < <(last_non_blank_line "$fR")
max=$(bc <<< "if($b<$a) { print \"$b\" } else { print \"$a\" }" 2> /dev/null)
[[ -z $max ]] && die "problem in last line of files \`$fL' or \`$fR'"

# create tmp files
tmpL=$(mktemp --tmpdir) || die "can't create tmp file"
tmpR=$(mktemp --tmpdir) || die "can't create tmp file"

trap 'on_exit' EXIT

# Read fL line by line, and only keep those
# the first record of which is between min and max
while read l; do
    [[ $l =~ ^[[:space:]]*$ ]] && continue
    r=${l%%,*}
    printf "if(%s>=$min && %s<=$max) { print \"%s\n\" }\n" "$r" "$r" "$l"
done < "$fL" | bc > "$tmpL" || die "Error in bc while doing file \`$fL'"

# Same with fR:
while read l; do
    [[ $l =~ ^[[:space:]]*$ ]] && continue
    r=${l%%,*}
    printf "if(%s>=$min && %s<=$max) { print \"%s\n\" }\n" "$r" "$r" "$l"
done < "$fR" | bc > "$tmpR" || die "Error in bc while doing file \`$fR'"

mv -- "$tmpL" "$fL"
mv -- "$tmpR" "$fR"
answered 2013-06-24T18:38:58.847

Using Perl:

use warnings;
use strict;

my $dir = $ARGV[0];  # directory is argument
for my $file (glob "$dir/f[0-9]*L") {
    # derive the matching R file by swapping the trailing L for R
    (my $rfile = $file) =~ s/L$/R/;
    my ($dL, $dR) = (loadfile($file), loadfile($rfile));
    # lower bound: larger of the two first timestamps (field 1);
    # upper bound: smaller of the two last timestamps
    my ($min, $max) = (max($dL->[0][0], $dR->[0][0]),
                       min($dL->[-1][0], $dR->[-1][0]));
    trimfile($file, $dL, $min, $max);
    trimfile($rfile, $dR, $min, $max);
}

sub loadfile {
    my ($fname, @d) = (shift);
    open(my $fh, "<", $fname) or die "$fname: $!";
    while (<$fh>) {
        chomp;
        next unless /\S/;                  # skip blank lines
        push @d, [ split /[, ]+/ ];        # [timestamp, value]
    }
    close $fh;
    return \@d;
}

sub trimfile {
    my ($fname, $data, $min, $max) = @_;
    open(my $fh, ">", $fname) or die "$fname: $!";
    for (@$data) {
        # keep only rows whose timestamp lies within [$min, $max]
        print $fh "$_->[0], $_->[1]\n" if $_->[0] >= $min && $_->[0] <= $max;
    }
    close $fh;
}

sub min { my ($x, $y) = @_; return $x < $y ? $x : $y; }
sub max { my ($x, $y) = @_; return $x > $y ? $x : $y; }
answered 2013-07-06T23:37:24.797