bash - How to compare numeric fields in two files with awk

Question

I have these two files: file1

and file2:

1070,1279960511,BR,USA,UNITED STATES
1278,1279960511,US,USA,UNITED STATES
1279,1279960527,CA,CAN,CANADA
1289,1279967231,US,USA,UNITED STATES
2679,1279971327,CA,CAN,CANADA
1279,1279971839,US,USA,UNITED STATES
1279,1279972095,CA,CAN,CANADA
1279,1279977471,US,USA,UNITED STATES
127997,1279977983,CA,CAN,CANADA
127997,1279980159,US,USA,UNITED STATES
127998,1279980543,CA,CAN,CANADA
107599,1075995007,US,USA,UNITED STATES
107599,1075995023,VG,VGB,VIRGIN ISLANDS, BRITISH
107599,1075996991,US,USA,UNITED STATES
107599,1075997071,CA,CAN,CANADA

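What I want: for each entry in file1, go through the first column of file2, and as soon as the value in that column is greater than the file1 entry, return the third field of file2. I have tried many approaches, but none of them worked: I either get an empty file or output that differs from what I expect. My last attempt was: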

awk -F, '
BEGIN {FS="," ; i=1 ; while (getline < "file2") { x[i] = $1 ; y[i] = $3 ; i++ }}

{ a[$1] = $1 ; h=1 ; while (x[h] <= a[$1]) { h++ } ; { print y[h] }}' file1

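But this runs forever: it never stops and never gives me any output. Please help, this has been killing me for days and I have given up. Thanks.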

Expected output:

# This is a comment; I will write file2 as if it were a matrix.

Because file1[1] > file2[1,1] ... and file1[1] > file2[2,1] ... and file1[1] > file2[3,1] ... and file1[1] > file2[4,1] but file1[1] < file2[5,1] ... print file2[4,3] ... which is "US".

Now go to file1[2]:

file1[2] > file2[1,1] ... and file1[2] > file2[2,1] ... but file1[2] <= file2[3,1] ... so print file2[3,3].

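In short, I want to print: "the third element (column) of the first line (from file2) at which the file1 element first becomes > the first element of the next line (file2)".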

score 2 · Accepted Answer

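I used your AWK script as the basis for the following. I changed the variable names to make them more meaningful, since that helps make the script self-documenting.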

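#!/usr/bin/awk -f
BEGIN {
    FS = ","
    count = 0
    # Load file2 (sorted numerically on column 1) into two parallel arrays.
    # Testing "> 0" ends the loop at EOF and also on a read error (-1).
    while ((getline < "file2") > 0) {
        count++
        key[count] = $1 + 0        # force a numeric key
        countrycode[count] = $3
    }
    close("file2")
}

# For each file1 value, print the country code of the first file2 row
# whose key is larger, then move on to the next file1 line.
{
    for (idx = 1; idx <= count; idx++) {
        if ($1 < key[idx]) {
            print countrycode[idx]
            next
        }
    }
}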

Example run (this run printed the whole matching line, $0, instead of only $3; the code above prints only $3):

$ sort -n -k1,1 -t, file2 > tmp; mv tmp file2
$ ./scannums file1
2679,1279971327,CA,CAN,CANADA
1289,1279967231,US,USA,UNITED STATES
1278,1279960511,US,USA,UNITED STATES
127997,1279977983,CA,CAN,CANADA
2679,1279971327,CA,CAN,CANADA
1278,1279960511,US,USA,UNITED STATES
1278,1279960511,US,USA,UNITED STATES
1289,1279967231,US,USA,UNITED STATES
127997,1279977983,CA,CAN,CANADA

Note that nothing is printed for the value 135441 in file1, because no line in file2 satisfies the condition.
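
This case may also explain why the loop in the question never ends, as far as I can tell: its while loop has no upper bound, so for a value such as 135441 that is larger than every key, h runs past the last loaded element. An unset awk array element compares as 0 in a numeric comparison, so the condition stays true forever. A minimal sketch of the effect, with a cap of 1000 added purely so that the demonstration terminates (the function name demo_unbounded, the two keys and the cap are illustrative only):

```shell
# Illustrative only: shows the loop condition staying true past the end of x.
demo_unbounded() {
    awk 'BEGIN {
        x[1] = 1070; x[2] = 1278   # only two keys are loaded
        v = 135441                 # larger than every key
        h = 1
        # x[h] is unset once h > 2 and compares as 0, so "x[h] <= v"
        # alone never becomes false; the cap on h ends the demo.
        while (x[h] <= v && h < 1000) h++
        print h
    }'
}

demo_unbounded   # prints 1000: the loop only stopped because of the cap
```

Bounding the loop by the number of loaded rows, as the for loop in the script above does, avoids the problem.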

You could turn this into a one-liner if you prefer.
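
As a rough sketch of what such a one-liner might look like, the common NR == FNR two-file idiom can stand in for the getline loop. The function name first_greater, the scratch directory and the three file1 values (1000, 1281, 5000) are made up for illustration; the file2 rows are a subset of the question's data, already sorted on the first column.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1000 1281 5000 > "$workdir/file1"
printf '%s\n' \
    '1070,1279960511,BR,USA,UNITED STATES' \
    '1278,1279960511,US,USA,UNITED STATES' \
    '1289,1279967231,US,USA,UNITED STATES' \
    '2679,1279971327,CA,CAN,CANADA' > "$workdir/file2"

# first_greater VALUES KEYS: for each value, print it together with the
# third field of the first KEYS row whose first field is larger.
first_greater() {
    awk -F, '
        # First file on the command line (KEYS): remember key and code.
        NR == FNR { key[++n] = $1 + 0; code[n] = $3; next }
        # Second file (VALUES): stop at the first larger key.
        { for (i = 1; i <= n; i++)
              if ($1 + 0 < key[i]) { print $1, code[i]; next } }
    ' "$2" "$1"
}

first_greater "$workdir/file1" "$workdir/file2"
# prints:
# 1000 BR
# 1281 US
```

Nothing is printed for 5000, since no key in the sample is larger. Note that the NR == FNR test assumes the keys file is not empty.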

score 2 · Accepted Answer

Does this work?

sort -n -t"," -k1,1 file1 file2 | awk -F"," '{if ($3 != "") {s = $3;} else {print $1 " " s;}}'

This produces:

1075 BR
1169 BR
1260 BR
1279 US
1281 US
1474 US
2537 US
10759 CA
12799 CA
135441 CA

If the original order of file1 matters, you can use the following:

awk '{print NR "," $1}' file1 file2 | sort -t"," -n -k 2,2 | awk -F"," '{if ($4 != "") {s = $4;} else {print $1 " " s;}}' | sort -t"," -k1,1 | cut -d" " -f2

which produces:

US
CA
BR
BR
US
CA
US
BR
CA
US

score 1 · Accepted Answer

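Couldn't you just use xargs for the "read file1" part of the job? The "find the value in file2" part on its own is very simple in awk, and you avoid juggling two file pointers...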

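Edit: an example using xargs and awk (a first attempt; `return` is not valid outside an awk function, so use the corrected versions further down).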

cat file1 | xargs awk '$1 > ARGV[2] {print $3; return}' file2

Edit: this example works (I have now tried it on my machine...)

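Use -n 1 as an option to xargs so that only one argument is passed per invocation. The "val" argument is dropped from the argument list once it has been stored, so awk only sees the file name (file2) and knows what to do with it. A flag records when a match has been found, because `return` does not exist at that level.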

cat file1 | xargs -n 1 awk -F, 'BEGIN {val = ARGV[2]; ARGC--; found=0} $1 > val {if (found==0) { print val, $3; found = 1}}' file2

Edit: a shorter version

cat file1 | xargs -n 1 awk -F, 'BEGIN {val = ARGV[2]; ARGC--} (!found) && ($1 > val)  {print val, $3; found = 1}' file2
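
To see what the BEGIN block in the commands above does with its arguments, here is a small sketch that only prints them. The function name show_args and the value 1281 are illustrative; because the program consists of a BEGIN block alone, awk never opens file2, so the file does not need to exist for this demonstration.

```shell
# show_args VALUE: mimic the argument layout xargs produces,
# i.e. awk PROGRAM file2 VALUE.
show_args() {
    awk 'BEGIN {
        val = ARGV[2]   # ARGV[0] is "awk", ARGV[1] is "file2", ARGV[2] is VALUE
        ARGC--          # hide VALUE so it is not treated as an input file
        print "value:", val, "files left:", ARGC - 1
    }' file2 "$1"
}

show_args 1281   # prints: value: 1281 files left: 1
```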

Script version:

#!/usr/bin/awk -f
BEGIN {
  FS = ","        # the one-liners pass -F, so the script must set it too
  val = ARGV[2]
  ARGC--          # drop the value so awk only reads the file
}
(!found) && ($1 <= val) {
  # cache 3rd column of previous line
  prev = $3
}
(!found) && ($1 > val) {
  # print cached value as soon as we cross the limit
  print val, prev
  found = 1
}

Name it find_val.awk and chmod +x it. You can then run it as find_val.awk somefile somevalue, and use it with xargs in the same way:

cat file1 | xargs -n 1 find_val.awk file2

score 1 · Accepted Answer

Long one-liner:

Here is one way you can do this:

cat file1|grep -vE '^$'|while read min; do cat file2|while read line; do val=$(echo $line|cut -d, -f1); if [ $min -lt $val ]; then short_country=$(echo $line|cut -d, -f3); echo $min: $short_country "($val)"; break; fi; done; done

This produces the output

2537: CA (2679)
1279: US (1289)
1075: US (1278)
12799: CA (127997)
1474: CA (2679)
1260: US (1278)
1169: US (1278)
1281: US (1289)
10759: CA (127997)

Explanation

It is easier to understand if you break it out into a script instead of keeping it as a one-liner:

#!/bin/bash

cat file1 |                               # read file1
grep -E '^[0-9]+$' |                      # filter out lines in file1 that don't just contain a number
while read min; do                        # for each line in file1:
  cat file2 |                               # read file2
  grep -E '^([0-9]+,){2}[A-Z]{2},' |        # filter out lines in file2 that don't match the right format
  while read line; do                       # for each line in file2:
    val=$(echo $line|cut -d, -f1)             # pull out $val: the first comma-delimited value
    if [ $min -lt $val ]; then                # if it's greater than the $min value read from file1:
      short_country=$(echo $line|cut -d, -f3)   # get the $short_country from the third comma-delimited value in file2
      echo "$min: $short_country ($val)"        # print it to stdout. You can get rid of ($val) here if you're not interested in it.
      break                                     # Now that we've found a value in file2, stop this loop and go to the next line in file1
    fi
  done
done

Since you did not originally specify your output format, I had to guess. Hopefully this works for you.
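
For comparison, here is a sketch of the same nested loop that lets read split the fields itself, so no cat, echo or cut process is started per line. It assumes every file2 line is well formed (the grep format filter from the script above is omitted), and the function name scan_files, the scratch directory and the file1 values are hypothetical sample data.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1000 1281 5000 > "$workdir/file1"
printf '%s\n' \
    '1070,1279960511,BR,USA,UNITED STATES' \
    '1278,1279960511,US,USA,UNITED STATES' \
    '1289,1279967231,US,USA,UNITED STATES' \
    '2679,1279971327,CA,CAN,CANADA' > "$workdir/file2"

# scan_files VALUES KEYS: same output format as the script above.
scan_files() {
    while read -r min; do
        # skip empty lines and anything that is not a plain number
        case $min in ''|*[!0-9]*) continue ;; esac
        # IFS=, makes read split on commas: 1st field, 2nd, 3rd, remainder
        while IFS=, read -r val _stamp code _rest; do
            if [ "$min" -lt "$val" ]; then
                echo "$min: $code ($val)"
                break   # first larger key found; go to the next value
            fi
        done < "$2"
    done < "$1"
}

scan_files "$workdir/file1" "$workdir/file2"
# prints:
# 1000: BR (1070)
# 1281: US (1289)
```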
