bash - BASH: Filter a huge list of numbers if they are contained in another huge list

Question

Suppose I have two CSV files. The first has the format:

id(unique int),owner_id(non-unique int),string

It contains 50-100 million rows and is a few GB in size.

The second has the format:

integer,integer

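The second file contains a billion rows. I want to get all lines of file 2 where both the first and the second column values exist somewhere in the second column (owner_id) of the first file.

The most efficient way would be to load the unique owner_id values into memory, sort them, and binary-search them for each pair in the second file. I don't know whether something like this can be done in BASH. I could do it in Python (a simple script that takes the two files, reads and loads them, and prints the second file with only the valid pairs).

However, I would prefer not to add a dependency on Python if possible.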

score 1 · Accepted Answer

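This may fail due to memory constraints. I called the file with 3 columns file1 and the file with the IDs file2. Copy and paste the snippets into a file and edit the names as needed.

Step one: make file 1 as small as possible.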

#!/bin/bash
declare -a Array
Count=0

The first and third columns are not needed, so drop them, sort the file, and keep only the unique entries.

InitFile ()
{
   # Keep only the second column (owner_id), then sort numerically and drop duplicates.
   while IFS=, read -r ignore1 stuff ignore2; do echo "$stuff"; done < file1 | sort -n | uniq > "$1"
}
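As a possible shortcut, the same list of unique owner IDs can probably be produced without a shell read loop, because `cut` can extract the second column directly. The sketch below is not part of the original answer: the function name `unique_owner_ids` is made up for illustration, and it assumes the first two fields never contain quoted commas.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): write the sorted,
# de-duplicated second column of a CSV file to an output file.
# Assumption: the first two fields contain no quoted commas.
unique_owner_ids() {
    local csv=$1 out=$2
    # cut extracts field 2; sort -n orders numerically and -u drops duplicates.
    cut -d, -f2 "$csv" | LC_ALL=C sort -n -u > "$out"
}
```

It would be called as `unique_owner_ids file1 uniqueOwnerIdFile`. On inputs of this size, letting `cut` and `sort` do the work is likely to be far faster than reading line by line in bash.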

Read it into an array:

InitArray ()
{
   # Each line becomes one array element; Count ends up as the number of lines read.
   while read -r "Array[$Count]"; do
     let Count++
   done < "$1"
}
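If bash 4 or newer is available, the builtin `mapfile` may be a simpler way to fill the array. This is only a sketch under that assumption; `InitArrayFast` is a hypothetical name, while `Array` and `Count` are the globals used by the rest of the answer.

```shell
#!/bin/bash
# Hypothetical variant of InitArray (requires bash 4+ for mapfile).
declare -a Array
Count=0

InitArrayFast() {
    # -t strips the trailing newline from each line as it is stored.
    mapfile -t Array < "$1"
    # Count holds the number of elements, as in the loop-based version.
    Count=${#Array[@]}
}
```

It takes the same argument as `InitArray`, for example `InitArrayFast uniqueOwnerIdFile`.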

Binary-search the array for a value:

BinarySearch ()
{
   # Returns 0 if $1 is in the sorted global Array (Count elements), 1 otherwise.
   local val=$1 low=0 high=$((Count - 1)) mid
   while (( low <= high )); do
      mid=$(( (low + high) / 2 ))
      if (( Array[mid] == val )); then
         return 0
      elif (( Array[mid] < val )); then
         low=$((mid + 1))    # the value can only be in the upper half
      else
         high=$((mid - 1))   # the value can only be in the lower half
      fi
   done
   return 1
}

uniqueOwnerIdFile is created from the first file and then loaded into the array:

InitFile uniqueOwnerIdFile
InitArray uniqueOwnerIdFile

Go through every line of the second file and look up both values in the owner ID array. Each line for which both are found is echoed to linesThatExistFile.

while IFS=, read -r firstVal secondVal; do
   if BinarySearch "$firstVal" && BinarySearch "$secondVal"; then echo "$firstVal,$secondVal"; fi
done < file2 > linesThatExistFile

score 0 · Accepted Answer

I'm not sure about a pure bash solution, but I can offer one using awk:

awk -F"," 'NR==FNR{col3[$2]++;next;}{ if ($1 in col3 && $2 in col3) print $0} ' File1 File2

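It first reads the second column of the first file into an associative array, then checks for each line of the second file whether both of its values are in the array.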
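To make the two-pass idiom easier to follow, here is a small sketch that wraps an equivalent awk program in a shell function. The function name `filter_pairs` and the array name `owners` are illustrative only and not taken from the answer.

```shell
#!/bin/bash
# Hypothetical wrapper around an equivalent awk program.
# $1: CSV whose second column holds owner IDs; $2: CSV of integer pairs.
filter_pairs() {
    awk -F, '
        # First file (NR == FNR): remember every owner_id as an array key.
        NR == FNR { owners[$2]; next }
        # Second file: print the line only if both columns are known keys.
        ($1 in owners) && ($2 in owners)
    ' "$1" "$2"
}
```

With a first file containing the owner IDs 10 and 5, `filter_pairs file1 file2` would keep a line such as `10,5` and drop `10,6`. Memory use grows with the number of distinct owner IDs, not with the size of the second file, which is streamed.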

score 0 · Accepted Answer

In bash, something like this might work.

#!/bin/bash

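# Newline-delimited list of unique owner IDs, padded with a newline on both
# ends so that each ID is matched as a whole line and not as a substring
# (otherwise 12 would match inside 123).
list=$'\n'$(cut -f2 -d, file1.txt | sort -u)$'\n'

while IFS=, read -r a b; do
  [[ $list == *$'\n'"$a"$'\n'* && $list == *$'\n'"$b"$'\n'* ]] && echo "$a,$b"
done <file2.txt >result.txt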

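I'm not too sure about the performance, though. Each test scans the whole list string, so the cost of a lookup grows with the number of unique owner IDs.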
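One way to avoid rescanning the list for every line, assuming bash 4 or newer, is to keep the owner IDs as keys of an associative array. This is a sketch and not the answer's code; `filter_pairs_assoc` and `seen` are hypothetical names. A bash read loop over a billion lines is still likely to be much slower than awk or Perl.

```shell
#!/bin/bash
# Hypothetical sketch (bash 4+): store owner IDs as keys of an associative
# array so that each membership test is a hash lookup, not a string scan.
filter_pairs_assoc() {
    local owners_csv=$1 pairs_csv=$2 id a b
    local -A seen=()
    # Load the second column of the first file as array keys.
    while IFS= read -r id; do
        [[ -n $id ]] && seen[$id]=1
    done < <(cut -d, -f2 "$owners_csv")
    # Print only the pairs whose two values are both known keys.
    while IFS=, read -r a b; do
        [[ -n $a && -n $b && -n ${seen[$a]:-} && -n ${seen[$b]:-} ]] && echo "$a,$b"
    done < "$pairs_csv"
    return 0
}
```

It would be invoked as `filter_pairs_assoc file1.txt file2.txt > result.txt`.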

score 0 · Accepted Answer

Perl solution. It remembers all the owners from file 1 in a hash, then goes through file 2 and outputs the lines where both owners exist in the hash.

#!/usr/bin/perl
use warnings;
use strict;

open my $F1, '<', 'file1' or die $!;
my %owner;
while (<$F1>) {
    $owner{(split /,/ => $_, 3)[1]} = 1;
}

open my $F2, '<', 'file2' or die $!;
while (my $line = <$F2>) {
    chomp $line;
    print "$line\n" if 2 == grep exists $owner{$_}, split /,/ => $line, 2;
}

A Bash pipeline that gives the same output but is significantly slower:

cut -d, -f2 file1 \
    | grep -vwFf- <(sed 's/,/\n/' file2) \
    | grep -vwFf- file2
