awk - Matching text using grep or awk

Question

I'm having trouble with grep and awk. I think it's because my input file contains text that looks like code.

The input file contains ID names like this:

SNORD115-40
MIR432
RNU6-2

The reference file looks like this:

Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2

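I want to match the ID names from the source file against the reference file and print the corresponding ENSG ID numbers, so that the output file looks like this: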

ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

I tried this loop:

exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done

I also tried using awk to process the reference file:

awk 'NF == 2 {print $0}' reference file
awk 'NF >2 {print $0}' reference file

But I only get one of the IDs from grep.

Any suggestions or simpler approaches would be great.

score 8 · Accepted Answer

$ fgrep -f source.file reference.file 
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

fgrep is equivalent to grep -F (recent GNU grep releases mark fgrep as obsolescent and recommend grep -F):

   -F, --fixed-strings
          Interpret  PATTERN  as  a  list  of  fixed strings, separated by
          newlines, any of which is to be matched.  (-F  is  specified  by
          POSIX.)

The -f option is used to obtain PATTERN from a file:

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)

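As noted in the comments, fgrep -f can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can build more precise patterns for grep on the fly with sed: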

grep -f <( sed 's/.*/ &$/' source.file) reference.file

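But this approach interprets the patterns as regular expressions rather than fixed strings, which can be fragile if an ID contains regex metacharacters (though it is probably fine if the IDs contain only alphanumeric characters). A better way, however (thanks to @sidharthcnadhan), is to use the -w option: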

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

So the final answer to your question is:

grep -Fwf source.file reference.file
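
To see what -w changes, here is a small self-contained sketch. The IDs and ENSG numbers in it are made up for illustration (MIR4321 and a bare RNU6 are not from the question), and it assumes a grep that accepts -F, -w and -f together, as GNU and BSD grep do.

```shell
#!/bin/sh
# Demo: substring matching (-F) versus whole-word matching (-Fw).
# All file contents below are hypothetical sample data.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' MIR432 RNU6 > "$tmp/source.file"
printf '%s\n' \
    'ENSG00000000001 MIR432' \
    'ENSG00000000002 MIR4321' \
    'ENSG00000000003 RNU6-2' > "$tmp/reference.file"

# Fixed strings only: any line containing a pattern as a substring matches.
substring_hits=$(grep -Ff "$tmp/source.file" "$tmp/reference.file")

# Fixed strings plus -w: the match must be delimited by non-word characters.
word_hits=$(grep -Fwf "$tmp/source.file" "$tmp/reference.file")

echo "with -F:";  echo "$substring_hits"
echo "with -Fw:"; echo "$word_hits"
```

With this data, the -F run returns all three lines, while the -Fw run drops the MIR4321 line. It still returns the RNU6-2 line for the pattern RNU6, because - is not a word character; if hyphenated IDs can collide like that in your data, comparing whole fields instead of words avoids it.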

score 4 · Accepted Answer

This should do the trick:

$ awk 'NR==FNR{a[$0];next}$NF in a{print}' input reference
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
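
For readers unfamiliar with the NR==FNR idiom, the same one-liner can be spelled out with comments. This is only an expanded sketch of the command above: the array name wanted and the temporary file paths are my own choices, and the sample data is copied from the question.

```shell
#!/bin/sh
# Demo: two-file awk lookup, first file = wanted IDs, second = reference table.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' SNORD115-40 MIR432 RNU6-2 > "$tmp/input"
cat > "$tmp/reference" <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

awk_out=$(awk '
    # While reading the first file, NR (overall line number) equals
    # FNR (line number within the current file).
    NR == FNR { wanted[$0]; next }   # store each ID as an array key

    # For the second file: print the line if its last field is a stored ID.
    $NF in wanted { print }
' "$tmp/input" "$tmp/reference")

echo "$awk_out"
```

Because the last field is compared as a whole, an ID such as MIR432 cannot match a longer symbol that merely contains it. One caveat I am aware of: if the first file is empty, NR==FNR stays true for the second file as well, so nothing is printed. Trailing whitespace or Windows line endings in either file would also prevent the exact match.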

score 1 · Accepted Answer

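That was a decent attempt in bash. The problem is that you overwrite the result file on every iteration. Use >> instead of >, or move the > after done: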

grep -w "$line" reference.file >> outputfile

or

done  > outputfile

But I prefer Lev's solution, because it starts an external process only once.
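
For completeness, the loop with the redirection moved after done might look like the sketch below. Quoting "$line", adding -- before the pattern, and reading with IFS= read -r are my additions rather than part of the original loop; the temporary directory only keeps the demo self-contained, and the sample data is copied from the question.

```shell
#!/bin/sh
# Demo: the while/grep loop with a single redirection after "done".
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' SNORD115-40 MIR432 RNU6-2 > "$tmp/source.file"
cat > "$tmp/reference.file" <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

# The output file is opened once for the whole loop, so each grep
# adds to it instead of truncating what the previous one wrote.
while IFS= read -r line; do
    grep -w -- "$line" "$tmp/reference.file"
done < "$tmp/source.file" > "$tmp/outputfile"

loop_out=$(cat "$tmp/outputfile")
echo "$loop_out"
```

This still runs one grep per ID, so for large ID lists a single grep -f call is likely to be noticeably faster.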

If you want to solve it in pure bash, you could try this:

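# Read the IDs into an array (assumes the IDs contain no whitespace or glob characters)
ID=($(<IDfile))

while read -r; do
   for ((i = 0; i < ${#ID[@]}; ++i)); do
       # Quoting the ID makes the regex match it literally at the end of the line
       [[ $REPLY =~ [[:space:]]"${ID[$i]}"$ ]] && echo "$REPLY" && break
   done
done <RefFile >outputfile

cat outputfile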

Output:

ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

Newer versions of bash (4.0 and later) support associative arrays, which can be used to simplify and speed up the key lookup:

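declare -A ID
for i in $(<IDfile); do ID[$i]=1; done   # one array key per wanted ID

while read -r v; do
   # Capture the last whitespace-separated field and look it up as a key
   [[ $v =~ [[:space:]]([^[:space:]]+)$ && ${ID[${BASH_REMATCH[1]}]} = 1 ]] && echo "$v"
done <RefFile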
