I have a large file (2 GB) that looks like this:

  >10GS_A
  YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
  LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
  DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
  LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ 
  >11BA_A
  KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
  CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
  SVPVHFDASV
  >11BG_A
  KESAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAVCSQKKVT
  CKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKPSVPVHFDASV
  >121P_A
  MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRD 
  QYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYG
  IPYIETSAKTRQGVEDAFYTLVREIRQH

I want to split this file into smaller files on the delimiter ">". In this case that would produce 4 files, named as follows:

10gs_A.txt
11ba_A.txt
11bg_A.txt
121p_A.txt

They would contain the following:

10gs_A.txt

>10GS_A
YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

11ba_A.txt

>11BA_A
KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
SVPVHFDASV

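... and so on. I know the split command on Linux can break up a large text file, but it names the files it creates temp00, temp01, temp02. Is there a way to split this large file and name the pieces the way I need? What split functionality would achieve this?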

2 Answers


How about using an awk script to split mybigfile?

splitter.awk

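BEGIN { outname = "noname.txt" }

# Header line: close the previous output file (a 2 GB input can otherwise
# exhaust open file handles), pick the new name, and skip the header itself.
/^>/  { close(outname)
        outname = substr($0, 2, 40) ".txt"
        next }

      { print > outname }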

If you want the delimiter line in the output, use the following instead:

splitter.awk

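BEGIN { outname = "noname.txt" }

# Header line: close the previous output file and pick the new name.
# There is no "next", so the header falls through and is printed as well.
/^>/  { close(outname)
        outname = substr($0, 2, 40) ".txt" }

      { print > outname }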

Then run the script:

awk -f splitter.awk mybigfile
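
Neither variant above lowercases the ID part of the file name (10gs_A.txt rather than 10GS_A.txt), which is what the question asks for. A minimal sketch of one way to do that, assuming every header has the form >ID_CHAIN and that IDs are unique; the scratch directory and sample.fa are made up for illustration and stand in for mybigfile:

```shell
# Work in a scratch directory so the generated files do not clutter anything.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample input; the real input would be mybigfile.
printf '>10GS_A\nYTVVYF\nLTLYQS\n>11BA_A\nKESAAA\n' > sample.fa

awk '
/^>/ {
    if (out != "") close(out)        # release the previous output file
    id = substr($0, 2)               # header without the leading ">"
    sub(/[ \t\r]+$/, "", id)         # drop trailing whitespace, if any
    n = index(id, "_")               # position of the chain separator, 0 if absent
    if (n > 0)
        out = tolower(substr(id, 1, n - 1)) substr(id, n) ".txt"
    else
        out = tolower(id) ".txt"
}
out != "" { print > out }            # header and sequence lines go to the current file
' sample.fa
```

Because each file is closed and later reopened with ">", a header that appears twice would overwrite the earlier file; switching to ">>" would append instead, at the cost of having to delete old output before a rerun.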
answered 2012-06-01T14:07:21.767

With gawk you can do:

gawk -v RS='>' 'NF { f = $1 ".txt"; printf "%s%s", RS, $0 > f; close(f) }' InputFile
answered 2012-06-02T06:33:05.473