
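This question is related to the following question:

How to parse tab-delimited data (of different formats) into a data.table/data.frame?

I have a poorly formatted text file in which the tab-delimited format is as follows: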

A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

However, a few lines in the text file are technically tab-delimited but consist of long strings, for example the 'Z' and 'Y' lines here:

Z  FX:E:4.2
Y   23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M 
A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

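The segment Y 23434M,23434M,... in this text file can be several GB long.

These lines are very rare and are marked only by a leading Z or Y. At the moment I open the file in a text editor and delete these lines by hand.

However, this is not algorithmically sound. Is there a way to parse this file so that (1) only the A and B lines are used and (2) the Z and Y lines are explicitly not used?

Edit: to clarify, Z is not a long string. Only 'Y' is a long string here. Z is a string of the format X XX:X:0.0, where X is a character and 0 an integer.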


1 Answer


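You can make a system call that fixes the file with sed according to some pattern. If, say, you want to remove all lines that start with Z or Y, you can simply pass a regex followed by /d: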

system("sed -i '/^[ZY]/d' test.tab")

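The command above removes, in place, every line that starts with Z or Y from your file. You can then run the same code I posted in the previous question: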

library(data.table)
fread("sed '$!N;s/\\n/ /' test.tab")
#    V1   V2 V3   V4 V5   V6   V7         V8
# 1:  A 1092  - 1093  + 1X B 1093 HRDCPMRFYT
# 2:  A 1093  + 1094  - 1X B 1094 BSZSDFJRVF
# 3:  A 1094  + 1095  + 1X B 1095 SSTFCLEPVV
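
If modifying the file in place is undesirable, the two sed steps can probably be chained in a pipeline instead. The following is only a minimal sketch under stated assumptions: sample.tab is a hypothetical file name, its contents are a shortened copy of the example data, and the variable name joined is purely illustrative.

```shell
# Build a small tab-separated sample (hypothetical file name: sample.tab).
printf 'Z\tFX:E:4.2\n' > sample.tab
printf 'Y\t23434M,23434M,23434M\n' >> sample.tab
printf 'A\t1092\t-\t1093\t+\t1X\n' >> sample.tab
printf 'B\t1093\tHRDCPMRFYT\n' >> sample.tab
printf 'A\t1093\t+\t1094\t-\t1X\n' >> sample.tab
printf 'B\t1094\tBSZSDFJRVF\n' >> sample.tab

# Step 1: drop every line that starts with Z or Y.
#         Without -i the result goes to stdout and the file stays untouched.
# Step 2: append the next line to the current one ($!N) and replace the
#         embedded newline with a space, so each A line is joined with
#         the B line that follows it.
joined=$(sed '/^[ZY]/d' sample.tab | sed '$!N;s/\n/ /')

printf '%s\n' "$joined"
```

The same pipeline could presumably be handed to fread through its cmd argument, for example fread(cmd = "sed '/^[ZY]/d' test.tab | sed '$!N;s/\\n/ /'"), which would leave test.tab unchanged on disk.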

Data

text <- "Z FX:E:4.2
Y  23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M 
A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV"

# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
answered 2018-05-14T05:04:57.390