r - Scan two columns in R and keep unique pairs

Question

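I have a tab-delimited data file with four columns. I want to read the first two columns into R and keep only the unique pairs of those two columns as a data.frame. The file can run to millions of rows: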

cluster-1    3    12412341324    13412341234
cluster-1    3    62626662346    54234524354
cluster-1    3    45454345354    45454544545
cluster-2    644  12332234341    37535473475
cluster-2    644  54654365466    56565634543
cluster-2    644  56356356536    35634563456
...
cluster-9999999    123    123412341241    143132423
...

I would like to use scan (or any better option) to read the file and end up with a data.frame:

cluster-1    3
cluster-2    644
cluster-3    343
...
cluster-9999999    123

What is the most time-efficient way to read such large files in R?

score 6 · Accepted Answer

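Known and relatively few columns: If you know the number of columns, say 5, and you want the first two (or only a few columns), you can do this with the colClasses argument of read.table: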

# header here is set to false because I don't see one in your file
df <- read.table("~/file.txt", header = FALSE, 
              colClasses=c("character", "numeric", "NULL", "NULL", "NULL"))

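Here, columns 3 to 5 are set to "NULL" so that they are skipped. colClasses takes one entry per column, so for the four-column file in the question two "NULL" entries would be enough.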
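For the four-column layout shown in the question, the same idea can be sketched as follows. The temporary file and its three rows are made up for illustration, and sep = "\t" is added on the assumption that the file is strictly tab-delimited.

```r
# Hypothetical sample file: four tab-separated columns, as in the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "cluster-1\t3\t12412341324\t13412341234",
  "cluster-1\t3\t62626662346\t54234524354",
  "cluster-2\t644\t12332234341\t37535473475"
), path)

# One colClasses entry per column; "NULL" tells read.table to skip that column.
df4 <- read.table(path, header = FALSE, sep = "\t",
                  colClasses = c("character", "numeric", "NULL", "NULL"))

str(df4)  # two columns remain: V1 (character) and V2 (numeric)
```

Skipped columns are never converted or stored, which is where most of the time and memory savings on a large file come from.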

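Unknown columns / many columns: If you don't know the number of columns, or there are too many of them, another option is to use pipe with awk (or pipe with cut, for that matter) to first filter the file down to the columns you need, and then load the result with read.table: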

# header here is set to false because I don't see one in your file
df <- read.table(pipe("awk '{print $1\"\t\"$2}' ~/file.txt"), 
                       header = FALSE, sep = "\t")
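The cut variant mentioned above might look like the sketch below. It assumes a Unix-like system where cut is on the PATH; the temporary file and its contents are hypothetical stand-ins for the real data.

```r
# Hypothetical sample file: four tab-separated columns, as in the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "cluster-1\t3\t12412341324\t13412341234",
  "cluster-2\t644\t12332234341\t37535473475",
  "cluster-2\t644\t54654365466\t56565634543"
), path)

# cut splits on tabs by default; -f1,2 keeps only the first two fields.
# shQuote() protects the path in case it contains spaces.
cmd <- paste("cut -f1,2", shQuote(path))

df_cut <- read.table(pipe(cmd), header = FALSE, sep = "\t",
                     colClasses = c("character", "numeric"))
```

Because the filtering happens before R sees the data, R only has to parse two fields per line regardless of how many columns the file actually has.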

Removing duplicate rows: use duplicated from base R as follows:

df <- df[!duplicated(df), ]
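To see what duplicated does here, a small made-up data frame (df_demo is a hypothetical name, not part of the original answer) can be checked by hand:

```r
# Hypothetical two-column data with repeated (V1, V2) pairs.
df_demo <- data.frame(
  V1 = c("cluster-1", "cluster-1", "cluster-2", "cluster-2"),
  V2 = c(3, 3, 644, 644),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row.
dup <- duplicated(df_demo)   # FALSE TRUE FALSE TRUE

# Keep only the first occurrence of each pair, then renumber the rows.
df_unique <- df_demo[!dup, ]
rownames(df_unique) <- NULL
```

For data frames, unique(df_demo) keeps the same rows, so either form can be used.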
