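Assuming you have managed to read the file, and that the result is a data.frame with factor columns, you can exploit the fact that a factor is already stored as a vector of integer codes numbered from 1: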
DF <- read.table(text = "ID1 ID2 ID3 ID4 ID5
SNP1 AA AA AB AA BB
SNP2 AB AA BB AA AA
SNP3 BB BB BB AB BB
SNP4 AA AB BB BB AA
SNP5 AA AA AA AA AA
", header = TRUE, sep = "", stringsAsFactors = TRUE)  # factors are no longer the default since R 4.0.0
for (i in seq_along(DF)) {
  # check whether the column's levels are in the expected order;
  # if not, relevel the column
  if (!identical(levels(DF[[i]]), c("AA", "AB", "BB"))) {
    warning("Levels do not match in column ", i, ". Relevelling.")
    DF[[i]] <- factor(DF[[i]], levels = c("AA", "AB", "BB"))
  }
  # drop the factor class and levels, keeping only the underlying
  # integer codes, then subtract 1 so the numbering starts at 0
  DF[[i]] <- as.integer(DF[[i]]) - 1L
}
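The code checks whether each column's levels are in the expected order and relevels the column when they are not. Hopefully that does not happen often, because relevelling slows things down.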
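To see why the level check matters, here is a minimal sketch using two small made-up vectors (x and y are illustrative names, not part of the data above). When a genotype is missing from a column, the default levels shift, and so do the integer codes:

```r
# A column in which "AB" never occurs: the default levels skip it
x <- factor(c("BB", "AA", "BB"))
levels(x)             # "AA" "BB"
as.integer(x) - 1L    # 1 0 1  (BB is wrongly coded as 1)

# Forcing the full set of levels keeps the coding stable
y <- factor(c("BB", "AA", "BB"), levels = c("AA", "AB", "BB"))
as.integer(y) - 1L    # 2 0 2  (BB is coded as 2, as intended)
```

This is the situation the loop guards against: in the example data, column ID5 contains only AA and BB, so it gets relevelled.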
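It may be that your file does not fit into memory, in which case the operating system (Windows, Linux, ...) will start using the disk as memory. That slows things down enormously. In that case you are probably better off using a package such as ff or bigmemory.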