
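Sorry for the rather general question. I am looking for pointers on cleaning up a data folder that contains many .txt files. They all have different file names, and the vast majority have the same dimensions, i.e. the same number of columns. The pain is that some files, despite having the same number of columns, have different column names; in those files some other variables were measured.

I want to weed out these files, and I cannot do that by simply comparing the number of columns. Is there a way to pass in a column name and check how many files in the directory contain that column, so that I can move them into a different folder?

Update:

I created a dummy folder with files that reproduce the problem; see the link below to the files on my Google Drive. In this folder I have put 4 files that contain the problem columns.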

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing

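The code seems able to find the files that match the selection criterion, i.e. the actual names of the problem columns, but I cannot extract the true indices of those files in the list. Any pointers?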

library(data.table)

# read in an example file that contains the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = TRUE, sep = "\t")

# read in a file that I want to use as the reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = TRUE, sep = "\t")

#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

# confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])

# visually check the names of the problem columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                         sep = "\t",
                         header = T,
                         nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)

# get the unique names of the mismatched columns
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to move
# "to_keep" comes back as an integer vector that I don't understand:
# I expected the numbers to be the indices of the files,
# but I have fewer than 10 files and the numbers in to_keep are around 1000.
# This is probably because they are positions in the unlisted vector of names,
# but to_keep <- which(column_names %in% unique_names[1]) returns an empty vector.

to_keep <- which(unlist(column_names)%in% unique_names[1])


# if I now subset the file list with to_keep, files_to_keep is NA NA NA
files_to_keep <- files_in_wd[to_keep]

# once I have the list of target files, I can move them into a new folder with file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )

2 Answers


If you can tell the files to keep from the files to remove by their column names, you could use something like this:

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}

# get column names of all files (one character vector per file)
column_names <- lapply(l_files, names)
# get the distinct headers, i.e. the unique sets of column names
unique_names <- unique(column_names)
# keep the files whose header equals the first distinct header
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If you have a lot of files, you should probably avoid the loop, or read in only the header of each file.
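Reading only the header line could look roughly like the sketch below. It assumes tab-separated files whose first line is the header; `read_header` is a hypothetical helper name, not a function from any package. Unlike `read.delim`, it does not strip quotes or adjust the names, so the result is closest to what `check.names = FALSE` would give for unquoted headers.

```r
# Hypothetical helper: return the column names of one delimited file
# by reading only its first line (assumes line 1 is the header).
read_header <- function(path, sep = "\t") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  if (length(first_line) == 0) return(character(0))  # empty file
  strsplit(first_line, split = sep, fixed = TRUE)[[1]]
}

# Entries in the working directory whose name ends in .txt or .TXT.
files_in_wd <- list.files(pattern = "\\.txt$", ignore.case = TRUE)

# Named list: one character vector of column names per file.
column_names <- lapply(files_in_wd, read_header)
names(column_names) <- files_in_wd
```

Because no data rows are parsed, the cost per file should not depend on the file size.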

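Edit after your comment:

  • Adding nrows = 2 makes the code read only the header plus the first 2 rows.
  • I assume the first file in the folder has the structure you want to keep, which is why column_names is checked against unique_names[1].
  • files_to_keep contains the names of the files you want to keep.
  • You could try running it on a subset of your data to see whether it works before worrying about efficiency. I think a vectorized approach might be better.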
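A vectorized check for a single column name might look like the sketch below. It assumes `column_names` is a list with one character vector per file, in the same order as `files_in_wd`; the toy values are made up for illustration, and `has_column` is a hypothetical helper name.

```r
# Toy stand-ins (made up for illustration) for the real objects.
files_in_wd <- c("a.TXT", "b.TXT", "c.TXT")
column_names <- list(c("time", "x", "y"),
                     c("time", "x", "z"),
                     c("time", "x", "y"))

# Hypothetical helper: TRUE for each file whose header contains `col`.
has_column <- function(column_names, col) {
  vapply(column_names, function(nms) col %in% nms, logical(1))
}

# One logical per file, so which() yields file indices (here: 2),
# whereas which(unlist(column_names) %in% "z") yields 6, a position
# in the flattened vector of all names.
to_move <- which(has_column(column_names, "z"))
files_to_move <- files_in_wd[to_move]
```

This is also why indexing the file list with positions from `unlist(column_names)` gives `NA`: with roughly 130 columns per file, those positions quickly exceed the number of files. Testing `column_names %in% "z"` directly compares each file's whole header with the single name, so it matches nothing.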

Edit: this code works for your dummy data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE
                            )
}

# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check whether each file has the same header as the reference file;
# vapply() also works when the folder holds a single file, where 2:length(x) would not
df_filehelper <- data.frame(fileindex = seq_along(files_in_wd),
                            filename = files_in_wd,
                            keep = vapply(column_names, identical, logical(1), y = to_keep),
                            stringsAsFactors = FALSE)

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept

file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
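
If installing filesstrings is not an option, the move step can be approximated with base R. The sketch below is an assumption-laden alternative, not the code used above: `move_files` is a hypothetical helper, and `file.rename` may fail when source and destination are on different drives or file systems.

```r
# Hypothetical helper: move files into dest_dir using base R only.
move_files <- function(files, dest_dir) {
  # Create the destination folder if it does not exist yet.
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  # file.rename() is vectorized and returns one TRUE/FALSE per file.
  file.rename(from = files, to = file.path(dest_dir, basename(files)))
}
```

The returned logical vector can be checked with `all()` to confirm that every file was moved.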
answered 2020-10-23T21:01:08.760

Given the number and size of the files, it may be worth looking for an alternative to R, for example in bash:

# md5 is the macOS tool; on Linux use md5sum instead
for f in ctrl*.txt
do
  if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 "$f" | md5)" ]]
    then echo "$f"
  fi
done

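This command compares the column names of the "good" file with those of every file and prints the names of the files that do not match.

answered 2020-10-27T00:51:46.237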