r - R: How to find and select files in a folder based on matching specific column headers

Question

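Sorry for the generic question. I am looking for pointers on tidying up a data folder that contains many .txt files. They all have different titles, and the vast majority of them have the same dimensions, i.e. the same number of columns. The pain is that some files, despite having the same number of columns, have different column names. In other words, some other variables were measured in those files.

I want to weed out these files, and I cannot do that by simply comparing the number of columns. Is there a way to pass in the name of a column and check how many files in the directory have that column, so that I can move them into a different folder?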

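Update:

I created a dummy folder containing files that reproduce the problem; see the link below to access the files on my Google Drive. In this folder I included 4 files that contain the problem columns.

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing

The problem is that the code seems able to find the files that match the selection criterion, i.e. the actual name of a problem column, but I cannot extract the real indices of those files in the list. Any pointers?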

library(data.table)

#read in an example file that contains the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")

#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")

#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

#confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])

#visually check the names of the problem columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                         sep = "\t",
                         header = T,
                         nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)

# get the unique names of the mismatched columns
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
# here "to_keep" is an integer vector that I don't understand
# I thought the numbers would be the IDs/indices of the list elements (files),
# but I have fewer than 10 files and the numbers in to_keep are around 1000
# this is probably because they are indices into the unlisted (flattened) list
# but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector

to_keep <- which(unlist(column_names)%in% unique_names[1])


#now if I subset the file list with to_keep, files_to_keep is NA NA NA
files_to_keep <- files_in_wd[to_keep]

#once I have a list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )

score 1 · Accepted Answer

If you can distinguish the files to keep from the files to remove by their column names, you can use something like the following:

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names found across the files
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If you have a lot of files, you should probably avoid the loop, or read in only the header of each file.
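As a rough sketch of the "read only the header" idea, the snippet below reads a single line per file with readLines() and splits it on tabs instead of parsing the file with read.delim(). The demo folder, the file names and the helper read_header() are made up for illustration. It assumes tab-separated files whose first line is a non-empty header; the names come back exactly as written in the file, without the adjustments read.delim() makes when check.names = TRUE.

```r
# Hypothetical demo folder with three small tab-separated files
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id\ttime\tvalue", "1\t0\t3.2"), file.path(demo_dir, "a.txt"))
writeLines(c("id\ttime\tvalue", "2\t0\t4.1"), file.path(demo_dir, "b.txt"))
writeLines(c("id\ttime\tother", "3\t0\t9.9"), file.path(demo_dir, "c.txt"))

# read_header() is a hypothetical helper: read one line, split it on tabs
read_header <- function(path) {
  first_line <- readLines(path, n = 1, warn = FALSE)
  strsplit(first_line, "\t", fixed = TRUE)[[1]]
}

# list only the text files (case-insensitive, so .TXT is matched too)
files <- list.files(demo_dir, pattern = "\\.txt$",
                    full.names = TRUE, ignore.case = TRUE)

# one character vector of column names per file, without a for loop
column_names <- lapply(files, read_header)
names(column_names) <- basename(files)
column_names
```

Because only one line per file is read, the cost should depend mostly on the number of files rather than on their size.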

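Edit after your comment:

Adding nrows = 2 makes the code read only the header plus the first 2 rows.
I assume that the first file in the folder has the structure you want to keep, which is why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you want to keep.
You could try running it on a subset of your data to see whether it works, and worry about efficiency afterwards. I think a vectorized approach might work better.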
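The indices "around 1000" in the question's update probably come from unlist(), which flattens every file's column names into one long vector, so which() returns positions in that vector rather than file numbers. A minimal sketch of a per-file check, using toy data in place of the real column_names list; the column name "other" is a made-up stand-in for a problem column:

```r
# Toy stand-in for column_names: one character vector per file
files_in_wd <- c("a.txt", "b.txt", "c.txt")
column_names <- list(
  c("id", "time", "value"),
  c("id", "time", "other"),
  c("id", "time", "value")
)
target <- "other"  # hypothetical problem column

# Flattening loses the file boundaries: the result is a position in the
# 9-element vector of all names, not a file number
which(unlist(column_names) %in% target)  # 6

# Asking the question once per file gives one TRUE/FALSE per file
has_target <- vapply(column_names, function(nm) target %in% nm, logical(1))

sum(has_target)                            # how many files have the column
flagged_files <- files_in_wd[has_target]   # which files they are
flagged_files
```

The target name has to be spelled the way the files were read: with the default check.names = TRUE, read.delim() may alter header names, so the reference file and the folder files should be read with the same setting.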

Edit: this code works for your dummy data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE
                            )
}

# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
  'filename' = files_in_wd,
  'keep' = NA)

for(i in seq_along(files_in_wd)[-1]){ # start at file 2; does nothing if there is only one file
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}

df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects the files that are not to be kept

file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
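If installing filesstrings is not an option, the move step can also be sketched in base R with dir.create() and file.rename(). The demo folder and the target folder name below are made up, and file.rename() may fail when source and destination are on different drives or file systems, so this is a sketch under those assumptions rather than a drop-in replacement for file.move().

```r
# Hypothetical demo folder with one "good" and one "problem" file
demo_dir <- file.path(tempdir(), "move_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("id\ttime\tvalue", file.path(demo_dir, "a.txt"))
writeLines("id\ttime\tother", file.path(demo_dir, "c.txt"))

files_to_move <- file.path(demo_dir, "c.txt")
target_dir <- file.path(demo_dir, "need_reanalysis")  # hypothetical folder name

# create the destination if needed, then move each file, keeping its name
dir.create(target_dir, showWarnings = FALSE)
moved <- file.rename(files_to_move,
                     file.path(target_dir, basename(files_to_move)))

moved                    # one TRUE/FALSE per file
list.files(target_dir)
```

When the destination sits inside the folder being scanned, as it does here, a later non-recursive list.files() call on that folder will also return the subfolder's name, so filtering with a pattern such as "\\.txt$" before reading avoids passing a directory to read.delim().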

score 0 · Accepted Answer

Given the large number and size of the files, it may be worth looking at alternatives to R, for example in bash:

# md5 is the macOS/BSD command; on Linux, md5sum can be used in its place
ref="ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt"
for f in ctrl*.txt
do
  if [[ "$(head -n 1 "$ref" | md5)" != "$(head -n 1 "$f" | md5)" ]]
    then echo "$f"
  fi
done

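This command compares the column names of the "good file" with the column names of every file and prints the names of the files that do not match. The glob is case-sensitive, so the pattern needs adjusting if the files end in .TXT, as they do in the question.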
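For readers who would rather stay in R, the same comparison of raw header lines can be sketched as follows. The demo folder and file names are invented, and the reference file is simply assumed to be the one called ctrl_ref.txt.

```r
# Hypothetical demo folder: a reference file, a matching file, a mismatch
demo_dir <- file.path(tempdir(), "compare_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id\ttime\tvalue", "1\t0\t3.2"), file.path(demo_dir, "ctrl_ref.txt"))
writeLines(c("id\ttime\tvalue", "2\t0\t4.1"), file.path(demo_dir, "ctrl_ok.txt"))
writeLines(c("id\ttime\tother", "3\t0\t9.9"), file.path(demo_dir, "ctrl_odd.txt"))

files <- list.files(demo_dir, pattern = "^ctrl.*\\.txt$", full.names = TRUE)

# first line of every file as one string, like head -n 1 in the bash loop
first_lines <- vapply(files,
                      function(p) readLines(p, n = 1, warn = FALSE),
                      character(1), USE.NAMES = FALSE)

# header line of the assumed reference file
reference <- first_lines[basename(files) == "ctrl_ref.txt"]

# files whose header line differs from the reference
mismatched <- basename(files[first_lines != reference])
mismatched
```

Comparing whole lines is stricter than comparing sets of column names: a different column order or a trailing tab also counts as a mismatch.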
