python - Batch (basename) file/folder renaming using an "index"

Question

Batch renaming of files and folders is a frequently asked question, but after some searching I don't think any of the existing questions matches mine.

Background: we send some biological samples to a service provider which returns files with unique names and a table in text format containing, amongst other information, the file name and the sample that originated it:

head samples.txt
fq_file Sample_ID   Sample_name Library_ID  FC_Number   Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz    S1746_B_7_t B 7 t   L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz    S1726_A_3_t A 3 t   L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz    S1731_A_GFP_c   A GFP c L2354_A_GFP_c   163 5
L2377_Track-3893_R1.fastq.gz    S1754_B_7_c B 7 c   L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz    S1739_B_GFP_t   B GFP t L2362_B_GFP_t   163 6

The directory structure (for 34 directories):

L2369_Track-3885_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
L2349_Track-3865_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info

Goal: because the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and the folders with the corresponding sample name, re-ordered in a more suitable manner. The result should look like:

7_t_B
   7_t_B.bam      
   deletions.bed   
   junctions.bed         
   logs
   7_t_B.bam.bai  
   insertions.bed  
   left_kept_reads.info
3_t_A
   3_t_A.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
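
The reordering rule implied by the example above (sample name "B 7 t" becomes "7_t_B") can be sketched in Python as follows. This is only an illustration: the function name `new_name` is hypothetical, and it assumes every sample name has exactly three whitespace-separated parts.

```python
def new_name(sample_name):
    """Reorder a sample name such as 'B 7 t' into '7_t_B'."""
    parts = sample_name.split()
    if len(parts) != 3:
        raise ValueError("expected three parts, got %r" % sample_name)
    first, second, third = parts
    # Move the first part to the end and join with underscores.
    return "_".join((second, third, first))


if __name__ == "__main__":
    for name in ("B 7 t", "A 3 t", "A GFP c"):
        print(name, "->", new_name(name))
```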

I've hacked together a solution with bash and python (I'm a newbie), but it feels over-engineered. Is there a simpler or more elegant way of doing this that I've missed? Solutions can be in python, bash, or R; awk would also be welcome, since I am trying to learn it. Being a relative beginner tends to make one overcomplicate things.

This is my solution:

A wrapper puts it all in place and gives an idea of the workflow:

#! /bin/bash

# select columns of interest and write them to a file - basenames
tail -n +2 samples.txt |  cut -d$'\t' -f1,3 >> BAMfilames.txt 

# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py

# finally do the renaming
./renameBam.sh

# and the folders too
./renameBamFolder.sh

renameBamFiles.py:

#! /usr/bin/env python
import re

# Read in the data sample file and create a bash file that will rename the tophat output 
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
# 

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
InFileName = 'BAMfilames.txt'


### Rename BAM files

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBam.sh'

OutFile=open(OutFileName,'w') # 'w' overwrites; you can append instead with 'a'

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)

    print(command)
    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()


### Rename folders

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBamFolder.sh'

OutFile=open(OutFileName,'w') 

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "mv %s %s" % (folderName, fileName)

    print(command)

    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()
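
To make the regular expression used in both halves of the script easier to follow, here is what it does to one of the file names from samples.txt. This is only an illustration; the variable names `fq_file` and `folder_name` are not taken from the script above.

```python
import re

fq_file = "L2369_Track-3885_R1.fastq.gz"

# Group 1 greedily takes everything up to the last underscore that is
# followed by word characters, group 2 is that underscore, group 3 is "R1".
# The trailing .* swallows ".fastq.gz"; the replacement re-appends group 2.
folder_name = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', fq_file)

print(folder_name)  # L2369_Track-3885_R1_
```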

renameBam.sh - created by the previous python script:

#! /bin/bash

for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)

renameBamFolder.sh is very similar:

mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B

Since I am learning, I feel that some examples of different ways of doing this, and thinking about how to do it, will be very useful.

score 2 · Accepted Answer

A simple approach in bash:

find . -type d -print |
while IFS= read -r oldPath; do

   parent=$(dirname "$oldPath")
   old=$(basename "$oldPath")
   new=$(awk -v old="$old" '$1~"^"old{print $4"_"$5"_"$3}' samples.txt)

   if [ -n "$new" ]; then
      newPath="${parent}/${new}"
      echo mv "$oldPath" "$newPath"
      echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
   fi
done

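Remove the "echo" after initial testing to make it actually perform the "mv".

If all of your target directories are at one level, as @triplee's answer implies, then it is even simpler. Just cd to their parent directory and do: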

awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
   echo mv "$old" "$new"
   echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done

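In one of your expected outputs you renamed the ".bai" file and in the other you didn't, and you didn't say whether you want to do that. If you do want to rename it as well, just add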

echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"

to whichever of the above solutions you prefer.
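
For comparison, the same dry-run idea can be sketched in Python. This is a minimal sketch rather than part of the answer above: the function name `plan_moves` is hypothetical, and it assumes whitespace-separated fields laid out as in the samples.txt excerpt from the question.

```python
def plan_moves(lines):
    """Return (old, new) path pairs for each data line of samples.txt."""
    moves = []
    for line in lines[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # Strip everything after the last underscore, like the awk sub().
        old_dir = fields[0][: fields[0].rfind("_") + 1]
        new = "_".join((fields[3], fields[4], fields[2]))
        moves.append((old_dir, new))
        moves.append((new + "/accepted_hits.bam", new + "/" + new + ".bam"))
    return moves


if __name__ == "__main__":
    example = [
        "fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos",
        "L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6",
    ]
    # Only print the planned moves; nothing on disk is touched.
    for old, new in plan_moves(example):
        print("mv", old, new)
```

Printing the plan first plays the same role as the "echo" in the shell version: the moves can be inspected before anything is renamed.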

score 0 · Accepted Answer

Here is one way using a shell script. Run it like this:

script.sh /path/to/samples.txt /path/to/data

Contents of script.sh:

#!/bin/bash

# add directory names to an array
while IFS= read -r -d '' dir; do

    dirs+=("$dir")

done < <(find "$2"/* -type d -print0)


# process the sample list
while IFS=$'\t' read -r -a list; do

    for i in "${dirs[@]}"; do

        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then

            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            bam="$new/accepted_hits.bam"

            # only change name if there's a bam file
            if [ -f "$i/accepted_hits.bam" ]; then

                mv "$i" "$new"
                mv "$bam" "$new/$tag.bam"
            fi
        fi
    done

done < <(tail -n +2 "$1")

score 0 · Accepted Answer

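Although it's not quite what you're looking for (just thinking outside the box): you might consider an alternative "view" of your filesystem, using the term "view" in the sense that a database view relates to a table. You could do this via a "filesystem in userspace", FUSE. Many existing utilities can be used for this, but I don't know of one that works generally on any set of files specifically for renaming/reorganizing. As a concrete example of how it could be used, though, pytagsfs creates a virtual (FUSE) filesystem that, based on rules you define, makes the directory structure of the files appear however you want. (Maybe that would work for you too, but pytagsfs is really designed for media files.) Then you just operate on that (virtual) filesystem with whatever programs normally access that data. Or, to make the virtual directory structure permanent (if pytagsfs doesn't have an option to do this), just copy the virtual filesystem into another directory (outside the virtual filesystem).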

score 0 · Accepted Answer

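It seems you could simply read the required fields from the index file in a simple while loop. The structure of the file isn't obvious, so I'm assuming it is whitespace-separated and that Sample_Id is actually four fields (the composite sample_id, followed by the three components of the name). Perhaps you have a tab-delimited file with internal spaces in the Sample_Id field? Either way, this should be easy to adapt if my assumptions are wrong.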

# Skip the annoying field names (the first line)
tail -n +2 samples.txt |
while read -r fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done

Take out the echos if the output looks like what you want.
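
If the file turns out to be tab-delimited with spaces inside the sample name (which is what the `cut -d$'\t' -f1,3` call in the question suggests), the parsing could be adapted along these lines. This is a minimal Python sketch under that assumption; `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Return (directory, new_name) from one tab-delimited data line."""
    columns = line.rstrip("\n").split("\t")
    fq_file, sample_name = columns[0], columns[2]
    # Drop the trailing 'R1.fastq.gz' to get the directory name,
    # mirroring ${fq%R1.fastq.gz} in the shell loop.
    suffix = "R1.fastq.gz"
    if fq_file.endswith(suffix):
        directory = fq_file[: -len(suffix)]
    else:
        directory = fq_file
    # The sample name holds three space-separated parts, e.g. "B 7 t".
    c, a, b = sample_name.split()
    return directory, "_".join((a, b, c))


if __name__ == "__main__":
    row = "L2369_Track-3885_R1.fastq.gz\tS1746_B_7_t\tB 7 t\tL2369_B_7_t\t163\t6\n"
    print(parse_line(row))
```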

score 0 · Accepted Answer

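Of course, you can do this in Python alone, and it yields a small, readable script.

First: read the samples.txt file and create a mapping from the existing file prefixes to the desired prefixes. The file isn't formatted for use with Python's CSV reader module, because the column separator is also used within the last data column.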

mapping = {}
with open("samples.txt") as samples:
    # throw away the header line
    samples.readline()
    for line in samples:
        # split the columns on any run of whitespace (spaces or tabs)
        fields = line.split()
        # skip blank or malformed lines
        if len(fields) < 6:
            continue
        # after the whitespace split, the three parts of the sample name
        # (e.g. "B 7 t") land in Sample_name, Library_ID and FC_Number
        fq_file, sample_id, Sample_name, Library_ID, FC_Number, track_lanes_pos, *other = fields
        # the [:-2] part throws away the "R1" suffix, as in the example above
        file_prefix = fq_file.split(".")[0][:-2]
        target_id = "_".join((Library_ID, FC_Number, Sample_name))
        mapping[file_prefix] = target_id

Then go through the directory names and, within each directory, look for the ".bam" files to rename.

import os
for entry in os.listdir("."):
    if entry in mapping:
        dir_prefix = "./" + entry + "/"
        for file_entry in os.listdir(dir_prefix):
            # matches both "accepted_hits.bam" and "accepted_hits.bam.bai"
            if ".bam" in file_entry:
                parts = file_entry.split(".bam")
                parts[0] = mapping[entry]
                new_name = ".bam".join(parts)

                os.rename(dir_prefix + file_entry, dir_prefix + new_name)
        # rename the directory itself once its files are done
        os.rename(entry, mapping[entry])
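
Since the loop above renames things immediately, it may be worth previewing the moves first. The following is a minimal sketch of a dry-run variant, not the code from this answer: the function name `rename_all` and its `dry_run` flag are hypothetical, and it assumes `mapping` has the same shape as the dictionary built above.

```python
import os


def rename_all(mapping, root=".", dry_run=True):
    """Plan (and optionally perform) the directory and .bam renames.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing on disk is changed.
    """
    planned = []
    for entry in sorted(os.listdir(root)):
        old_dir = os.path.join(root, entry)
        if entry not in mapping or not os.path.isdir(old_dir):
            continue
        new = mapping[entry]
        for file_entry in sorted(os.listdir(old_dir)):
            # Covers accepted_hits.bam and accepted_hits.bam.bai.
            if file_entry.startswith("accepted_hits.bam"):
                new_file = file_entry.replace("accepted_hits", new, 1)
                planned.append((os.path.join(old_dir, file_entry),
                                os.path.join(old_dir, new_file)))
        # The directory is renamed last, after the files inside it.
        planned.append((old_dir, os.path.join(root, new)))
    if not dry_run:
        for old_path, new_path in planned:
            os.rename(old_path, new_path)
    return planned
```

Calling `rename_all(mapping)` only returns the plan; `rename_all(mapping, dry_run=False)` performs it.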
