This is what I am doing:

import csv
output = open('output.txt' , 'wb')


# this function returns the min stored in <num>.txt
def get_min(num):
    return int(open('%s.txt' % num, 'r+').readlines()[0])

# temporary variables
last_line = ''
input_list = []

# iterate over input.txt and collect the input into a list of (header, data) tuples
for i, line in enumerate(open('input.txt', 'r+').readlines()): 
    if i%2 == 0: 
        last_line = line
    else:
        input_list.append((last_line, line))
filtered = [(header, data[:get_min(header[-2])] + '\n' ) for (header, data) in input_list]
[output.write(''.join(data)) for data in filtered]
output.close()

In this code, input.txt looks like this:

>012|013|0|3|M
AFDSFASDFASDFA
>005|5|67|0|6
ACCTCTGACC
>029|032|4|5|S
GGCAGGGAGCAGGCCTGTA

and num.txt looks like this:

M 4
P 10

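For each record in input.txt above, I want to look at the last column of its header, find the entry in num.txt with the same key, and cut the record's characters according to the value stored there.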

I think the error in my code is that it only accepts text files named with integers; it should also accept names that contain letters.

2 Answers

After a long chat with the OP, here is a completely revised version:

import os
import re

# Fetch all hashes and counts
file_c = open('num.txt')
file_c = file_c.read()
lines = re.findall(r'\w+\.txt \d+', file_c)
numbers = {}

for line in lines:
    line_split = line.split('.txt ')
    hash_name = line_split[0]
    count = line_split[1]

    numbers[hash_name] = count

#print(numbers)

# The input file
file_i = open('input.txt')
file_i = file_i.read()

for hash_name, count in numbers.items():  # items() works on Python 2 and 3
    regex = '(' + hash_name.strip() + ')'
    result = re.findall(r'>.*\|(' + regex + ')(.*?)>', file_i, re.S)

    if len(result) > 0:
        data_original = result[0][2]
        stripped_data = result[0][2][int(count):]

        file_i = file_i.replace(data_original, '\n' + stripped_data)

        #print(data_original)
        #print(stripped_data)

#print(file_i)


# Write the input file to new input_new.txt (the with block closes the file)
with open('input_new.txt', 'wt') as f:
    f.write(file_i)
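
For the layout shown in the question itself (num.txt lines such as `M 4`, and a key in the last `|`-separated column of each header), a regex-free variant might look like the sketch below. The helper names `load_limits` and `cut_records` are made up for illustration. It assumes that the limit means "keep the first N characters", as the slice in the question's code does, and that records whose key has no entry in num.txt are left unchanged.

```python
def load_limits(lines):
    """Map each key (first field) to its integer limit (second field)."""
    limits = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) == 2 and parts[1].isdigit():
            limits[parts[0]] = int(parts[1])
    return limits


def cut_records(lines, limits):
    """Truncate each data line to the limit named by its header's last column."""
    out = []
    key = None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Header line: remember the last '|'-separated column as the key.
            key = line.rsplit('|', 1)[-1]
            out.append(line)
        else:
            limit = limits.get(key)
            # Assumption: keys missing from num.txt leave the data untouched.
            out.append(line if limit is None else line[:limit])
    return out


# Sample data copied from the question.
num_lines = ['M 4', 'P 10']
input_lines = [
    '>012|013|0|3|M', 'AFDSFASDFASDFA',
    '>005|5|67|0|6', 'ACCTCTGACC',
    '>029|032|4|5|S', 'GGCAGGGAGCAGGCCTGTA',
]

limits = load_limits(num_lines)
result = cut_records(input_lines, limits)
print('\n'.join(result))
```

With the sample data, only the record ending in `M` is shortened (to `AFDS`); the keys `6` and `S` have no entry in num.txt, so those records pass through unchanged.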
answered 2013-04-10T22:06:31.790
You can do it like this:

import re

min_count = 4 # this variable will contain that count integer from where to start removing
str_to_match = 'EOG6CC67M' # this variable will contain the filename you read

input = '' # The file input (input.txt) will go in here

counter = 0

def callback_f(e):
    global min_count
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())

    # Only replace the value with nothing (remove it) after a certain count
    if counter > min_count:
        return '' # replace with nothing
    return e.group() # otherwise keep the match unchanged

result = re.sub(r''+str_to_match, callback_f, input)

With this strategy you count matches with a global counter, so there is no need for hand-written loops over complex structures.
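
The same counting idea can be sketched without a global by keeping the counter in a closure. The function name `remove_after` is hypothetical; the sketch assumes the string to match is plain text rather than a pattern (hence `re.escape`) and that earlier matches should stay in place.

```python
import itertools
import re


def remove_after(text, literal, keep):
    """Keep the first `keep` occurrences of `literal` and remove the rest."""
    counter = itertools.count(1)

    def callback(match):
        # Return the match unchanged while within the limit, else drop it.
        return match.group() if next(counter) <= keep else ''

    # re.escape treats `literal` as plain text, not as a regular expression.
    return re.sub(re.escape(literal), callback, text)


print(remove_after('ab-ab-ab-ab', 'ab', 2))  # prints ab-ab--
```

Each call to `remove_after` gets a fresh counter, so nothing has to be reset between files.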

Update

A more detailed version with file access:

import os
import re

def callback_f(e):
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())


# Fetch all hash-file names and their content (count)
num_files = os.listdir('./num_files')
numbers = {}

for file in num_files:
    if file[0] != '.':
        file_c = open('./num_files/' + file)
        file_c = file_c.read()

        numbers[file.split('.')[0]] = file_c


# Now the CSV files
csv_files = os.listdir('./csv_files')

for file in csv_files:
    if file[0] != '.':
        for hash_name, min_count in numbers.items():  # items() works on Python 2 and 3
            file_c = open('./csv_files/' + file)
            file_c = file_c.read()

            counter = 0

            result = re.sub(r''+hash_name, callback_f, file_c)

            # Write the replaced content back to the file here

The directory/file structure assumed:

 + Projects
   + Project_folder
     + csv_files
       - input1.csv
       - input2.csv
       ~ etc.
     + num_files
       - EOG6CC67M.txt
       - EOG62JQZP.txt
       ~ etc.
     - python_file.py
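  • The CSV files contain the large amounts of text you described in the original question.
  • The num files are the hash-named files, each containing an integer.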

What happens in this script:

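  1. Collect all hash files (in a dictionary) together with the count stored inside each
  2. Loop over all CSV files
  3. For each CSV file, loop over the collected numbers
  4. callback_f() replaces/removes the hash after a certain count (depending on what you do inside it)
  5. Write the output back (this is the last comment in the script, where the file.write() call would go)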
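
The five steps above, including the write-back that the script leaves as a comment, can be sketched end to end as follows. This is only one possible reading: the function names are hypothetical, it assumes Python 3, that each num file holds a single integer, that matches beyond that count are removed, and that each CSV file is overwritten in place. The demo at the bottom runs in a temporary directory so no real files are touched.

```python
import os
import re
import tempfile


def load_counts(num_dir):
    """Step 1: return {hash_name: count} from files such as EOG6CC67M.txt."""
    counts = {}
    for name in os.listdir(num_dir):
        if name.startswith('.'):
            continue  # skip hidden files
        with open(os.path.join(num_dir, name)) as handle:
            counts[name.split('.')[0]] = int(handle.read().strip())
    return counts


def limit_occurrences(text, literal, keep):
    """Step 4: keep the first `keep` matches of `literal`, remove later ones."""
    seen = 0

    def callback(match):
        nonlocal seen
        seen += 1
        return match.group() if seen <= keep else ''

    return re.sub(re.escape(literal), callback, text)


def process(num_dir, csv_dir):
    counts = load_counts(num_dir)
    for name in os.listdir(csv_dir):                    # step 2
        if name.startswith('.'):
            continue
        path = os.path.join(csv_dir, name)
        with open(path) as handle:
            text = handle.read()
        for hash_name, keep in counts.items():          # step 3
            text = limit_occurrences(text, hash_name, keep)
        with open(path, 'w') as handle:                 # step 5
            handle.write(text)


# Demo with made-up file contents in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    num_dir = os.path.join(root, 'num_files')
    csv_dir = os.path.join(root, 'csv_files')
    os.makedirs(num_dir)
    os.makedirs(csv_dir)
    with open(os.path.join(num_dir, 'EOG6CC67M.txt'), 'w') as handle:
        handle.write('1\n')
    with open(os.path.join(csv_dir, 'input1.csv'), 'w') as handle:
        handle.write('EOG6CC67M,a\nEOG6CC67M,b\n')
    process(num_dir, csv_dir)
    with open(os.path.join(csv_dir, 'input1.csv')) as handle:
        demo_result = handle.read()
```

Reading each CSV file once and applying every hash to the same string avoids reopening the file for every hash, and the changes made for one hash are not lost when the next one is processed.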
answered 2013-04-10T18:18:47.823