
I have to process the .txt files found in the subfolders of a folder, laid out like this:
New Folder > Folder 1 to 6 > xx.txt & yy.txt (files present in each folder)
Each file contains two columns:

arg  his
asp  gln
glu  his

arg his
glu arg
arg his
glu asp

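Now what I need to do is:
1) Count the occurrences of each word in each file and average that count by dividing it by the total number of lines in that file.
2) Take the values obtained in step 1, add them up across the files in the folder, and divide by the number of files in that folder (2 in this case).
I have tried the following code. It works for the first step, but I cannot get the second: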

import glob
import os

for root,dirs,files in os.walk(path):
    asp_count = 0
    glu_count = 0
    lys_count = 0
    arg_count = 0
    his_count = 0
    acid_count = 0
    base_count = 0
    count = 0
    listOfFile = glob.iglob(os.path.join(root,'*.txt'))
    for filename in listOfFile:
        lineCount = 0
        asp_count_col1 = 0
        asp_count_col2 = 0
        glu_count_col1 = 0
        glu_count_col2 = 0
        lys_count_col1 = 0
        lys_count_col2 = 0
        arg_count_col1 = 0
        arg_count_col2 = 0
        his_count_col1 = 0
        his_count_col2 = 0
        count += 1
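        # Read the file up front so the loop body below keeps its indentation.
        with open(filename) as inp:
            rows = [row.split() for row in inp]
        for line in rows:
            lineCount += 1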
            k = line[4]
            m = line[6]
            if k == 'ASP':
                asp_count_col1 += 1
            elif m == 'ASP':
                asp_count_col2 += 1
            if k == 'GLU':
                glu_count_col1 += 1
            elif m == 'GLU':
                glu_count_col2 += 1
            if k == 'LYS':
                lys_count_col1 += 1
            elif m == 'LYS':
                lys_count_col2 += 1
            if k == 'ARG':
                arg_count_col1 += 1
            elif m == 'ARG':
                arg_count_col2 += 1
            if k == 'HIS':
                his_count_col1 += 1
            elif m == 'HIS':
                his_count_col2 += 1
        asp_count = (float(asp_count_col1 + asp_count_col2))/lineCount   
        glu_count = (float(glu_count_col1 + glu_count_col2))/lineCount   
        lys_count = (float(lys_count_col1 + lys_count_col2))/lineCount   
        arg_count = (float(arg_count_col1 + arg_count_col2))/lineCount   
        his_count = (float(his_count_col1 + his_count_col2))/lineCount   

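At this point I can get the average for each file. But how can I get the average for each subfolder, i.e. dividing by count (the total number of files)? The problem is the second part; the first part is done. The code above averages the values for each file. What I want is to add these per-file averages together and divide by the total number of files present in the subfolder to get a new average.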
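
To make the two steps concrete, here is a minimal sketch that runs them on the two sample files shown above, held as strings rather than read from disk. The helper names `per_file_average` and `folder_average` are made up for illustration, and the sketch assumes a word that is missing from a file contributes 0 for that file.

```python
from collections import Counter, defaultdict

# Contents of the two sample files from the question (one pair per line).
xx = "arg his\nasp gln\nglu his\n"
yy = "arg his\nglu arg\narg his\nglu asp\n"

def per_file_average(text):
    """Step 1: occurrences of each word (both columns) / number of lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = Counter(word for ln in lines for word in ln.split())
    return dict((word, n / float(len(lines))) for word, n in counts.items())

def folder_average(file_averages):
    """Step 2: add up the per-file averages and divide by the number of files."""
    totals = defaultdict(float)
    for averages in file_averages:
        for word, value in averages.items():
            totals[word] += value
    return dict((word, total / len(file_averages)) for word, total in totals.items())

result = folder_average([per_file_average(xx), per_file_average(yy)])
for word in sorted(result):
    print("%s %.4f" % (word, result[word]))
```

For arg, the per-file averages are 1/3 and 3/4, so the folder value is their mean, 13/24 (about 0.54).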

3 Answers

import os
from collections import Counter

aminoAcids = set('asp glu lys arg his'.split())

filesToCounts = {}

for root,dirs,files in os.walk(subfolderPath):
    for file in files:
        if file.endswith('.txt'):
            path = os.path.join(root,file)
            with open(path) as f:
                acidsInFile = f.read().split()

            assert all(a in aminoAcids for a in acidsInFile)
            # Key by full path: xx.txt and yy.txt recur in every subfolder.
            filesToCounts[path] = Counter(acidsInFile)

def averageOfCounts(counts):
    numberOfAcids = sum(counts.values())
    assert numberOfAcids%2==0
    numberOfAcidPairs = numberOfAcids/2.0  # one pair per line; float division on Python 2 as well
    return dict((acid,acidCount/numberOfAcidPairs) for acid,acidCount in counts.items())

filesToAverages = dict((file,averageOfCounts(counts)) for file,counts in filesToCounts.items())
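
The dictionary above stops at per-file averages. One possible way to get from there to the per-subfolder figure the question asks for is sketched below. It assumes the dictionary is keyed by full file path (bare names such as xx.txt repeat across subfolders), and `averagePerFolder` and the `example` data are hypothetical, not part of the original answer.

```python
import os
from collections import defaultdict

def averagePerFolder(pathsToAverages):
    """Mean of the per-file averages, grouped by containing folder.

    pathsToAverages maps a full file path to an {acid: average} dict.
    An acid absent from a file counts as 0 for that file.
    """
    sums = defaultdict(lambda: defaultdict(float))
    fileCounts = defaultdict(int)
    for path, averages in pathsToAverages.items():
        folder = os.path.dirname(path)
        fileCounts[folder] += 1
        for acid, value in averages.items():
            sums[folder][acid] += value
    return dict(
        (folder, dict((acid, total / fileCounts[folder])
                      for acid, total in acidSums.items()))
        for folder, acidSums in sums.items()
    )

# Made-up per-file averages, only to show the shape of the input and output.
example = {
    os.path.join('Folder 1', 'xx.txt'): {'arg': 0.5, 'his': 1.0},
    os.path.join('Folder 1', 'yy.txt'): {'arg': 1.0},
    os.path.join('Folder 2', 'xx.txt'): {'glu': 0.25},
}
folderAverages = averagePerFolder(example)
```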
answered 2012-06-05T07:04:45.853

Using os.walk together with glob.iglob is bogus. Use one or the other, not both. Here is how I would do it:

import os, os.path, re, pprint, sys
#...
for root, dirs, files in os.walk(path):
  counts = {}
  nlines = 0
  for f in filter(lambda n: re.search(r'\.txt$', n), files):
    for l in open(os.path.join(root, f), 'rt'):
      nlines += 1
      for k in l.split():
        counts[k] = counts[k]+1 if k in counts else 1
  for k, v in counts.items():
    counts[k] = float(v)/nlines

  sys.stdout.write('Frequencies for directory %s:\n'%root)
  pprint.pprint(counts)
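
Note that this pools every line in a directory before dividing, which is not the same number as the mean of the per-file averages the question describes whenever the files differ in length. A small sketch of the difference, using the counts of arg in the question's two sample files (the variable names are made up for illustration):

```python
# Occurrences of "arg" and line counts in the question's two sample files.
files = [
    {"lines": 3, "arg": 1},  # first sample file: arg appears once in 3 lines
    {"lines": 4, "arg": 3},  # second sample file: arg appears three times in 4 lines
]

# Pooled: total occurrences over total lines in the directory.
pooled = sum(f["arg"] for f in files) / float(sum(f["lines"] for f in files))

# Mean of means: average each file first, then average those averages.
mean_of_means = sum(f["arg"] / float(f["lines"]) for f in files) / len(files)

print("pooled=%.4f mean_of_means=%.4f" % (pooled, mean_of_means))
```

The pooled value is 4/7 and the mean of means is 13/24; the two agree only when every file has the same number of lines.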
answered 2012-06-05T06:59:21.400

I like ninjagecko's answer, but I understand the question differently. Using his code as a starting point, I propose the following:

import os
from collections import Counter, defaultdict

aminoAcids = set('asp glu lys arg his'.split())

subfolderFreqs = {}

for root,dirs,files in os.walk(subfolderPath):
    cumulativeFreqs = defaultdict(int)
    fileCount = 0
    for file in files:
        if file.endswith('.txt'):
            fileCount += 1
            path = os.path.join(root,file)
            with open(path) as f:
                acidsInFile = f.read().split()

            counts = Counter(acidsInFile)
            assert aminoAcids.issuperset(counts)
            numberOfAcidPairs = len(acidsInFile)/2
            for acid, acidCount in counts.items():
                cumulativeFreqs[acid] += float(acidCount) / numberOfAcidPairs
    if fileCount:
        subfolderFreqs[root] = {acid: cumulative/fileCount for acid, cumulative in cumulativeFreqs.items()}

print(subfolderFreqs)
answered 2012-06-05T08:41:05.023