python - Pythonic way to compare values in an array?

Question

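Problem:

The input is a tab-delimited file. Rows are variables and columns are samples. Each variable can take one of three values (00, 01, 11), and the variables are listed in an order (v1->vN) that must be preserved. There are many rows and many columns, so the input file has to be read in chunks.

The input looks like this: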
```
   s1 s2 s3 s4
v1 00 00 11 01
v2 00 00 00 00
v3 01 11 00 00
v4 00 00 00 00
(...)
```
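What I want to do is split the input into segments of several rows each, where each segment is just large enough that every sample is unique. In the example above, starting from v1, the first block should end at v3, because at that point there is enough information for the samples to be unique. The next block starts at v4 and the process repeats. The task ends when the last row is reached. The blocks should be printed to an output file.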
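To make the stopping rule concrete, here is a minimal sketch that checks, after each added row, whether every sample column differs from all the others. The helper name `columns_are_unique` is hypothetical and not part of the original post.

```python
def columns_are_unique(rows):
    """Return True if no two columns of `rows` hold the same sequence of values."""
    columns = list(zip(*rows))  # transpose: one tuple per sample
    return len(set(columns)) == len(columns)


rows = [
    ['00', '00', '11', '01'],  # v1
    ['00', '00', '00', '00'],  # v2
    ['01', '11', '00', '00'],  # v3
]

for n in range(1, len(rows) + 1):
    print(n, columns_are_unique(rows[:n]))
# 1 False  (s1 and s2 are both '00')
# 2 False  (v2 adds no information)
# 3 True   (s1 and s2 now differ at v3)
```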

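My attempt:

What I tried was to use the csv module to build an array made of lists, each list holding the states of a single variable across all samples (00,01,00). Alternatively, by rotating the input, I could create one list per sample holding that sample's states across the variables. I am asking whether the work should focus on columns or on rows, i.e. whether it is better to work with v1=['00','00','11','01'] or s1=['00','00','01','00',...]

The code below is the rotation step, where I try to turn the column problem into a row problem. (Sorry for the clumsy Python syntax, it's the best I can do.)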

my_infilename='my_file.txt'
csv_infile=csv.reader(open(my_infilename,'r'), delimiter='\t')
out=open('transposed_'+my_infilename, 'w')
csv_infile=zip(*csv_infile)
line_n=0
for line in csv_infile:
line_n+=1
    if line_n==1:    #headers
        continue
    else:
        line=(','.join(line)+'\n')  #just to make it readable to me
        out.write(line)
out.close()

What is the best way to approach this problem? Does rotating help at all? Are there built-in functions I can rely on?

score 2 · Accepted Answer

Assuming you import the csv data as a list of lists of equal length, does this work for you...

def get_block(data_rows):
    samples = []

    for cell in data_rows[0]:
        samples.append('')

    # add one row at a time to each sample and see if all are unique
    for row_index, row in enumerate(data_rows):
        for cell_index, cell in enumerate(row):
            samples[cell_index] = '%s%s' % (samples[cell_index], cell)

        are_all_unique = True
        sample_dict = {} # use dictionary keys to find repeats
        for sample in samples:
            if sample_dict.get(sample):
                # already there, so another row needed
                are_all_unique = False
                break
            sample_dict[sample] = True # add the key to the dictionary
        if are_all_unique:
            return True, row_index

    return False, None

def get_all_blocks(all_rows):
    remaining_rows = all_rows[:] # make a copy    
    blocks = []

    while True:
        found_block, block_end_index = get_block(remaining_rows)
        if found_block:
            blocks.append(remaining_rows[:block_end_index+1])
            remaining_rows = remaining_rows[block_end_index+1:]
            if not remaining_rows:
                break
        else:
            blocks.append(remaining_rows[:])
            break

    return blocks


if __name__ == "__main__":
    v1 = ['00', '00', '11', '01']
    v2 = ['00', '00', '00', '00']
    v3 = ['01', '11', '00', '00']
    v4 = ['00', '00', '00', '00']

    all_rows = [v1, v2, v3, v4]

    blocks = get_all_blocks(all_rows)

    for index, block in enumerate(blocks):
        print("This is block %s." % index)
        for row in block:
            print(row)
        print()

==================

This is block 0.
['00', '00', '11', '01']
['00', '00', '00', '00']
['01', '11', '00', '00']

This is block 1.
['00', '00', '00', '00']
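Since the question mentions that the file is too large to load at once, the same idea can also be written as a generator that consumes rows one at a time. This is only a sketch under the assumption that rows arrive as equal-length lists (for example from a `csv.reader`); the name `iter_blocks` is illustrative and not from the answer above.

```python
def iter_blocks(rows):
    """Yield lists of rows; each block ends as soon as all columns differ.

    `rows` may be any iterable of equal-length lists, so the whole input
    never has to be held in memory at once.
    """
    block = []
    keys = None  # one growing tuple per sample
    for row in rows:
        if keys is None:
            keys = [()] * len(row)
        # extend each sample's key with its value in the current row
        keys = [key + (cell,) for key, cell in zip(keys, row)]
        block.append(row)
        if len(set(keys)) == len(keys):  # every sample is now unique
            yield block
            block, keys = [], None
    if block:  # trailing rows that never became unique
        yield block


all_rows = [
    ['00', '00', '11', '01'],  # v1
    ['00', '00', '00', '00'],  # v2
    ['01', '11', '00', '00'],  # v3
    ['00', '00', '00', '00'],  # v4
]

for index, block in enumerate(iter_blocks(all_rows)):
    print("This is block %s." % index)
    for row in block:
        print(row)
```

Using tuples as keys and a `set` for the uniqueness test avoids building a dictionary by hand. Like the answer above, this sketch emits leftover rows as a final block even if they never become unique.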

score 0 · Accepted Answer

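I don't understand your problem at all ("coordinate variables"? "unambiguously determine the samples"?), but I do know that you are using the csv module incorrectly and that your indentation is also wrong.

I don't know exactly what your input file looks like, but assuming it is tab-delimited, the following (untested) script shows one way to take blocks from the input file, rotate them, and write them back to an output file.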

import csv

# this is not strictly necessary, but you can define a custom dialect for input and output

class SampleDialect(csv.Dialect):
    delimiter = "\t"
    quoting = csv.QUOTE_NONE
    lineterminator = "\n"  # a csv.Dialect subclass must set a line terminator

sampledialect = SampleDialect()

ifn = 'my_file.txt'
ofn = 'transposed_' + ifn

# on Python 3, csv files are opened in text mode with newline=''
with open(ifn, 'r', newline='') as ifp, open(ofn, 'w', newline='') as ofp:
    incsv = csv.reader(ifp, dialect=sampledialect)
    outcsv = csv.writer(ofp, dialect=sampledialect)

    header = None
    block = []
    for lineno, samples in enumerate(incsv):
        if lineno == 0:  # header
            header = samples
            continue
        block.append(samples)
        if lineno % 3 == 0:
            # end of block (every three data rows)
            # do something with block
            # then write it out
            outcsv.writerows(block)
            block = []

    if block:
        # write any rows left over after the last full block
        outcsv.writerows(block)
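The script above leaves the "do something with block" step open. If the goal is to rotate each block so that samples become rows, one possible sketch uses `zip(*block)`. The function name `transpose_block` is illustrative and not part of the answer.

```python
def transpose_block(block):
    """Turn a list of variable rows into a list of sample rows."""
    return [list(column) for column in zip(*block)]


block = [
    ['00', '00', '11', '01'],  # v1
    ['00', '00', '00', '00'],  # v2
    ['01', '11', '00', '00'],  # v3
]

for sample in transpose_block(block):
    print(sample)
# ['00', '00', '01']
# ['00', '00', '11']
# ['11', '00', '00']
# ['01', '00', '00']
```

Note that `zip` stops at the shortest row, so this assumes every row in the block has the same number of cells.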
