
This question relates to the 'Compute row means.' code block below. The code was originally written with six samples in mind, and I'm trying to scale it to an arbitrary number of samples (n).

Each CSV file holds one patient's sample, with one gene and its expression value per row:

| gene  | expression |
| ---   | ---        |
| A1BG  | 1.444      | 
| A1CF  | 4.303      |
| A2BP1 | 11.117     |

The original file list has been changed to accept command-line arguments so that it scales, but I don't know how to proceed from there. I need to pull each sample name and use it in that code block, while also correctly incrementing the slice notation within each separate list comprehension. Any ideas?

import csv
import matplotlib.pyplot as plt
import sys

"""
This is an implementation of quantile normalization for microarray data analysis.
"""

# Parse csv files for samples, creating lists of gene names and expression values.
#file_list =  ['genes1.csv', 'genes2.csv', 'genes3.csv', 'genes4.csv', 'genes5.csv',
#                'genes6.csv']
# Take the sample files from the command line; stop early if none were given.
if len(sys.argv) > 1:
    file_list = sys.argv[1:]
    print(file_list)
else:
    print("Not enough arguments given.")
    sys.exit(1)

set_dict = {}
for path in file_list:
    with open(path) as stream:
        data = list(csv.reader(stream, delimiter = '\t'))
    data = sorted([(i, float(j)) for i, j in data], key = lambda v: v[1])
    sample_genes = [i for i, j in data]
    sample_values = [j for i, j in data]
    set_dict[path] = (sample_genes, sample_values)

# Create sorted list of genes and values for all datasets.
set_list = list(set_dict.items())
# Restore the order in which the files were given on the command line.
set_list.sort(key=lambda item: file_list.index(item[0]))

This is the code block that needs to be scaled to handle any number of samples given as command-line arguments:

# Compute row means.
mean_values = [((a + b + c + d + e + f)/len(file_list)) 
                for i, (a, b, c, d, e, f) in 
                enumerate(zip([v for i, (j, k) in set_list[:1] for v in k], 
                [v for i, (j, k) in set_list[1:2] for v in k], 
                [v for i, (j, k) in set_list[2:3] for v in k], 
                [v for i, (j, k) in set_list[3:4] for v in k], 
                [v for i, (j, k) in set_list[4:5] for v in k], 
                [v for i, (j, k) in set_list[5:6] for v in k]))]

The corrected solution, given below by @Bo102010:

L = len(file_list)
all_sets = [set_list[i - 1: i] for i in range(1, L + 1)]
all_values = [[v for i, (j, k) in A for v in k] for A in all_sets]
mean_values = [sum(p) / L for p in zip(*all_values)]

1 Answer


If I understand your code block correctly, you should be able to use the "star" (*) to unpack an iterable. Use it in the call to zip.

L = len(file_list)
all_sets = [set_list[i - 1: i] for i in range(1, L + 1)]
all_values = [[v for i, (j, k) in A for v in k] for A in all_sets]
mean_values = [sum(p) / L for p in zip(*all_values)]
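
To see the star unpacking at work on something small, here is a minimal sketch with made-up data. The three-sample set_list and the row_means helper are hypothetical stand-ins, not part of the original script; the sketch assumes each entry has the shape (path, (sorted_genes, sorted_values)) and that every sample has the same number of genes, since zip stops at the shortest list.

```python
# Hypothetical stand-in for set_list: one (path, (genes, values)) entry per
# sample, with the values already sorted in ascending order.
set_list = [
    ("genes1.csv", (["A1BG", "A1CF", "A2BP1"], [1.0, 4.0, 11.0])),
    ("genes2.csv", (["A1CF", "A1BG", "A2BP1"], [2.0, 5.0, 8.0])),
    ("genes3.csv", (["A1BG", "A2BP1", "A1CF"], [3.0, 6.0, 14.0])),
]


def row_means(set_list):
    """Return the mean of the k-th smallest value across samples, for each k."""
    # One list of sorted values per sample, in the order of set_list.
    all_values = [values for _path, (_genes, values) in set_list]
    n = float(len(all_values))
    # zip(*all_values) groups the k-th value of every sample into one tuple.
    return [sum(row) / n for row in zip(*all_values)]


mean_values = row_means(set_list)
print(mean_values)  # [2.0, 5.0, 11.0]
```

Because each entry of set_list already carries its list of values, unpacking the tuples directly gives the same all_values as the one-element slices above, without tracking any slice indices.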
answered 2013-09-16T12:19:15.007