I have a table (curves.csv) organised like this (disorganised would be a better description):

CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o

I would like to convert this table to:

,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,

This is what I have so far:

celllines=["A","B","C","D","E"]
sorted_days=["1","2","3","4","5","8"]
for d in sorted_days:
    curves=open("curves.csv","rU")
    for line in curves:
        line=line.rstrip().rsplit(",")
        if line[0]!="CL":#removes header
            for x in range(0,len(line),3):
                if line[x] in celllines:
                    if line[x+1] == d:
                        print d,line[x],line[x+2]
                    else:
                        print d, line[x],""



    curves.close()

I just feel that I am getting further from the answer rather than closer! As always, any pointers would be much appreciated.

5 Answers

2

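I find the best way to solve problems like this is to separate taking the old format apart from building the new one. Rather than converting directly from one to the other, parse the old format into a sane data structure that makes the data easy to work with in Python, then build the new format from that malleable structure.

Since we are dealing with comma-separated values on both sides, the csv module from the standard library simplifies this kind of work a great deal.

The solution also makes heavy use of list comprehensions (and their various cousins), so if you are not familiar with them I suggest reading up on them (the link in the original post was to a short video of mine explaining them).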

import csv
import itertools

def grouper(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)

with open("curves.csv") as file:
    data = csv.reader(file)
    next(data) #Ignore header row.
    parsed = {(column, row): value for line in data
              for column, row, value in grouper(3, line)}

rows = sorted({row for (_, row) in parsed})
columns = sorted({column for (column, _) in parsed})

with open("output.csv", "w") as file:
    writer = csv.writer(file)
    writer.writerow([None] + columns)
    writer.writerows([[row]+[parsed.get((column, row))
                             for column in columns]
                      for row in rows])

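We start by opening the file with a with statement (best practice, since it ensures the file gets closed), then skip the header row and parse the data. To do this, we take each line of the data and group it into chunks of length 3 (using the grouper() function, which is an itertools recipe). Each chunk gives us a column, a row and a value, which we then use as the key and value of a dictionary entry.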
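
To make the chunking step concrete, here is a small sketch of what grouper() yields for the first data row of curves.csv. The row is hard-coded for illustration rather than read from the file, and the names line and triples are only used in this example.

```python
import itertools

def grouper(n, iterable, fillvalue=None):
    # n references to the SAME iterator, so each output tuple
    # consumes n consecutive items from the input.
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

# First data row of curves.csv, hard-coded for illustration.
line = ["A", "1", "a", "B", "1", "b", "C", "1", "c",
        "D", "1", "d", "E", "1", "e"]

triples = list(grouper(3, line))
for column, row, value in triples:
    print(column, row, value)
```

Each tuple is one (CL, D, PD) triplet. If a line's length were not a multiple of 3, the last tuple would be padded with fillvalue.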

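This gives us a dictionary like {("A", "1"): "a", ...} (the csv module yields every field as a string). This is a nice format to work with, so now we build the file back up in the desired format.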

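First we need to know which rows and columns we need. We take just the row labels (and, likewise, the column labels) from the parsed data, put them in a set (as sets can't contain duplicates), and finally sort them back into a list so that we have the right order.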
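
A small sketch of that step on a hand-written subset of the parsed dictionary. The day "10" entry is a made-up addition, not part of the example data, included to show one caveat: the labels are strings, so a plain sorted() orders them lexicographically. Passing key=int gives numeric order, on the assumption that every row label is an integer.

```python
# Hand-written subset of the parsed data; ("C", "10") is hypothetical.
parsed = {
    ("A", "1"): "a",
    ("B", "3"): "g",
    ("A", "2"): "f",
    ("B", "1"): "b",
    ("C", "10"): "x",
}

# Set comprehensions drop the duplicate labels; sorted() returns a list.
rows = sorted({row for (_, row) in parsed})
columns = sorted({column for (column, _) in parsed})

# Numeric ordering, assuming all row labels are integer strings.
numeric_rows = sorted({row for (_, row) in parsed}, key=int)

print(rows)
print(numeric_rows)
print(columns)
```

For the data in the question every day is a single digit, so both orderings agree.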

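Then we open our output file and write the column headers to it (remembering to add a None for the row-label column), and then write out our data. For each row, we write the row label and then get the value for each column from our parsed data, using dict.get() so that we get None if there is no value. This gives the wanted output.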
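
The row-building step can be sketched on its own. The small parsed dictionary below is a hand-written subset of the example data, and the output goes to an in-memory io.StringIO buffer (an assumption made for this example) instead of output.csv, so the result can be inspected directly.

```python
import csv
import io

# Hand-written subset of the parsed data, for illustration only.
parsed = {("A", "1"): "a", ("B", "1"): "b", ("A", "2"): "f", ("B", "3"): "g"}
rows = ["1", "2", "3"]
columns = ["A", "B"]

buffer = io.StringIO()
writer = csv.writer(buffer)

# Header: an empty cell above the row labels, then the column labels.
writer.writerow([None] + columns)

# dict.get() returns None for a missing (column, row) pair,
# and csv.writer writes None as an empty field.
for row in rows:
    writer.writerow([row] + [parsed.get((column, row)) for column in columns])

output_lines = buffer.getvalue().splitlines()
for text in output_lines:
    print(text)
```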

Note: It looks like you are using Python 2.x in your question, while my answer is written for 3.x. The only difference should be that itertools.zip_longest() is called itertools.izip_longest() in 2.x.
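
If the same script has to run on both versions, one common pattern is to try the 3.x name first and fall back to the 2.x name. This is only a sketch and not part of the solution above; the chunks variable and the "ABCDEFG" input are there purely to exercise the function.

```python
try:
    from itertools import zip_longest  # Python 3.x name
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2.x name

def grouper(n, iterable, fillvalue=None):
    # Same recipe as above, written against the version-neutral alias.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

chunks = list(grouper(3, "ABCDEFG", fillvalue="-"))
print(chunks)
```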

answered 2013-01-10T23:41:31.153
2

How about something like this, using the csv module:

import csv

# make a dictionary to store the data
data = {}

# first, read it in
with open("curves.csv", "rb") as fp:

    # make a csv reader object
    reader = csv.reader(fp)

    # skip initial line
    next(reader)

    for row in reader:
        # for each triplet, store it in the dictionary
        for i in range(len(row)//3):
            CL, D, PD = row[3*i:3*i+3]
            data[D, CL] = PD

# see what we've got
print data

with open("newcurves.csv", "wb") as fp:
    # get the labels in order
    row_labels = sorted(set(k[0] for k in data), key=int)
    col_labels = sorted(set(k[1] for k in data))

    writer = csv.writer(fp)
    # write header
    writer.writerow([''] + col_labels)

    # write data rows
    for row_label in row_labels:
        # start with the label
        row = [row_label]

        # then extend a list of the data in order, using the empty string '' if
        # there's no such value
        row.extend([data.get((row_label, col_label), '') for col_label in col_labels])

        # dump it out
        writer.writerow(row)

This gives us a dictionary that looks like

{('1', 'D'): 'd', ('1', 'E'): 'e', ('5', 'C'): 'm', ('1', 'B'): 'b', ('2', 'E'): 'j', ('1', 'C'): 'c', ('5', 'A'): 'k', ('6', 'B'): 'l', ('2', 'C'): 'h', ('1', 'A'): 'a', ('4', 'D'): 'i', ('8', 'D'): 'n', ('2', 'A'): 'f', ('3', 'B'): 'g', ('5', 'E'): 'o'}

and an output file like

~/coding$ cat newcurves.csv 
,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,
answered 2013-01-10T23:23:56.463
2

Just to show (a bit late) that it can also be done in R:

curves <- read.csv("curves.csv", as.is = TRUE)
stack  <- data.frame(CL = unlist(curves[, c(TRUE, FALSE, FALSE)]),
                     D  = unlist(curves[, c(FALSE, TRUE, FALSE)]),
                     PD = unlist(curves[, c(FALSE, FALSE, TRUE)]),
                     stringsAsFactors = FALSE)
library(reshape2)
output <- acast(stack, D ~ CL, value.var = "PD", fill = "")
write.csv(output, "new_curves.csv", quote = FALSE)

If you would rather not use third-party packages, you can do it all in base R:

curves   <- read.csv("curves.csv", as.is = TRUE)
rownames <- sort(unique(unlist(curves[, c(FALSE, TRUE, FALSE)])))
colnames <- sort(unique(unlist(curves[, c(TRUE, FALSE, FALSE)])))
output   <- matrix("", nrow = length(rownames), ncol = length(colnames),
                       dimnames = list(rownames, colnames))
fill.i   <- match(unlist(curves[, c(FALSE, TRUE, FALSE)]), rownames)
fill.j   <- match(unlist(curves[, c(TRUE, FALSE, FALSE)]), colnames)
fill.x   <- unlist(curves[, c(FALSE, FALSE, TRUE)])
output[cbind(fill.i, fill.j)] <- fill.x
write.csv(output, "new_curves.csv", quote = FALSE)
answered 2013-01-12T02:31:11.183
1

An R solution that uses tapply with the concatenation function, c.

crvs <- read.table(text="CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o", header=TRUE, sep=",", check.names=FALSE)

# stack all five (CL, D, PD) triplets on top of each other
long <- rbind(crvs[, 1:3], crvs[, 4:6], crvs[, 7:9], crvs[, 10:12], crvs[, 13:15])
out <- with( long, tapply(PD, list(D, CL), FUN=c) )
#-----------------
write.table(out, quote=FALSE, sep=",", na="")
A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,
answered 2013-01-12T05:09:54.080
1

Without using the csv module:

celllines=["","A","B","C","D","E"]
days=["1","2","3","4","5","6","8"]  # only days present in the data; listing "7" would add an empty row

curves = sum([line.split(',') for line in open("curves.csv","rU").read().split()[1:]], [])

group = {(d,cl): pd for (cl,d,pd) in [curves[i:i+3] for i in range(0,len(curves),3)]}
table = [[d if not x else '' for x in celllines] for d in days]

for (d,cl),pd in group.items():
    table[days.index(d)][celllines.index(cl)] = pd

with open("curves2.csv", "w") as f:
    f.write('\n'.join(','.join(line) for line in [celllines]+table))
answered 2013-01-11T01:03:45.657