I'm fairly new to Python and am trying to use it to merge two sorted files, each with 4 columns:

File 1:

x-coordinate, y-coordinate, data 1, data 2  
1, 10, 20, 0  
5, 15, 1, 2  
...

File 2:

x-coordinate, y-coordinate, data 3, data 4  
1, 10, 7, 8  
3, 25, 1, 2  
...

into a single sorted file with 6 columns that contains every unique (x, y) coordinate pair:

x-coordinate, y-coordinate, data 1, data 2, data 3, data 4  
1, 10, 20, 0, 7, 8  
3, 25, 0, 0, 1, 2  
5, 15, 1, 2, 0, 0  

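If the order of the output file didn't matter, I think this task would be trivial with a dictionary. Since my input files are 100 lines long, I've been trying to come up with an efficient, "pythonic" way to handle the duplicate case (i.e., the same (x, y) coordinate appearing in both files), but so far I'm stumped.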

Any and all help is appreciated. Thanks in advance!

2 Answers

I would probably use a defaultdict for something like this:

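from collections import defaultdict
from itertools import chain

# Each (x, y) key maps to [data 1, data 2, data 3, data 4]; missing values stay 0.
d = defaultdict(lambda: [0, 0, 0, 0])
with open('file1') as f1, open('file2') as f2:
    next(f1)  # get rid of header info
    next(f2)
    # Read the files one after the other: zip(f1, f2) would stop at the
    # end of the shorter file and silently drop the remaining rows.
    for line in f1:
        data = [int(x) for x in line.split(',')]
        d[tuple(data[:2])][:2] = data[2:]
    for line in f2:
        data = [int(x) for x in line.split(',')]
        d[tuple(data[:2])][2:] = data[2:]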

#now sort the items and write them out:
#This puts them in stdout, but you could easily use `file.write` here.
for k,v in sorted(d.items()):
    print(', '.join(str(x) for x in chain(k,v)))
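
Since both inputs are already sorted, the final `sorted()` pass can also be avoided. What follows is only a minimal sketch of a different technique from the defaultdict approach above: a streaming merge with `heapq.merge` and `itertools.groupby`, written out with `csv.writer`. It assumes both inputs are sorted by (x, y) in the same order Python compares tuples, and that every data field is an integer. The names `read_rows` and `merge_sorted` are made up for illustration, and the in-memory `io.StringIO` objects stand in for the real files.

```python
import csv
import heapq
import io
from itertools import groupby


def read_rows(lines, source):
    """Yield ((x, y), source, [value, value]) for every data row."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        values = [int(v) for v in row]
        yield tuple(values[:2]), source, values[2:]


def merge_sorted(lines1, lines2, out):
    """Merge two inputs (assumed sorted by (x, y)) into six-column rows."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["x-coordinate", "y-coordinate",
                     "data 1", "data 2", "data 3", "data 4"])
    # heapq.merge lazily interleaves the two sorted streams.
    merged = heapq.merge(read_rows(lines1, 0), read_rows(lines2, 1))
    for key, group in groupby(merged, key=lambda item: item[0]):
        data = [0, 0, 0, 0]  # zeros for whichever file lacks this key
        for _, source, values in group:
            # source 0 fills the first two slots, source 1 the last two
            data[2 * source:2 * source + 2] = values
        writer.writerow(list(key) + data)


# Sample data from the question, held in memory instead of on disk.
file1 = io.StringIO(
    "x-coordinate, y-coordinate, data 1, data 2\n1, 10, 20, 0\n5, 15, 1, 2\n")
file2 = io.StringIO(
    "x-coordinate, y-coordinate, data 3, data 4\n1, 10, 7, 8\n3, 25, 1, 2\n")
result = io.StringIO()
merge_sorted(file1, file2, result)
print(result.getvalue(), end="")
```

With real files, the same function should accept objects returned by `open(...)` in place of the `StringIO` stand-ins. If the inputs are not actually sorted, the dictionary plus `sorted()` version above is the safer choice.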
answered 2013-04-01T16:55:16.800

Using pandas:

import pandas as pd

df1 = pd.read_csv("coord1.csv")
df2 = pd.read_csv("coord2.csv")
combined = df1.merge(df2, how='outer').fillna(0)
# DataFrame.sort was removed from pandas; sort_values is its replacement.
combined.sort_values(list(combined.columns[:2]), inplace=True)
combined.to_csv("coord_merged.csv",index=False)

First, read in the original data:

>>> import pandas as pd
>>> df1 = pd.read_csv("coord1.csv")
>>> df2 = pd.read_csv("coord2.csv")
>>> df1
   x-coordinate   y-coordinate   data 1   data 2
0             1             10       20        0
1             5             15        1        2
>>> df2
   x-coordinate   y-coordinate   data 3   data 4  
0             1             10        7          8
1             3             25        1          2

Merge them, and fill the missing data with zeros:

>>> combined = df1.merge(df2, how='outer')
>>> combined
   x-coordinate   y-coordinate   data 1   data 2   data 3   data 4  
0             1             10       20        0        7          8
1             5             15        1        2      NaN        NaN
2             3             25      NaN      NaN        1          2
>>> combined = df1.merge(df2, how='outer').fillna(0)
>>> combined
   x-coordinate   y-coordinate   data 1   data 2   data 3   data 4  
0             1             10       20        0        7          8
1             5             15        1        2        0          0
2             3             25        0        0        1          2

Sort:

>>> combined.sort_values(list(combined.columns[:2]), inplace=True)
>>> combined
   x-coordinate   y-coordinate   data 1   data 2   data 3   data 4  
0             1             10       20        0        7          8
2             3             25        0        0        1          2
1             5             15        1        2        0          0

Finally, write it out:

>>> combined.to_csv("coord_merged.csv",index=False)
>>> !cat coord_merged.csv
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4  
1.0,10.0,20.0,0.0,7.0,8.0
3.0,25.0,0.0,0.0,1.0,2.0
5.0,15.0,1.0,2.0,0.0,0.0

If keeping the integer format is important, then

>>> combined.astype(int).to_csv("coord_merged.csv",index=False)
>>> !cat coord_merged.csv
x-coordinate, y-coordinate, data 1, data 2, data 3, data 4  
1,10,20,0,7,8
3,25,0,0,1,2
5,15,1,2,0,0

will do it.

answered 2013-04-01T17:06:35.943