3

I have a two-dimensional dictionary in the following format:

myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}

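How can I write it to a tab-delimited file so that the file contains the content shown below? When a tuple (x, y) is filled in, two positions are populated: (x, y) and (y, x). (x, x) is always 0.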

The output would be:

    a   b   c   d
a   0   10  20  30
b   10  0   40  50
c   20  40  0   60
d   30  50  60  0 

PS: If the dictionary can somehow be converted to a DataFrame (using pandas), it can then easily be written to a file with pandas functions.

4

4 Answers

7

You can do this with the lesser-known align method and a little unstack magic:

In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))

In [123]: df = s.unstack()

In [124]: lhs, rhs = df.align(df.T)

In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)

In [126]: res
Out[126]:
    a   b   c   d
a   0  10  20  30
b  10   0  40  50
c  20  40   0  60
d  30  50  60   0
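
For readers without an IPython session that has already run `from pandas import *`, the steps above can be sketched as a standalone script. The function name `symmetric_frame` is a hypothetical helper introduced here for illustration, and the sketch assumes a pandas version in which a dict with tuple keys passed to `pd.Series` yields a two-level MultiIndex.

```python
import pandas as pd

# Sample data copied from the question.
myDict = {('a', 'b'): 10, ('a', 'c'): 20, ('a', 'd'): 30,
          ('b', 'c'): 40, ('b', 'd'): 50, ('c', 'd'): 60}


def symmetric_frame(pairs):
    """Build a symmetric DataFrame from a {(row, col): value} dict.

    Hypothetical helper, not part of pandas.
    """
    # Tuple keys become a two-level MultiIndex.
    s = pd.Series(pairs)
    # Move the second key level into the columns (upper triangle only).
    upper = s.unstack()
    # Give the frame and its transpose identical row and column labels.
    lhs, rhs = upper.align(upper.T)
    # Add the two triangles; cells missing on both sides become 0.
    return lhs.add(rhs, fill_value=0).fillna(0).astype(int)


res = symmetric_frame(myDict)
print(res)
```

Because the frame and its transpose never hold a value in the same cell here, adding them simply merges the two triangles.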

Finally, to write this to a tab-separated file, use the to_csv method with sep='\t':

In [128]: res.to_csv('res.csv', sep='\t')

In [129]: !cat res.csv
        a       b       c       d
a       0.0     10.0    20.0    30.0
b       10.0    0.0     40.0    50.0
c       20.0    40.0    0.0     60.0
d       30.0    50.0    60.0    0.0

If you want to keep things as integers, cast using DataFrame.astype(), like so:

In [137]: res.astype(int).to_csv('res.csv', sep='\t')

In [138]: !cat res.csv
        a       b       c       d
a       0       10      20      30
b       10      0       40      50
c       20      40      0       60
d       30      50      60      0

(The values were cast to float because of the intermediate step that fills in NaN where indices from one frame are missing from the other.)
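
The float cast can be reproduced in isolation. The following is a minimal sketch on a made-up two-row frame (the names `ints`, `widened`, and `restored` are illustrative only): introducing a missing label forces NaN into an integer column, which pandas stores as float, and `astype(int)` restores integers once the NaN values are filled.

```python
import pandas as pd

# A small integer frame, invented for this illustration.
ints = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])

# Reindexing adds a row 'z' with no data, which pandas marks as NaN.
# NaN is a float value, so the whole column is upcast to float.
widened = ints.reindex(['x', 'y', 'z'])

# After the gaps are filled, the column can be cast back to integers.
restored = widened.fillna(0).astype(int)

print(ints['a'].dtype, widened['a'].dtype, restored['a'].dtype)
```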

@Dan Allan's answer using combine_first is nice:

In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
    a   b   c   d
a   0  10  20  30
b  10   0  40  50
c  20  40   0  60
d  30  50  60   0

Timings:

In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop

In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop

Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?

In [143]: df = DataFrame({i: randn(10**7) for i in range(1, 11)})

In [144]: df2 = DataFrame({i: randn(10**7) for i in range(10)})

In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop

In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop

In this test, DataFrame.combine_first() is faster for larger frames.
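
To rerun the comparison outside IPython, something like the following sketch could be used. It makes several assumptions: the frame size `n` is a much smaller, arbitrary value than the 1e7 rows used above, `numpy.random.default_rng` stands in for the bare `randn`, and the wrapper names `via_align` and `via_combine_first` are hypothetical. No timings are quoted because they depend on the machine and the library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumed size, far smaller than the 1e7 rows used above

# Same layout as the benchmark: columns 1..10 versus columns 0..9.
df = pd.DataFrame({i: rng.standard_normal(n) for i in range(1, 11)})
df2 = pd.DataFrame({i: rng.standard_normal(n) for i in range(10)})


def via_align(a, b):
    """Align both frames, then add them, treating missing cells as 0."""
    lhs, rhs = a.align(b)
    return lhs.add(rhs, fill_value=0).fillna(0)


def via_combine_first(a, b):
    """Take values from a, falling back to b where a has none."""
    return a.combine_first(b).fillna(0)


for fn in (via_align, via_combine_first):
    best = min(timeit.repeat(lambda: fn(df, df2), number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```

Note that the two functions only return the same result when the inputs never hold a value in the same cell, as with df and df.T in the question. On these benchmark frames, columns 1 through 9 overlap, so the align version sums them while combine_first keeps the values from the first frame.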

answered 2013-10-08T22:05:03.530
6
In [49]: data = list(map(list, zip(*myDict.keys()))) + [list(myDict.values())]

In [50]: df = DataFrame(list(zip(*data))).set_index([0, 1])[2].unstack()

In [52]: df.combine_first(df.T).fillna(0)
Out[52]: 
    a   b   c   d
a   0  10  20  30
b  10   0  40  50
c  20  40   0  60
d  30  50  60   0

For posterity: if you are just tuning in, see Phillip Cloud's answer for a neater way of constructing df.

answered 2013-10-08T22:04:00.730
1

Not as elegant as I would like (and it does not use pandas), but it will do until you find something better:

adj = dict()
for (u, v), w in myDict.items():
  # Record each weight in both directions so the matrix is symmetric.
  adj.setdefault(u, dict())[v] = w
  adj.setdefault(v, dict())[u] = w
keys = sorted(adj)  # fixed label order for rows and columns

print('\t' + '\t'.join(keys))
for u in keys:
  # Missing pairs, including the diagonal (u, u), default to 0.
  print(u + '\t' + '\t'.join(str(adj[u].get(v, 0)) for v in keys))

Or, equivalently (if you do not want to construct the adjacency matrix):

# Collect every label that appears in any key, in a fixed order.
keys = sorted(set(label for pair in myDict for label in pair))

print('\t' + '\t'.join(keys))
for u in keys:
  def f(v):
    # Look the pair up in either orientation; default to 0.
    if (u, v) in myDict:
      return str(myDict[(u, v)])
    elif (v, u) in myDict:
      return str(myDict[(v, u)])
    else:
      return "0"
  print(u + '\t' + '\t'.join(f(v) for v in keys))
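
Both snippets above print to standard output, while the question asks for a file. A minimal sketch that writes the same table with the standard-library csv module follows. The function name `write_symmetric_tsv` and the file name `matrix.tsv` are hypothetical, and the diagonal is forced to 0 as the question specifies.

```python
import csv

# Sample data copied from the question.
myDict = {('a', 'b'): 10, ('a', 'c'): 20, ('a', 'd'): 30,
          ('b', 'c'): 40, ('b', 'd'): 50, ('c', 'd'): 60}


def write_symmetric_tsv(pairs, path):
    """Write a {(x, y): value} dict as a symmetric tab-delimited matrix."""
    # Every label that occurs in any key, in sorted order.
    labels = sorted({label for pair in pairs for label in pair})
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
        # Header row: an empty corner cell followed by the column labels.
        writer.writerow([''] + labels)
        for u in labels:
            # Look each pair up in either orientation; (x, x) is always 0.
            row = [0 if u == v else pairs.get((u, v), pairs.get((v, u), 0))
                   for v in labels]
            writer.writerow([u] + row)


write_symmetric_tsv(myDict, 'matrix.tsv')
```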
answered 2013-10-08T21:47:42.320
-2

Got it working using the pandas package.

from pandas import DataFrame

# Find all column names
colnames = sorted(set(name for pair in myDict for name in pair))

# Create a DataFrame initialized with zeros
myDF = DataFrame(0, index=colnames, columns=colnames)

# Fill each item one by one, in both positions.
# .loc avoids the chained assignment myDF[x][y] = ..., which may not
# write back to the original frame.
for (x, y), value in myDict.items():
    myDF.loc[x, y] = value
    myDF.loc[y, x] = value

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")
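
One way to check a file written like this is to read it back. The sketch below uses a small made-up 2x2 frame (`matrix`) and an in-memory buffer instead of a real file, both of which are assumptions for illustration; it relies on `index_label="features"` naming the first column so that `read_csv` can use it as the index.

```python
import io

import pandas as pd

# A tiny symmetric frame, invented for this illustration.
matrix = pd.DataFrame([[0, 10], [10, 0]],
                      index=['a', 'b'], columns=['a', 'b'])

# Write to an in-memory buffer with the same options as above.
buffer = io.StringIO()
matrix.to_csv(buffer, sep='\t', index=True, header=True,
              index_label='features')

# Read it back, using the labelled first column as the index.
buffer.seek(0)
loaded = pd.read_csv(buffer, sep='\t', index_col='features')
print(loaded)
```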
answered 2013-10-08T22:08:14.160