python - Row operations in Python

Question

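I am trying to convert a CSV into a .gexf file for a dynamic Gephi graph. The idea is to have all the parallel edges (edges with the same source and target but different posting dates) contained in the attribute data. In the example, all the dates in the attributes correspond to the posting dates on which John answered Jan's questions in the forum of an online course.

How can I take a CSV like the following: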

Jan John    2012-04-07  2012-06-06
Jan Jason   2012-05-07  2012-06-06
Jan John    2012-03-02  2012-06-07
Jan Jason   2012-03-20  2012-06-08
Jan Jack    2012-03-26  2012-06-09
Jan Janet   2012-05-01  2012-06-10
Jan Jack    2012-05-04  2012-06-11
Jan Jason   2012-05-07  2012-06-12
Jan Jack    2012-05-09  2012-06-13
Jan John    2012-05-15  2012-06-14
Jan Janet   2012-05-15  2012-06-15
Jan Jason   2012-05-20  2012-06-16
Jan Jack    2012-05-23  2012-06-17
Jan Josh    2012-05-25  2012-06-18
Jan Jack    2012-05-28  2012-06-19
Jan Josh    2012-06-01  2012-06-20

and turn it into the following format:

<edge source="Jan" target="John" start="2012-02-20" end="2012-06-06" weight="1" id="133">
        <attvalues>
          <attvalue for="0" value="1" start="2012-04-07" end="2012-06-06"/>
          <attvalue for="0" value="2" start="2012-06-06" end="2012-06-06"/>
          <attvalue for="0" value="3" start="2012-06-06" end="2012-06-06"/>
        </attvalues>
 </edge>
<next edge...
</next edge>

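The approach I tried did not work well. I created two lists and, for each entry in the first list, searched the second list for rows whose first two entries match. On a match, my script would delete the row from the second list and append its pair of dates. Each resulting row would represent the complete correspondence between an asker and an answerer, and I would then write a script to convert that row into edge/attribute data. I have been using this as something of a guide.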

score 3 · Accepted Answer

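Take a look at the Python pandas project, which is designed to simplify this kind of operation. Here is an example of how it can group and parse your data...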

In [12]: import pandas as pd

# Load your CSV as a pandas 'DataFrame'.
In [13]: df = pd.read_csv('your file', names=['source', 'target', 'start', 'end'])

# Look at the first few rows. It worked.
In [14]: df.head()
Out[14]: 
  source target       start         end
0    Jan  Jason  2012-05-07  2012-06-06
1    Jan   John  2012-03-02  2012-06-07
2    Jan  Jason  2012-03-20  2012-06-08
3    Jan   Jack  2012-03-26  2012-06-09
4    Jan  Janet  2012-05-01  2012-06-10

# Group the rows by the name columns. Each unique pair gets its own group.
In [15]: edges = df.groupby(['source', 'target'])

In [16]: for (source, target), edge in edges:  # consider each unique name pair an edge
   ....:     print(source, target)
   ....:     for _, row in edge.iterrows():  # loop through all the rows belonging to these names
   ....:         print(row['start'], row['end'])
   ....:
Jan Jack
2012-03-26 2012-06-09
2012-05-04 2012-06-11
2012-05-09 2012-06-13
2012-05-23 2012-06-17
2012-05-28 2012-06-19
Jan Janet
2012-05-01 2012-06-10
2012-05-15 2012-06-15
Jan Jason
2012-05-07 2012-06-06
2012-03-20 2012-06-08
2012-05-07 2012-06-12
2012-05-20 2012-06-16
Jan John
2012-03-02 2012-06-07
2012-05-15 2012-06-14
Jan Josh
2012-05-25 2012-06-18
2012-06-01 2012-06-20

All that remains is to flesh out those prints with your XML, and probably write to a file instead of printing.
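As a rough sketch of that last step, the grouped rows could be turned into edge elements with the standard library's xml.etree.ElementTree. Several things here are assumptions rather than anything stated in the question: the sample is read as whitespace-separated text, the helper name build_edges is made up for illustration, each edge's start and end are taken as the earliest start and latest end in its group, the edge id is simply a counter, and each attvalue's value is a running count of the rows seen so far. The weight="1" and for="0" attributes are copied from the example in the question.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# A few rows from the question, assumed to be whitespace-separated.
SAMPLE = """\
Jan John 2012-03-02 2012-06-07
Jan Jack 2012-03-26 2012-06-09
Jan John 2012-05-15 2012-06-14
Jan Jack 2012-05-04 2012-06-11
"""


def build_edges(df):
    """Return an <edges> element with one <edge> per (source, target) pair.

    Hypothetical helper: the attribute choices below are assumptions.
    """
    edges_el = ET.Element("edges")
    grouped = df.groupby(["source", "target"])
    for edge_id, ((source, target), rows) in enumerate(grouped):
        edge_el = ET.SubElement(edges_el, "edge", {
            "id": str(edge_id),                # assumption: a simple counter
            "source": str(source),
            "target": str(target),
            "start": str(rows["start"].min()), # assumption: earliest start in the group
            "end": str(rows["end"].max()),     # assumption: latest end in the group
            "weight": "1",
        })
        attvalues_el = ET.SubElement(edge_el, "attvalues")
        # One <attvalue> per parallel edge, numbered 1, 2, 3, ...
        for count, (_, row) in enumerate(rows.iterrows(), start=1):
            ET.SubElement(attvalues_el, "attvalue", {
                "for": "0",
                "value": str(count),
                "start": str(row["start"]),
                "end": str(row["end"]),
            })
    return edges_el


df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s+",
    names=["source", "target", "start", "end"],
)
edges_el = build_edges(df)
xml_text = ET.tostring(edges_el, encoding="unicode")
print(xml_text)
```

The ISO-formatted dates compare correctly as plain strings, which is why min() and max() work without parsing them. To write a file instead of printing, the element can be wrapped in ET.ElementTree(edges_el) and saved with its write() method; a complete .gexf file would also need the surrounding gexf, graph, and nodes elements, which this sketch leaves out.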
