
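How can we parse data from a TSV file based on column index? Once the data has been read from the file, we need to compare column 0 of row 1 with column 0 of row 2. If they match, we take the column 1 value of row 1 and append to it the column 1 values of all matching rows.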

For example, the file SystemType.tsv:

Actrius  1990s drama films 
Actrius  Catalan language films 
Actrius  Spanish films 
Actrius  Barcelona in fiction 
Actrius  Films directed by Ventura Pons 
Actrius  1996 films 
An_American_in_Paris     Compositions by George Gershwin 
An_American_in_Paris     Symphonic poems 
An_American_in_Paris     Grammy Hall of Fame Award recipients 

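Row 1 has "Actrius" in column 0, so we need to compare it against column 0 of every row and join the column 1 values of the matching rows, separated by commas, as shown below.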

Output:

Actrius   1990s drama films,Catalan language films,Spanish films,Barcelona in fiction,Films directed by Ventura Pons,1996 films
An_American_in_Paris    Compositions by George Gershwin,Symphonic poems,Grammy Hall of Fame Award recipients

I have tried this, but it does not work for me.

def finalextract():
    lines_seen = set()
    outfile = open("Output.txt","w+")
    infile = open("SystemType.tsv","r+")
    for line in infile:
        if line[0] == lines_seen[0]:
            string = line[1]+','+lines_seen[1]
            outfile.write(string)
            lines_seen.add(string)
    infile.close()
    outfile.close()


1 Answer


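Here is what I came up with (Python 3, but I think the only difference from Python 2 should be the print function; on Python 2, add from __future__ import print_function. You can also use print to write to your output file if you like):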

import collections

# I used variable "input" to hold the string from your example .tsv contents;
# you'd really want to read it in from a file.

D = collections.OrderedDict()
for line in input.splitlines():
    key, value = line.split('\t')
    if key not in D:
        D[key] = []
    D[key].append(value.strip())

for key, values in D.items():
    print(key, ','.join(values), sep='\t')

My output is:

Actrius 1990s drama films,Catalan language films,Spanish films,Barcelona in fiction,Films directed by Ventura Pons,1996 films
An_American_in_Paris    Compositions by George Gershwin,Symphonic poems,Grammy Hall of Fame Award recipients
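The snippet above works on a string held in memory. As a rough sketch of the same grouping idea applied to files, something like the following might do. The function name group_tsv is hypothetical (it is not from the question), and the sketch assumes Python 3, UTF-8 text, real tab characters between the two columns, and that lines without a second column can be skipped.

```python
from collections import OrderedDict


def group_tsv(in_path, out_path):
    """Group column 1 values by their column 0 key, keeping first-seen key order."""
    groups = OrderedDict()
    with open(in_path, encoding="utf-8") as infile:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # assumption: blank or one-column lines are ignored
            key, value = fields[0].strip(), fields[1].strip()
            groups.setdefault(key, []).append(value)

    # One output line per key: key, a tab, then the comma-joined values.
    with open(out_path, "w", encoding="utf-8") as outfile:
        for key, values in groups.items():
            outfile.write(key + "\t" + ",".join(values) + "\n")
    return groups
```

Calling group_tsv("SystemType.tsv", "Output.txt") with the file names from the question should produce one line per distinct column 0 value. Collecting everything in a dictionary first, rather than comparing each line with the previous one, means matching rows do not have to be adjacent in the input.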
answered 2013-06-11T06:08:06.190