2

I have the following two types of txt files:

File1

Sample1012, Male, 36, Stinky, Bad Hair
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me
Sample23905, Female, 42, Cougar, Long Hair, Chub
Sample123, Male, 32, Party Guy

File2

DEAD, Sample123, Car Accident, Drunk, Dumb
ALIVE, Sample1012, Alone
ALIVE, Sample23905, STD
DEAD, Sample1043, Too Hot, Exploded

I just want to write a simple Python script to join these files based on the sample field, but I keep running into a problem with the variable number of data columns. For instance, I end up with:

Sample1012, Male, 36, Stinky, Bad Hair, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, DEAD, Sample123, Car Accident, Drunk, Dumb

When what I want is:

Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb

I'm basically just reading in both files with .readlines() and then comparing the relevant column with the sample ID using a simple "=="; if it is true, it prints out the line from the first file and then the line from the second.

I'm not sure how to use len() to determine the maximum number of columns in File1, so that I can pad each shorter line up to that width before appending the matching line from the other file (provided the "==" is true).

Any help greatly appreciated.

UPDATE:

This is what I got now:

import sys
import csv

usage = "usage: python Integrator.py <table_file> <project_file> <outfile>"
if len(sys.argv) != 4:
    print usage
    sys.exit(0)

project = open(sys.argv[1], "rb")
table = open(sys.argv[2], "rb").readlines()
outfile = open(sys.argv[3], "w")

table[0] = "Total Table Output \n"

newtablefile = open(sys.argv[2], "w")
for line in table:
    newtablefile.write(line)

projectfile = csv.reader(project, delimiter="\t")
newtablefile = csv.reader(table, delimiter="\t")

result = []

for p in projectfile:
    print p
    for t in newtablefile:
        #print t
        if p[1].strip() == t[0].strip():
            del t[0]
            load = p + t
            result.append(load)


for line in result:
    outfile.write(line)

outfile.close()

I can't get the for loops to work together. Don't mind the dumb stuff at the top; one of the files has a blank first line.

4 Answers

1

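I'm not sure where the "EMPTY COLUMN" entries in your suggested output come from... If the columns are supposed to match a defined schema, then you have to have blank placeholders in the input files. Otherwise, this will work...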

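import csv


f1 = open("test1.txt", 'rb')
reader1 = csv.reader(f1)
f2 = open("test2.txt", 'rb')
# A csv.reader can only be iterated once, so read the second file into
# a list up front; otherwise the inner loop is empty after the first entry.
rows2 = list(csv.reader(f2))
result = []

for entry in reader1:
    for row in rows2:
        if entry[0].strip() == row[1].strip():
            # Drop the duplicate sample ID (column 1) without mutating rows2.
            load = entry + row[:1] + row[2:]
            result.append(load)

f1.close()
f2.close()

for line in result:
    print line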

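Edit -

If you need to skip a line in one of the files, you can simply call reader1.next() to advance the pointer to the next line of input.

Your example creates an output file, writes data to it, and then tries to read it without closing and reopening the file, or opening it as both readable and writable... I can't swear to it, but I think that may be your problem. Fortunately, with the .next() method you don't need to do any of that anyway.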
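To illustrate the two reader behaviours mentioned in this answer (skipping a line, and the fact that a csv reader can only be consumed once), here is a minimal sketch. It is written for Python 3, where reader1.next() is spelled next(reader1). The in-memory io.StringIO buffers and the names rows2 and joined are stand-ins chosen for the illustration, not part of the original scripts, and the data is assumed to be comma-separated.

```python
import csv
import io

# In-memory stand-ins for the two files (assumption: comma-separated,
# with a blank first line in the first one, as the question mentions).
file1 = io.StringIO("\nSample1012, Male, 36\nSample123, Male, 32\n")
file2 = io.StringIO("DEAD, Sample123, Car Accident\nALIVE, Sample1012, Alone\n")

# skipinitialspace=True drops the space that follows each comma.
reader1 = csv.reader(file1, skipinitialspace=True)
next(reader1)  # skip the blank first line

# A csv.reader is a one-pass iterator: materialise the second file once
# so the inner loop sees every row for every entry of the first file.
rows2 = list(csv.reader(file2, skipinitialspace=True))

joined = []
for entry in reader1:
    for row in rows2:
        if entry[0] == row[1]:
            joined.append(entry + row)

for line in joined:
    print(", ".join(line))
```

If the inner loop iterated over the reader object directly instead of rows2, only the first entry of file1 could ever find a match, because the reader would already be exhausted for every later entry.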

answered 2013-09-18T03:43:36.970
0

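Well, you should probably use an RDBMS for efficiency, but you can do this more neatly with dictionaries.

When you readline(), just split off everything before the first comma and use it as the key, with the list of fields as the value.

So something like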

{'Sample1012': ['Sample1012', 'Male', '36', 'Stinky', 'Bad Hair']}

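Now you can do the same for the other file (there the sample ID is the second field, so use that as the key).

Then, simply,

for key in dict1:
    dict1[key] += dict2.get(key, [])

This appends the corresponding fields from the second dictionary to each entry in the first.

It just makes your life easier.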
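As a rough sketch of the dictionary approach described above, assuming comma-separated lines like those in the question: the helper name index_by_key and the inline sample lists are made up for this illustration, and the code is written for Python 3.

```python
def index_by_key(lines, key_column):
    """Map the sample ID found in key_column to the full list of fields."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = [field.strip() for field in line.split(",")]
        index[fields[key_column]] = fields
    return index


# Inline stand-ins for the contents of the two files.
file1_lines = [
    "Sample1012, Male, 36, Stinky, Bad Hair",
    "Sample123, Male, 32, Party Guy",
]
file2_lines = [
    "DEAD, Sample123, Car Accident, Drunk, Dumb",
    "ALIVE, Sample1012, Alone",
]

dict1 = index_by_key(file1_lines, 0)  # sample ID is the first field
dict2 = index_by_key(file2_lines, 1)  # sample ID is the second field

# Append the matching file-2 fields; samples without a match stay as they are.
for key in dict1:
    dict1[key] += dict2.get(key, [])

print(dict1["Sample123"])
```

Each lookup in dict2 is a single hash access, so the join no longer needs a nested loop over both files. Note that this sketch does not pad the rows; it only shows the keyed lookup.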

answered 2013-09-18T03:25:27.413
0

with open('file1') as f1, open('file2') as f2:
    dic = {}
    #Store the data from file2 in a dictionary, with second column as key
    for line in f2:
        data = line.strip().split(', ')
        key = data[1]
        dic[key] = data
    #now iterate over each line in file1
    for line in f1:
        data = line.strip().split(', ')
        #pad to 7 columns (the widest row in file1); number of empty columns = `(7-len(data))`
        data = data + ['EMPTY COLUMN']*(7-len(data))
        print '{}, {}'.format(", ".join(data), ', '.join(dic[data[0]]))

Output:

Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb

answered 2013-09-18T03:29:57.587
0

You can read the whole file into a list of lists and then find the maximum number of fields with:

file1 = open("file1.txt")
# Strip the trailing newline so it does not end up inside the last field.
list1 = [s.rstrip("\n").split(",") for s in file1]
file1.close()
maxlen1 = max(len(x) for x in list1)

A dictionary is the best structure for looking things up in the second file:

file2 = open("file2.txt")
dict2 = { }
for line2 in file2:
    # Strip the trailing newline, then key each row on its second column.
    cols2 = line2.rstrip("\n").split(",")
    dict2[cols2[1]] = cols2
file2.close()

Now, if cols1 is any one of the column lists in list1, you can use:

cols3 = cols1 + (maxlen1 - len(cols1))*[" EMPTY COLUMN"] + dict2[cols1[0]]

...to create a list padded with " EMPTY COLUMN" values as needed. You can now turn it back into a single string with:

",".join(cols3)

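I haven't tried to trim the strings, so you'll end up with the same spaces after the commas as before. One small wrinkle is that there is no space before "DEAD", "ALIVE", etc. You can change that either when building dict2 or when pulling the values out to form cols3.

There is also no file I/O error handling. Snippets are snippets.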
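Putting the pieces above together, a minimal sketch might look like the following. It is written for Python 3 and deviates from the snippets in two labeled ways: it trims every field and rejoins with ", ", and it treats a sample missing from the second file as contributing no extra columns. The function name join_padded and its filler parameter are hypothetical names introduced for this illustration.

```python
def join_padded(lines1, lines2, filler="EMPTY COLUMN"):
    """Join rows on sample ID, padding file-1 rows to a common width."""
    # Split and trim every non-blank line of the first file.
    rows1 = [[c.strip() for c in s.split(",")] for s in lines1 if s.strip()]
    # len() of each row gives its column count; the widest row sets the target.
    maxlen1 = max((len(row) for row in rows1), default=0)

    # Index the second file by its second column (the sample ID).
    dict2 = {}
    for line2 in lines2:
        cols2 = [c.strip() for c in line2.split(",")]
        if len(cols2) > 1:  # skips blank or single-field lines
            dict2[cols2[1]] = cols2

    joined = []
    for cols1 in rows1:
        padding = [filler] * (maxlen1 - len(cols1))
        # Assumption: a sample missing from file 2 adds no extra columns.
        cols3 = cols1 + padding + dict2.get(cols1[0], [])
        joined.append(", ".join(cols3))
    return joined


file1_lines = [
    "Sample1012, Male, 36, Stinky, Bad Hair",
    "Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me",
    "Sample23905, Female, 42, Cougar, Long Hair, Chub",
    "Sample123, Male, 32, Party Guy",
]
file2_lines = [
    "DEAD, Sample123, Car Accident, Drunk, Dumb",
    "ALIVE, Sample1012, Alone",
    "ALIVE, Sample23905, STD",
    "DEAD, Sample1043, Too Hot, Exploded",
]

for line in join_padded(file1_lines, file2_lines):
    print(line)
```

Because open file objects iterate line by line, the two lists could be replaced by files opened with open(); that part, like error handling, is left out of the sketch.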

answered 2013-09-18T03:35:43.327