I have the following two types of txt files:
File1
Sample1012, Male, 36, Stinky, Bad Hair
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me
Sample23905, Female, 42, Cougar, Long Hair, Chub
Sample123, Male, 32, Party Guy
File2
DEAD, Sample123, Car Accident, Drunk, Dumb
ALIVE, Sample1012, Alone
ALIVE, Sample23905, STD
DEAD, Sample1043, Too Hot, Exploded
I just want to write a simple Python script to join these files on the sample field, but I keep running into a problem with the varying number of data columns. For instance, I end up with:
Sample1012, Male, 36, Stinky, Bad Hair, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, DEAD, Sample123, Car Accident, Drunk, Dumb
When what I want is:
Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb
I'm basically just reading in both files with .readlines() and then comparing the sample ID column of each line with a simple "==". If it is true, the script prints the line from the first file followed by the line from the second.
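The matching step can also be done by indexing File2 on the sample ID first, instead of comparing every pair of lines. This is only a minimal sketch in Python 3 syntax, assuming the fields are comma-separated as in the samples above and that each sample ID appears once per file; the names split_fields, status_by_sample and joined are made up for illustration.

```python
# Sample lines copied from File1 and File2 above.
file1_lines = [
    "Sample1012, Male, 36, Stinky, Bad Hair",
    "Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me",
    "Sample23905, Female, 42, Cougar, Long Hair, Chub",
    "Sample123, Male, 32, Party Guy",
]
file2_lines = [
    "DEAD, Sample123, Car Accident, Drunk, Dumb",
    "ALIVE, Sample1012, Alone",
    "ALIVE, Sample23905, STD",
    "DEAD, Sample1043, Too Hot, Exploded",
]


def split_fields(line):
    """Split one line on commas and strip whitespace around each field."""
    return [field.strip() for field in line.split(",")]


# Index File2 by sample ID, which is its second field.
status_by_sample = {}
for line in file2_lines:
    fields = split_fields(line)
    status_by_sample[fields[1]] = fields

# Walk File1 in order and append the matching File2 fields, if any.
joined = []
for line in file1_lines:
    fields = split_fields(line)
    match = status_by_sample.get(fields[0])
    if match is not None:
        joined.append(fields + match)

for row in joined:
    print(", ".join(row))
```

This reproduces the unpadded output shown above; it does not yet add the empty columns.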
I'm not sure how to use len() to determine the maximum number of columns in file1, so that I can pad any line that is shorter than that maximum before appending the line from the other file (provided the "==" is true).
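To make the len() idea concrete, here is a rough sketch in Python 3 syntax of the padding step on its own. It assumes each line has already been split into a list of fields; pad_rows and the filler parameter are hypothetical names, not anything from an existing library.

```python
# Rows from File1, already split into fields (sample data from above).
rows = [
    ["Sample1012", "Male", "36", "Stinky", "Bad Hair"],
    ["Sample1043", "Female", "28", "Hot", "Short Hair", "Hot Body", "Hates Me"],
    ["Sample23905", "Female", "42", "Cougar", "Long Hair", "Chub"],
    ["Sample123", "Male", "32", "Party Guy"],
]


def pad_rows(rows, filler="EMPTY COLUMN"):
    """Return copies of the rows, each padded with filler to the widest row."""
    # max() over the row lengths gives the maximum number of columns;
    # default=0 keeps this from failing on an empty input.
    width = max((len(row) for row in rows), default=0)
    return [row + [filler] * (width - len(row)) for row in rows]


padded = pad_rows(rows)
for row in padded:
    print(", ".join(row))
```

Every padded row then has the same number of columns, so whatever gets appended from the other file starts in the same column on every line.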
Any help greatly appreciated.
UPDATE:
This is what I have now:
import sys
import csv

usage = "usage: python Integrator.py <table_file> <project_file> <outfile>"
if len(sys.argv) != 4:
    print usage
    sys.exit(0)

project = open(sys.argv[1], "rb")
table = open(sys.argv[2], "rb").readlines()
outfile = open(sys.argv[3], "w")

table[0] = "Total Table Output \n"
newtablefile = open(sys.argv[2], "w")
for line in table:
    newtablefile.write(line)

projectfile = csv.reader(project, delimiter="\t")
newtablefile = csv.reader(table, delimiter="\t")

result = []
for p in projectfile:
    print p
    for t in newtablefile:
        #print t
        if p[1].strip() == t[0].strip():
            del t[0]
            load = p + t
            result.append(load)

for line in result:
    outfile.write(line)
outfile.close()
I can't get the for loops to work together. Don't mind the dumb stuff at the top; one of the files has a blank first line.
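One likely reason the nested loops misbehave: a csv.reader is an iterator, so the inner loop uses it up on the first pass of the outer loop and every later pass sees nothing. Below is a hedged sketch in Python 3 syntax that reads both inputs into lists first and looks rows up in a dict. It assumes tab-delimited input with the sample ID in column 1 of the project rows and column 0 of the table rows, as in the comparison above; the in-memory strings, read_rows and table_by_sample are stand-ins invented for illustration.

```python
import csv
import io

# Hypothetical tab-delimited stand-ins for the two input files.
# The table text starts with a blank line, like one of the real files.
project_text = (
    "DEAD\tSample123\tCar Accident\tDrunk\tDumb\n"
    "ALIVE\tSample1012\tAlone\n"
)
table_text = (
    "\n"
    "Sample1012\tMale\t36\tStinky\tBad Hair\n"
    "Sample123\tMale\t32\tParty Guy\n"
)


def read_rows(handle):
    """Read all tab-delimited rows into a list, skipping blank lines."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


project_rows = read_rows(io.StringIO(project_text))
table_rows = read_rows(io.StringIO(table_text))

# Index the table by sample ID, keeping the remaining columns as the value.
table_by_sample = {row[0].strip(): row[1:] for row in table_rows}

# Widest project row, used to pad shorter rows before joining.
width = max((len(row) for row in project_rows), default=0)

result = []
for p in project_rows:
    rest = table_by_sample.get(p[1].strip())
    if rest is not None:
        padded = p + ["EMPTY COLUMN"] * (width - len(p))
        result.append(padded + rest)

# csv.writer turns each list back into one delimited line.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(result)
print(out.getvalue())
```

With real files, the io.StringIO objects would be replaced by open file handles. Writing through csv.writer also avoids passing a list to outfile.write(), which expects a string.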