I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to pull out specific lines. Luckily, the string I am searching for is always in the first column.
The whole thing works, and memory is not a problem (I'm not loading the file into memory, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run!
The data looks something like this:
scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold126 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
and I am searching for all the lines whose first column matches a particular string. I want to process those lines and send a summary to an output file. Then I search all the lines again for another string, and so on...
I am using something like this:
import sys

for thisScaff in AllScaffs:
    # Re-open and scan the whole file for every scaffold
    InFile = open(sys.argv[2], 'r')
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            pass  # Then do this stuff...
    InFile.close()
The main problem seems to be that all 800 million lines have to be looked through to find the ones that match the current string, and then once I move on to the next string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?
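One idea I have been toying with (I am not sure it is the right approach) is to read the file only once and group the work by scaffold as I go. A rough sketch of what I mean, with a made-up small AllScaffs set and a placeholder summary step:

import sys
from collections import defaultdict

# Made-up small set just for this sketch; in my real script AllScaffs is much bigger.
AllScaffs = {"scaffold126", "scaffold5112"}

# Running per-scaffold results (here I just collect the matching lines, but it could
# be a running count/summary instead, to keep memory down).
summaries = defaultdict(list)

# Single pass over the big file: dispatch each line to its scaffold.
InFile = open(sys.argv[2], 'r')
for line in InFile:
    fields = line.split(None, 1)          # only split off the first column
    if fields and fields[0] in AllScaffs:
        summaries[fields[0]].append(line)
InFile.close()

# Then summarise each scaffold once, instead of rescanning the file per scaffold.
for scaff, lines in summaries.items():
    print("%s\t%d" % (scaff, len(lines)))  # placeholder for my real summary/output

That way the 800 million lines only get read once, but I do not know whether accumulating results per scaffold like this is sensible for data this size, or whether there is a better standard trick.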
Many thanks in advance!